From thomas.bub at thomson.net  Fri Dec  1 01:12:14 2006
From: thomas.bub at thomson.net (Bub Thomas)
Date: Fri, 1 Dec 2006 10:12:14 +0100
Subject: [openib-general] Is an umad_close_port a good idea after I
 disconnect from the SA with osm_vendor_delete ?
Message-ID: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>

Sasha,
I'm having trouble getting the patch applied.
I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
path, but after running the ofed-install script the sources in
/usr/local/ofed didn't contain that patch anymore.
Can you help me out of the dark and tell me how to build
libvendor.so from the ofed-1.1/SOURCES tree?
Thanks
Thomas


> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Monday, November 27, 2006 5:43 PM
> To: Bub Thomas
> Cc: Tziporet Koren; openib-general at openib.org; Erez Cohen
> Subject: Re: [openib-general] Is an umad_close_port a good idea after I
> disconnect from the SA with osm_vendor_delete ?
> 
> On 14:13 Mon 27 Nov     , Bub Thomas wrote:
> >
> > Sasha,
> > whom to ask to add this to the osm_vendor functions?
> 
> Please test this patch:
> 
> diff --git a/osm/libvendor/osm_vendor_ibumad.c b/osm/libvendor/osm_vendor_ibumad.c
> index e82695f..4205b23 100644
> --- a/osm/libvendor/osm_vendor_ibumad.c
> +++ b/osm/libvendor/osm_vendor_ibumad.c
> @@ -545,10 +545,15 @@ osm_vendor_delete(
>  	umad_receiver_t *p_ur;
>  	int agent_id;
> 
> -	/* unregister UMAD agents */
> -	for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> -		if ( (*pp_vend)->agents[agent_id] )
> -			umad_unregister( (*pp_vend)->umad_port_id, agent_id );
> +	if ((*pp_vend)->umad_port_id >= 0) {
> +		/* unregister UMAD agents */
> +		for (agent_id = 0; agent_id < UMAD_CA_MAX_AGENTS; agent_id++)
> +			if ( (*pp_vend)->agents[agent_id] )
> +				umad_unregister((*pp_vend)->umad_port_id,
> +						agent_id );
> +		umad_close_port((*pp_vend)->umad_port_id);
> +		(*pp_vend)->umad_port_id = -1;
> +	}
> 
>  	clear_madw( *pp_vend );
>  	/* make sure all ports are closed */
> 
> 
> > Or should I file a bug for this
> 
> Good idea too.
> 
> Sasha


From dotanb at dev.mellanox.co.il  Fri Dec  1 01:20:52 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 1 Dec 2006 11:20:52 +0200 (IST)
Subject: [openib-general] QP creation failure
In-Reply-To: <456F7239.4060104@systemfabricworks.com>
References: <456F7239.4060104@systemfabricworks.com>
Message-ID: <2340.85.65.224.142.1164964852.squirrel@dev.mellanox.co.il>

Hi.


> Hi,
> I'm hoping someone here can help me diagnose this problem.
> I have a really simple test app that uses verbs and is failing to create
> a QP on one machine in particular.  On other machines the app works and
> behaves as expected without any problems.
>
> The machine in question is a 32-bit dual-CPU Intel system running FC4 and
> the released OFED 1.1 with a Mellanox PCI-X HCA (MT23108).
> [root at localhost test]# uname -a
> Linux localhost.localdomain 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2
> 23:08:39 EDT 2005 i686 i686 i386 GNU/Linux
> [root at localhost test]# cat /usr/local/ofed/BUILD_ID
> OFED-1.1
>
> openib-1.1 (REV=9905)
> # User space
> https://openib.org/svn/gen2/branches/1.1/src/userspace
> Git:
> ref: refs/heads/ofed_1_1
> commit a083ec1174cb4b5a5052ef5de9a8175df82e864a
>
> The code in question is pretty simple and as I've said works everywhere
> else I've tried it.
>
> Errno is set to 22, and I've traced the problem to this point in the
> OFED stack, so I can see where it fails but still have no idea why:
> It fails at line 578 in "src/userspace/libibverbs/src/cmd.c" the
> instruction is
> 'write(pd->context->cmd_fd, cmd, cmd_size) != cmd_size'
> cmd_fd looked valid (was 6), cmd looked to point to a valid structure,
> and cmd_size was 96.
>
> This was called from line 533 of src/userspace/libmthca/src/verbs.c
> 'ibv_cmd_create_qp(pd, &qp->ibv_qp, attr, &cmd.ibv_cmd, sizeof cmd, &resp,
> sizeof resp);'
>
> Which was invoked by my code calling ibv_create_qp as seen below:
>
> <snip>
>  /* create the qpairs */
>                         init_attr.send_cq = info->cq_hndl;
>                         init_attr.recv_cq = info->cq_hndl;
>                         init_attr.cap.max_send_wr  = info->oust_wr_sq; //8
>                         init_attr.cap.max_recv_wr  = info->oust_wr_rq; //8
>                         init_attr.cap.max_send_sge = info->sg_size_sq; //1
>                         init_attr.cap.max_recv_sge = info->sg_size_rq; //1
>                         init_attr.cap.max_inline_data = 1024;
>                         init_attr.qp_type = IBV_QPT_RC;
>
>                         if ((info->qp_hndl[CLIENT] =
> ibv_create_qp(info->pd_hndl, &init_attr)) == NULL) {
>
>                                 info->failed = 1;
>                                 rc = ERR_INIT_HCA_FAILED;
>
>                         }
> </snip>
>
>
> Any ideas or pointers in the right direction would be greatly appreciated.

I think the problem is the amount of inline data that you are requesting.
I suggest that you set it to 0, create the QP, check the values that are
returned from the QP creation, and use those.

I believe that the maximum size that can be used in this attribute is ~420.
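
A minimal sketch of the probe-and-fall-back idea suggested above. It does not
call libibverbs; `struct qp_caps`, `create_qp_fn`, `struct fake_hca` and
`fake_create_qp` are hypothetical stand-ins so the logic can be compiled
without an HCA. The one assumption about the real API is that ibv_create_qp()
writes the granted capabilities back into `init_attr.cap`, which is how the
ibv_create_qp man page describes it. The 420-byte limit is only the estimate
quoted above, not a measured value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the cap fields of struct ibv_qp_init_attr. */
struct qp_caps {
    uint32_t max_inline_data;
};

/* Hypothetical creation hook: returns 0 on success and writes the granted
 * capabilities back into *caps; returns non-zero on failure. */
typedef int (*create_qp_fn)(struct qp_caps *caps, void *ctx);

/* Try the wanted inline size first; if creation fails, retry with 0 and
 * report whatever the provider says it granted. */
int create_with_inline_fallback(create_qp_fn create, void *ctx,
                                uint32_t wanted, uint32_t *granted)
{
    struct qp_caps caps = { .max_inline_data = wanted };

    if (create(&caps, ctx) != 0) {
        caps.max_inline_data = 0;       /* ask for no inline data at all */
        if (create(&caps, ctx) != 0)
            return -1;                  /* the failure has another cause */
    }
    *granted = caps.max_inline_data;    /* value to use for later sends */
    return 0;
}

/* Toy provider used only to exercise the logic above. */
struct fake_hca {
    uint32_t inline_limit;  /* largest inline size this toy accepts */
    int calls;              /* number of creation attempts seen */
};

int fake_create_qp(struct qp_caps *caps, void *ctx)
{
    struct fake_hca *hca = ctx;

    hca->calls++;
    if (caps->max_inline_data > hca->inline_limit)
        return -1;                              /* request too large */
    caps->max_inline_data = hca->inline_limit;  /* report granted size */
    return 0;
}
```

With a toy limit of 420, a request for 1024 fails on the first attempt and
succeeds on the retry, reporting 420 as the usable inline size. A real
provider may report any value greater than or equal to the request.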


Dotan


From dotanb at dev.mellanox.co.il  Fri Dec  1 01:27:41 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 1 Dec 2006 11:27:41 +0200 (IST)
Subject: [openib-general] Segmentation fault on ib_read_bw
In-Reply-To: <d2ad857f0611301639p4f59011dj5897186ae80807ae@mail.gmail.com>
References: <d2ad857f0611301639p4f59011dj5897186ae80807ae@mail.gmail.com>
Message-ID: <2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il>

Hi.

> Hi,
>
> I'm using the openib gen2 trunk and was running the performance tests
> from that tree.
> I get a "Segmentation Fault" on running ib_read_bw and the remaining
> tests.
> The output is as follows:
> ------------------------------------------------------------------
>                     RDMA_Read BW Test
> Connection type : RC
> Segmentation fault
>
> Any particular reason why this is happening?

Can you give some more info, such as:

which driver git/svn version are you using?
which parameters did you use on each side?
which distro are you using?
which computer arch are you using?

thanks
Dotan


From or.gerlitz at gmail.com  Fri Dec  1 05:36:41 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 1 Dec 2006 15:36:41 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1164918691.14800.101.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
Message-ID: <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>

On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
> > So what did you change since v1?  How do you deal with fitting 64-bit
> > addresses into an sg list entry that has a 32-bit dma_addr_t?

> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store
> anything in the struct scatterlist.  The translation is
> done when ipath_sg_dma_address() is called which now
> returns u64 instead of dma_addr_t thus avoiding the truncation
> problem.

And there is this open TODO of calling kmap(page) at DMA mapping time
(or when ipath_sg_dma_address is called) and kunmap(page) at DMA
unmapping time, where you must store the kvaddr between the two calls,
and the sg does not have room for it where dma_addr_t is u32 and
kvaddr is u64 ....

> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
> and ib_sg_dma_address() have been modified to save the address
> in a u64 instead of a dma_addr_t.  This actually wasn't much
> of a change since the address was being cast to u64 anyway
> when assigned to struct sge.addr.

It fixes a bug, so it actually is somewhat of a change. Without it,
on an arch as mentioned above, ipath_dma_map_single would return only a
u32 portion of the kvaddr, and later the ULP code would place this
chopped address in sge.addr and the ipath driver would use the wrong
address.
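
The truncation described above can be sketched in a few lines of plain C.
This is an illustration only, not the ipath code: `narrow_dma_addr_t` is a
hypothetical stand-in for a 32-bit dma_addr_t, and both function names are
made up for the example.

```c
#include <stdint.h>

/* Hypothetical stand-in for dma_addr_t on an arch where it is 32 bits. */
typedef uint32_t narrow_dma_addr_t;

/* Old pattern: the 64-bit kernel virtual address passes through a
 * 32-bit variable before it reaches the 64-bit sge.addr field. */
uint64_t sge_addr_via_dma_addr(uint64_t kvaddr)
{
    narrow_dma_addr_t tmp = (narrow_dma_addr_t)kvaddr; /* upper 32 bits lost */
    return (uint64_t)tmp;
}

/* New pattern: the address stays in a u64 from mapping to sge.addr. */
uint64_t sge_addr_via_u64(uint64_t kvaddr)
{
    return kvaddr;
}
```

For a kvaddr such as 0xffff810012345678 the first function yields 0x12345678,
which is the chopped address the driver would then dereference; the second
returns the address unchanged.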

Or.


From sashak at voltaire.com  Fri Dec  1 06:19:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Dec 2006 16:19:01 +0200
Subject: [openib-general] Is an umad_close_port a good idea after I
 disconnect from the SA with osm_vendor_delete ?
In-Reply-To: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>
References: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>
Message-ID: <20061201141901.GC23574@sashak.voltaire.com>

Hi Thomas,

On 10:12 Fri 01 Dec     , Bub Thomas wrote:
> Sasha,
> I'm having trouble getting the patch applied.
> I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
> path, but after running the ofed-install script the sources in
> /usr/local/ofed didn't contain that patch anymore.
> Can you help me out of the dark and tell me how to build
> libvendor.so from the ofed-1.1/SOURCES tree?

Never did it personally, but you may want to look at
https://openib.org/tiki/tiki-index.php?page=OFED+Support
for how ofed_patch.sh does this.

And you can use svn or git versions of management/osm as well.

Sasha



From halr at voltaire.com  Fri Dec  1 06:27:16 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 09:27:16 -0500
Subject: [openib-general] Is an umad_close_port a good idea after I
 disconnect from the SA with osm_vendor_delete ?
In-Reply-To: <20061201141901.GC23574@sashak.voltaire.com>
References: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>
	<20061201141901.GC23574@sashak.voltaire.com>
Message-ID: <1164983140.11808.177662.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> Hi Thomas,
> 
> On 10:12 Fri 01 Dec     , Bub Thomas wrote:
> > Sasha,
> > I'm having trouble getting the patch applied.
> > I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
> > path, but after running the ofed-install script the sources in
> > /usr/local/ofed didn't contain that patch anymore.
> > Can you help me out of the dark and tell me how to build
> > libvendor.so from the ofed-1.1/SOURCES tree?
> 
> Never did it personally, but you may want to look at
> https://openib.org/tiki/tiki-index.php?page=OFED+Support
> for how ofed_patch.sh does this.
> 
> And you can use svn or git versions of management/osm as well.

There's currently no git version of OFED 1.1 OpenSM AFAIK.

-- Hal



From swise at opengridcomputing.com  Fri Dec  1 06:35:28 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 01 Dec 2006 08:35:28 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov>
References: <754FC8FE0A97A94B906344259F447D4A0413F811@ES23SNLNT.srn.sandia.gov>
	<456E991C.4040907@dev.mellanox.co.il>
	<CE1EA5D4-D3B5-42E7-B168-6ECD43018852@cisco.com>
	<1164904991.7247.44.camel@trinity.ogc.int>
	<3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com>
	<936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com>
	<1164911424.11779.46.camel@stevo-desktop>
	<E3BBE216-C1AA-4CFD-88F9-C63D95779BE5@cisco.com>
	<1164916697.11779.84.camel@stevo-desktop>
	<5251E729-5FC0-48B5-9399-0C9466F8A2A2@cisco.com>
	<1164917426.11779.87.camel@stevo-desktop>
	<754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov>
Message-ID: <1164983728.6872.5.camel@stevo-desktop>

On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
> 
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now.  But how do I compile the userspace packages from that
> stack? 
> 
You build and install the userspace libraries from the iwarp stable
branch.  This will install all the needed header files to build other
packages that depend on them.  Like mvapich2-0.9.8, for instance.

If rping is working for you, then you've already done this.  The user
libs and header files are all installed in /usr/local by default.  If
you have /usr/local/include/rdma/rdma_cma.h, for instance, you've
probably already installed the userspace stuff from the iwarp stable
branch.

To build and install the user libs from the iwarp branch, please see the
wiki howto.  There is a section describing installing the userspace
libraries.

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Hope this helps...


Steve.


From swise at opengridcomputing.com  Fri Dec  1 06:40:00 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 01 Dec 2006 08:40:00 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1164949057.19459.11.camel@localhost>
References: <754FC8FE0A97A94B906344259F447D4A0413F811@ES23SNLNT.srn.sandia.gov>
	<456E991C.4040907@dev.mellanox.co.il>
	<CE1EA5D4-D3B5-42E7-B168-6ECD43018852@cisco.com>
	<1164904991.7247.44.camel@trinity.ogc.int>
	<3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com>
	<936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com>
	<1164911424.11779.46.camel@stevo-desktop>
	<1164949057.19459.11.camel@localhost>
Message-ID: <1164984000.6872.10.camel@stevo-desktop>

On Thu, 2006-11-30 at 20:57 -0800, Matt Leininger wrote:
> On Thu, 2006-11-30 at 12:30 -0600, Steve Wise wrote:
> > On Thu, 2006-11-30 at 12:12 -0500, Jeff Squyres wrote:
> > > It just clicked in my brain as to why you were asking this question.
> > > 
> > > Remember that OMPI currently does not use any CM for OF connections  
> > > at all.  So it's not like it's using the old CM that doesn't support  
> > > iWARP.  OMPI uses its own out-of-band mechanism, which, as I  
> > > understand it, should work with iWARP just as well as it works for IB.
> > > 
> > > Am I incorrect in thinking that?  (I have no iWARP hardware to test  
> > > with)
> > 
> > iWARP _requires_ the RDMA-CM for connection setup...
> > 
> > So OMPI as it stands today won't work over iwarp devices.
> > 
> > Right now, the only non-uDAPL MPI solution that will work with the iwarp
> > stable svn branch + 2.6.17 RNFS is MVAPICH2.
> > 
> > If you utilize uDAPL, then Intel and HP have MPI libs that might work...
> 
>   OMPI also has a uDAPL network device (along with a device that uses
> verbs directly).  So if we just use OMPI uDAPL it should work over
> iWarp?
> 

It should.  You might have to tweak OMPI slightly to work with uDAPL
from the iWARP branch.  Or take the latest uDAPL and back-port it to the
iwarp branch. 

Steve.


From halr at voltaire.com  Fri Dec  1 07:18:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 10:18:40 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sm.c: In
 osm_sm_mcgrp_join, use CL_PLOCK_RELEASE macro
Message-ID: <1164986311.11808.179322.camel@hal.voltaire.com>

OpenSM/osm_sm.c: In osm_sm_mcgrp_join, use CL_PLOCK_RELEASE macro
rather than calling cl_plock_release directly

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c
index 9aa4a36..100f2a0 100644
--- a/osm/opensm/osm_sm.c
+++ b/osm/opensm/osm_sm.c
@@ -740,7 +740,7 @@ osm_sm_mcgrp_join(
    status = osm_port_add_mgrp( p_port, mlid );
    if( status != IB_SUCCESS )
    {
-      cl_plock_release( p_sm->p_lock );
+      CL_PLOCK_RELEASE( p_sm->p_lock );
       osm_log( p_sm->p_log, OSM_LOG_ERROR,
                "osm_sm_mcgrp_join: ERR 2E03: "
                "Unable to associate port 0x%" PRIx64 " to mlid 0x%X\n",


From sashak at voltaire.com  Fri Dec  1 07:30:55 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 1 Dec 2006 17:30:55 +0200
Subject: [openib-general] Is an umad_close_port a good idea after I
 disconnect from the SA with osm_vendor_delete ?
In-Reply-To: <1164983140.11808.177662.camel@hal.voltaire.com>
References: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>
	<20061201141901.GC23574@sashak.voltaire.com>
	<1164983140.11808.177662.camel@hal.voltaire.com>
Message-ID: <20061201153055.GE23574@sashak.voltaire.com>

On 09:27 Fri 01 Dec     , Hal Rosenstock wrote:
> On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> > Hi Thomas,
> > 
> > On 10:12 Fri 01 Dec     , Bub Thomas wrote:
> > > Sasha,
> > > I'm having trouble getting the patch applied.
> > > I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
> > > path, but after running the ofed-install script the sources in
> > > /usr/local/ofed didn't contain that patch anymore.
> > > Can you help me out of the dark and tell me how to build
> > > libvendor.so from the ofed-1.1/SOURCES tree?
> > 
> > Never did it personally, but you may want to look at
> > https://openib.org/tiki/tiki-index.php?page=OFED+Support
> > for how ofed_patch.sh does this.
> > 
> > And you can use svn or git versions of management/osm as well.
> 
> There's currently no git version of OFED 1.1 OpenSM AFAIK.

What about 1.1 git branch? This is same as SVN's 1.1. :)

Sasha


From Arkady.Kanevsky at netapp.com  Fri Dec  1 07:29:41 2006
From: Arkady.Kanevsky at netapp.com (Kanevsky, Arkady)
Date: Fri, 1 Dec 2006 10:29:41 -0500
Subject: [openib-general] [openfabrics-iwg] OFED 1.2 contents and
 schedule as proposed by the EWG
Message-ID: <C98692FD98048C41885E0B0FACD9DFB803531AD4@exnane01.hq.netapp.com>

What about iWARP support?

Arkady Kanevsky                       email: arkady at netapp.com
Network Appliance Inc.               phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.        Fax: 781-895-1195
Waltham, MA 02451                   central phone: 781-768-5300
 

> -----Original Message-----
> From: Bill Boas [mailto:bboas at systemfabricworks.com] 
> Sent: Thursday, November 30, 2006 2:06 PM
> To: 'OPENIB'; openib-promoters at openib.org; 
> openfabrics-iwg at openfabrics.org
> Cc: 'Tziporet Koren'; 'Jeff Squyres'; 'EWG'
> Subject: [openfabrics-iwg] OFED 1.2 contents and schedule as 
> proposed by the EWG
> 
> Following the Developer Summit discussions in Tampa the EWG 
> is proposing the contents and schedule for OFED 1.2 as 
> described on their wiki
> 
> https://openib.org/tiki/tiki-index.php?page=OFED+release+procedure
> 
> Many members of the OpenFabrics Board could not be present at 
> the summit and many members of the OpenFabrics community were 
> also not present.
> 
> Also the IWG is planning for its next Interoperability Test 
> Event after which it is probable that the OpenFabrics Logo 
> program should be in effect.
> 
> Please review this proposal from the EWG carefully to ensure that if:-
> 
> 1) you represent your company in the OpenFabrics community 
> that your company's product needs in the spring and early 
> summer of 2007 will be met by OFED 1.2 as proposed;
> 
> 2) you are a customer or end user that may wish to deploy 
> OFED 1.2 after its release and distribution that it looks 
> like it will contain what you need for your installations by then;
> 
> 3) you are working for a Linux distribution then the 
> schedule, process and testing  planned by the EWG and the IWG 
> meet your requirements and schedule;
> 
> 4) your interests do not align with the 3 identified above 
> but you are also planning to use OFED 1.2 please speak up and 
> give the community feedback.
> 
> Any other feedback or comments are welcome.
> 
> In my role in the Alliance I'd like to thank Tziporet, Jeff, 
> Nimrod, Aviram, Bob, Hal, Sean, Tom, Or, Betsy, Roland, 
> (please forgive me if I left out your name) and everyone who 
> has been working in the EWG for their tremendous individual 
> contributions to the Alliance and kernel software.
> 
> Bill Boas
> VP, Business Development | System Fabric Works 
> bboas at systemfabricworks.com | 510-375-8840
> 
> 
> -----Original Message-----
> From: openfabrics-ewg-bounces at openib.org
> [mailto:openfabrics-ewg-bounces at openib.org] On Behalf Of 
> Tziporet Koren
> Sent: Thursday, November 30, 2006 6:06 AM
> To: EWG
> Cc: OPENIB
> Subject: [openfabrics-ewg] reminder: OFED 1.2 meeting next Monday
> 
> Hi All,
> I wish to remind all that we have the EWG meeting on Monday 
> 4-Dec at 9am-10am.
> Jeff already sent all details.
> 
> Agenda: close OFED 1.2 features after each owner approves that 
> the schedule can be met (meaning code complete at the end of January)
> 
> See also
> https://openib.org/tiki/tiki-index.php?page=OFED+release+procedure
> for details on the features.
> 
> Tziporet


From adit.262 at gmail.com  Fri Dec  1 07:35:38 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Fri, 1 Dec 2006 10:35:38 -0500
Subject: [openib-general] Segmentation fault on ib_read_bw
In-Reply-To: <2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il>
References: <d2ad857f0611301639p4f59011dj5897186ae80807ae@mail.gmail.com>
	<2462.85.65.224.142.1164965261.squirrel@dev.mellanox.co.il>
Message-ID: <d2ad857f0612010735g6d279ac8ta77ea73c6c8fe6fb@mail.gmail.com>

I managed to get the test working .. I just restarted the server and
it was working..
I'm actually doing some work with the Xen VMM and InfiniBand..
I have setup 2 servers (Pentium D - x86_64 arch) with red hat
enterprise linux 4 and Xen 3 VMM running .. The IB driver seems to be
working on dom0 in any case and I can do all of the perf tests.
I wanted to know if there was any correlation between the QoS setup
done using openSM and the perf tests i.e. if I configure QoS in
opensm.opts should I be seeing marked differences in the BW from the
perf tests?
Is there any kind of documentation that gives an idea how the BW can
change for diff QoS params?

Regards,
Adit

On 12/1/06, dotanb at dev.mellanox.co.il <dotanb at dev.mellanox.co.il> wrote:
> Hi.
>
> > Hi,
> >
> > I'm using the openib gen2 trunk and was running the performance tests
> > from that tree.
> > I get a "Segmentation Fault" on running ib_read_bw and the remaining
> > tests.
> > The output is as follows:
> > ------------------------------------------------------------------
> >                     RDMA_Read BW Test
> > Connection type : RC
> > Segmentation fault
> >
> > Any particular reason why this is happening?
>
> Can you give some more info, such as:
>
> which driver git/svn version are you using?
> which parameters did you use on each side?
> which distro are you using?
> which computer arch are you using?
>
> thanks
> Dotan
>
>


-- 
Adit Ranadive
Freshman,
Georgia Institute of Technology,
Atlanta, GA


From halr at voltaire.com  Fri Dec  1 07:41:53 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 10:41:53 -0500
Subject: [openib-general] Is an umad_close_port a good idea after I
 disconnect from the SA with osm_vendor_delete ?
In-Reply-To: <20061201153055.GE23574@sashak.voltaire.com>
References: <B79FAF8BB536314E859EA1963CFFD222029AC57C@wdtssmail01.eu.thmulti.com>
	<20061201141901.GC23574@sashak.voltaire.com>
	<1164983140.11808.177662.camel@hal.voltaire.com>
	<20061201153055.GE23574@sashak.voltaire.com>
Message-ID: <1164987703.11808.180039.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 10:30, Sasha Khapyorsky wrote:
> On 09:27 Fri 01 Dec     , Hal Rosenstock wrote:
> > On Fri, 2006-12-01 at 09:19, Sasha Khapyorsky wrote:
> > > Hi Thomas,
> > > 
> > > On 10:12 Fri 01 Dec     , Bub Thomas wrote:
> > > > Sasha,
> > > > I'm having trouble getting the patch applied.
> > > > I patched the source file in the ofed-1.1 distribution and tgz'ed the SOURCE
> > > > path, but after running the ofed-install script the sources in
> > > > /usr/local/ofed didn't contain that patch anymore.
> > > > Can you help me out of the dark and tell me how to build
> > > > libvendor.so from the ofed-1.1/SOURCES tree?
> > > 
> > > Never did it personally, but you may want to look at
> > > https://openib.org/tiki/tiki-index.php?page=OFED+Support
> > > for how ofed_patch.sh does this.
> > > 
> > > And you can use svn or git versions of management/osm as well.
> > 
> > There's currently no git version of OFED 1.1 OpenSM AFAIK.
> 
> What about 1.1 git branch? This is same as SVN's 1.1. :)

I sit corrected...

-- Hal

> Sasha


From halr at voltaire.com  Fri Dec  1 08:18:05 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 11:18:05 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E25B@EPEXCH2.qlogic.org>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E25B@EPEXCH2.qlogic.org>
Message-ID: <1164989866.11808.181175.camel@hal.voltaire.com>

On Thu, 2006-11-30 at 17:41, Todd Rimmer wrote:
> > From: Roland Dreier [mailto:rdreier at cisco.com]
> > Sent: Thursday, November 30, 2006 5:32 PM
> > To: Todd Rimmer
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> > 
> >  > Proposed solution:
> >  > - add an IPoIB configuration parameter.  This parameter could redirect
> >  > the Solicited Node Multicast traffic to the IPv6 All Nodes multicast
> >  > address (IB GID 0xff01601B.....0000001)
> > 
> > This is silly however.  For one thing you are now not following the
> > RFC, and compliant IPv6 over IPoIB stacks will send neighbour
> > discovery messages to the solicited node address, so they won't be
> > received since the node didn't join.
> > 
> > There's no requirement that a SM assign a unique MLID to each
> > multicast group.  The obvious solution to the problem is simply that
> > the SM reuse MLIDs for solicited node multicast groups, perhaps even
> > collapsing all of them down to 1 MLID.
> > 
> 
> I think it's worth discussing a number of alternatives.  I'm not sure
> there is an ideal solution here.
> 
> Doesn't an SM based solution produce other complications?
> - Such as the SM/SA must maintain an extremely large list of Multicast
> Member records (potentially N^2).

Certainly O(N) groups where N is the number of IPv6 hosts (and each
group is 1 or more MCMs).

> - Host nodes will be joining N multicast groups and maintaining
> membership in them (potentially further stressing the SA, etc)

Do all IPv6 nodes join all the solicited node groups? I don't see this
occurring (so far) on the subnets I have seen.

> Not to mention that the SM would then need to know about IPoIB GID
> addressing conventions (which seems like a violation of network layers,
> etc).

There's already the IPv6 signature as part of the MGID to help with this
layering violation. Some SMs already do things with this.

-- Hal

> Todd Rimmer


From halr at voltaire.com  Fri Dec  1 08:20:15 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 11:20:15 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061130230136.GB32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
Message-ID: <1164989940.11808.181179.camel@hal.voltaire.com>

On Thu, 2006-11-30 at 18:01, Jason Gunthorpe wrote:
> On Thu, Nov 30, 2006 at 05:29:16PM -0500, Hal Rosenstock wrote:
> 
> > > IPV6 defines that each node will have a Solicited Node Multicast
> > > address.  This address is unique per node and is constructed from the
> > > IPV6 unicast address of the node.  (see RFC 2373 for more details).
> > > 
> > > IP over IB defines that IPV6 multicast addresses map to IB multicast
> > > GIDs in a one to one manner.
> > > 
> > > IB defines a multicast address space limit of 4095 LIDs.
> > 
> > actually it is 16K-1
> 
> For IPv6 only the lower 24 bits of each assigned IPv6 address are
> used to construct a solicited node multicast in the range 
> FF02::1:FF00:0/104. The Solicited Node Multicast address is not
> expected to be uniquely subscribed.

Any idea on how many would subscribe ? What does this depend on ?
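
The construction described in the quoted text can be sketched as follows:
the solicited-node group is the fixed prefix FF02::1:FF00:0/104 followed by
the low 24 bits of the unicast address. The function name
`solicited_node_addr` is made up for this illustration; addresses are plain
16-byte arrays in network byte order.

```c
#include <stdint.h>
#include <string.h>

/* Build the solicited-node multicast address for an IPv6 unicast address.
 * Only the last 3 bytes of the unicast address are used, so any two
 * addresses that share their low 24 bits land in the same group. */
void solicited_node_addr(const uint8_t unicast[16], uint8_t out[16])
{
    /* FF02:0000:0000:0000:0000:0001:FFxx:xxxx, first 13 bytes (104 bits) */
    static const uint8_t prefix[13] = {
        0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
    };

    memcpy(out, prefix, sizeof prefix);
    memcpy(out + 13, unicast + 13, 3);  /* low-order 24 bits */
}
```

For example, fe80::202:b3ff:fe1e:8329 maps to ff02::1:ff1e:8329, and so does
any other address ending in 1e:8329. How many nodes share a group therefore
depends on how the low 24 bits of the assigned addresses are distributed.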

> > MGIDs are different from MLIDs. Multiple MGIDs can be mapped onto a
> > single MLID if the characteristics are the same. Is that the case for
> > the IPv6 groups ?
> 
> The solicited node multicast feature is intended for scalability by
> having the switching core prune ND queries. It is OK if the multicast
> goes to more nodes than subscribe to it (this happens on cheap
> ethernet switch gear without multicast support anyhow).

And a similar thing is accommodated within IB. With limited MFT space,
collapsing multiple (similar) MGRPs (MGIDs) onto a single MLID
seems important (and reduces some of the scalability issues Todd
mentioned in terms of IPv6).

> I think the thing to do here is for the SM to have an option to
> compress a particular MGID range (using a hash of some kind). Ie
> configure so that all of IPv6 FF02::1:FF00:0/104 will use at most 16
> MLIDs.

Yes, that is one strategy which seems reasonable to me.
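
As a rough illustration of the hash-based compression described above
(a sketch only: no SM is claimed to work this way, and the function
name, the base MLID and the hash constant are all made up for the
example), the low 24 bits of a solicited-node MGID could be folded onto
a small pool of MLIDs. It assumes the low 24 bits of the IPv6 group
address sit in the last three bytes of the MGID:

```c
#include <stdint.h>

#define MLID_MIN 0xC000u  /* first multicast LID */
#define MLID_MAX 0xFFFEu  /* last multicast LID */

/*
 * Hypothetical helper: pick one of `nbuckets` MLIDs, starting at
 * `base_mlid`, for a solicited-node MGID.  Returns 0 if the requested
 * pool is empty or does not fit inside the multicast LID range.
 */
uint16_t snm_mgid_to_mlid(const uint8_t mgid[16], uint16_t base_mlid,
                          uint16_t nbuckets)
{
    uint32_t low24, h;

    if (nbuckets == 0 || base_mlid < MLID_MIN ||
        (uint32_t)base_mlid + nbuckets - 1 > MLID_MAX)
        return 0;

    /* Low 24 bits of the group address (assumed to be bytes 13..15). */
    low24 = ((uint32_t)mgid[13] << 16) | ((uint32_t)mgid[14] << 8) | mgid[15];

    /* Simple multiplicative hash; the product wraps modulo 2^32. */
    h = low24 * 2654435761u;

    return (uint16_t)(base_mlid + (h >> 16) % nbuckets);
}
```

With base_mlid = 0xC000 and nbuckets = 16, every group in
FF02::1:FF00:0/104 would land on one of the MLIDs 0xC000..0xC00F.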

> That way the site can select that some MGID's get mapped directly to
> MLIDs and others get shared to save LID space.
> 
> Then if you still run out it can randomly combine MGIDs into MLIDs.

Yes, that's another wrinkle.

-- Hal

> Jason


From robert.j.woodruff at intel.com  Fri Dec  1 09:04:42 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 1 Dec 2006 09:04:42 -0800
Subject: [openib-general] openMPI for 2.6.17.10 kernel
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C013F662B@orsmsx418.amr.corp.intel.com>

Matt wrote,
>  OMPI also has a uDAPL network device (along with a device that uses
>verbs directly).  So if we just use OMPI uDAPL it should work over
>iWarp?

>  - Matt

This should just work. (famous last words). 
For OFED 1.2, since the iWarp support will be in the base
kernel (2.6.19), it should be easier to test to make sure that uDAPL
works both over IB and iWarp as expected. Once this is tested and
any issues fixed, Intel MPI, HPMPI, and OMPI (if it has a uDAPL driver)
should all work over iWarp in addition to IB. 

woody


From bos at pathscale.com  Fri Dec  1 09:13:11 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Fri, 01 Dec 2006 09:13:11 -0800
Subject: [openib-general] [PATCH 0 of 2] Add memcpy_cachebypass,
 a memcpy that doesn't cache reads
In-Reply-To: <20061130213820.5ed22d81.akpm@osdl.org>
References: <patchbomb.1164843307@eng-12.pathscale.com>
	<20061130213820.5ed22d81.akpm@osdl.org>
Message-ID: <457062A7.20504@pathscale.com>

Andrew Morton wrote:
> The name memcpy_cachebypass() doesn't tell us whether it bypasses caching
> on the source, the dest or both.  It'd be nice if it did.
>   
Yep, I'll fix that and resubmit.

    <b


From hycsw at sandia.gov  Fri Dec  1 09:18:42 2006
From: hycsw at sandia.gov (Chen, Helen Y)
Date: Fri, 1 Dec 2006 10:18:42 -0700
Subject: [openib-general] openMPI for 2.6.17.10 kernel
References: <754FC8FE0A97A94B906344259F447D4A0413F811@ES23SNLNT.srn.sandia.gov>
	<456E991C.4040907@dev.mellanox.co.il>
	<CE1EA5D4-D3B5-42E7-B168-6ECD43018852@cisco.com>
	<1164904991.7247.44.camel@trinity.ogc.int>
	<3EF52E87-47E9-4F0C-AA0D-C2CAA63DFC7C@cisco.com>
	<936E0840-D941-4BB4-A3C4-CE410D90E0E5@cisco.com>
	<1164911424.11779.46.camel@stevo-desktop>
	<E3BBE216-C1AA-4CFD-88F9-C63D95779BE5@cisco.com>
	<1164916697.11779.84.camel@stevo-desktop>
	<5251E729-5FC0-48B5-9399-0C9466F8A2A2@cisco.com>
	<1164917426.11779.87.camel@stevo-desktop>
	<754FC8FE0A97A94B906344259F447D4A0413F81D@ES23SNLNT.srn.sandia.gov>
	<1164983728.6872.5.camel@stevo-desktop>
Message-ID: <754FC8FE0A97A94B906344259F447D4A0413F825@ES23SNLNT.srn.sandia.gov>

Thanks,
 
Helen

________________________________

From: Steve Wise [mailto:swise at opengridcomputing.com]
Sent: Fri 12/1/2006 7:35 AM
To: Chen, Helen Y
Cc: Jeff Squyres; openib-general at openib.org; Leininger, Matthew L
Subject: RE: [openib-general] openMPI for 2.6.17.10 kernel


On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
>
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now.  But how do I compile the userspace packages from that
> stack?
>
You build and install the userspace libraries from the iwarp stable
branch.  This will install all the needed header files to build other
packages that depend on them.  Like mvapich2-0.9.8, for instance.

If rping is working for you, then you've already done this.  The user
libs and header files are all installed in /usr/local by default.  If
you have /usr/local/include/rdma/rdma_cma.h, for instance, you've
probably already installed the userspace stuff from the iwarp stable
branch.

To build and install the user libs from the iwarp branch, please see the
wiki howto.  There is a section describing installing the userspace
libraries.

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Hope this helps...


Steve.



From jgunthorpe at obsidianresearch.com  Fri Dec  1 10:37:17 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 11:37:17 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1164989940.11808.181179.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
	<1164989940.11808.181179.camel@hal.voltaire.com>
Message-ID: <20061201183717.GC32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote:
> > For IPv6 only the lower 24 bits of each assigned IPv6 address are
> > used to construct a solicited node multicast in the range 
> > FF02::1:FF00:0/104. The Solicited Node Multicast address is not
> > expected to be uniquely subscribed.
> 
> Any idea on how many would subscribe ? What does this depend on ?

Each node subscribes to a SNM on an interface for each IPv6 address on
that interface. In most cases that should mean 1 subscription per
interface, but more is possible..

Generally IPv6 addresses should be constructed based on the EUI64 of
the IB interface. In this case the lower 24 bits of the SNM will be
the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
cluster-unique..
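
For illustration, the construction described above (keep the low 24
bits of the unicast address, prepend FF02::1:FF00:0/104) can be
sketched in a few lines. The function name is made up for this example
and addresses are assumed to be plain 16-byte arrays in network order:

```c
#include <stdint.h>
#include <string.h>

/*
 * Build the solicited-node multicast address FF02::1:FFxx:xxxx for a
 * 16-byte unicast IPv6 address (hypothetical helper, not from any
 * existing library).
 */
void ipv6_solicited_node(const uint8_t addr[16], uint8_t snm[16])
{
    memset(snm, 0, 16);
    snm[0]  = 0xFF;                  /* FF02: link-local multicast */
    snm[1]  = 0x02;
    snm[11] = 0x01;                  /* ...:0001:FFxx:xxxx */
    snm[12] = 0xFF;
    memcpy(&snm[13], &addr[13], 3);  /* low 24 bits of the unicast address */
}
```

Two interfaces end up in the same group only if their addresses share
the last three bytes, which is why EUI64-derived addresses tend to give
one group per node.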

Here is another thought.. Is there anything in the spec that says a
MGID must map to a MLID? If there is a single subscription why not
just do away with the MLID and return a unicast LID of the only
subscriber? That would probably solve 90% of the IPv6 issue Todd
pointed out. MGID compression would take care of the rest..

Jason


From halr at voltaire.com  Fri Dec  1 10:53:45 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 13:53:45 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061201183717.GC32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
	<1164989940.11808.181179.camel@hal.voltaire.com>
	<20061201183717.GC32366@obsidianresearch.com>
Message-ID: <1164999211.11808.186439.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 13:37, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 11:20:15AM -0500, Hal Rosenstock wrote:
> > > For IPv6 only the lower 24 bits of each assigned IPv6 address are
> > > used to construct a solicited node multicast in the range 
> > > FF02::1:FF00:0/104. The Solicited Node Multicast address is not
> > > expected to be uniquely subscribed.
> > 
> > Any idea on how many would subscribe ? What does this depend on ?
> 
> Each node subscribes to a SNM on an interface for each IPv6 address on
> that interface. In most cases that should mean 1 subscription per
> interface, but more is possible..

> Generally IPv6 addresses should be constructed based on the EUI64 of
> the IB interface. In this case the lower 24 bits of the SNM will be
> the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> cluster-unique..

It seems to depend on the low 24 bits of the IPv6 addresses in the
subnet being the same (as to whether there is more than 1 member of
these groups).

> Here is another thought.. Is there anything in the spec that says a
> MGID must map to a MLID?

Yes. Here's the first one:
p.149 line 3-8
The multicast LID range is a flat identifier space defined as 0xC000 to
0xFFFE.
The DLID for any packet which contains a multicast GID shall be within
the above specified multicast LID range.

I'm sure there are others in the spec if I looked further...
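
As a small sketch of what the quoted range means in code (the macro and
function names are made up for this example, they are not taken from
any IB header):

```c
#include <stdbool.h>
#include <stdint.h>

#define IB_MLID_FIRST 0xC000u  /* start of the multicast LID range */
#define IB_MLID_LAST  0xFFFEu  /* end of the multicast LID range */

/* True if `lid` falls inside the multicast LID range quoted above. */
bool ib_lid_is_multicast(uint16_t lid)
{
    return lid >= IB_MLID_FIRST && lid <= IB_MLID_LAST;
}

/* Number of MLIDs available: 0xFFFE - 0xC000 + 1 = 16383 (16K-1). */
unsigned ib_mlid_count(void)
{
    return IB_MLID_LAST - IB_MLID_FIRST + 1;
}
```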

>  If there is a single subscription why not
> just do away with the MLID and return a unicast LID of the only
> subscriber?

The current spec requirements :-( But this is an interesting idea and
may warrant further consideration.

-- Hal

>  That would probably solve 90% of the IPv6 issue Todd
> pointed out. MGID compression would take care of the rest..
> 
> Jason
> 


From jgunthorpe at obsidianresearch.com  Fri Dec  1 11:24:12 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 12:24:12 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1164999211.11808.186439.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
	<1164989940.11808.181179.camel@hal.voltaire.com>
	<20061201183717.GC32366@obsidianresearch.com>
	<1164999211.11808.186439.camel@hal.voltaire.com>
Message-ID: <20061201192412.GD32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote:
> > Generally IPv6 addresses should be constructed based on the EUI64 of
> > the IB interface. In this case the lower 24 bits of the SNM will be
> > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> > cluster-unique..
> 
> It seems to depend on the low 24 bits of the IPv6 addresses in the
> subnet being the same (as to whether there is more than 1 member of
> these groups).

Correct. It is common practice for all IPv6 addresses to have the
lower 64 bits be the EUI64 of the interface. The administrator can
assign a different address, but that could be discouraged for
scalability reasons.

> > Here is another thought.. Is there anything in the spec that says a
> > MGID must map to a MLID?
> 
> Yes. Here's the first one:
> p.149 line 3-8

Hmm. That's a shame. It is a conformance statement too :< At least the
acceptance statements in C9 page 279+ don't specify to check that a
MGID is matched with a MLID, so it should work with current
hardware.

Jason


From halr at voltaire.com  Fri Dec  1 11:28:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 14:28:55 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061201192412.GD32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
	<1164989940.11808.181179.camel@hal.voltaire.com>
	<20061201183717.GC32366@obsidianresearch.com>
	<1164999211.11808.186439.camel@hal.voltaire.com>
	<20061201192412.GD32366@obsidianresearch.com>
Message-ID: <1165001303.11808.187609.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 14:24, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 01:53:45PM -0500, Hal Rosenstock wrote:
> > > Generally IPv6 addresses should be constructed based on the EUI64 of
> > > the IB interface. In this case the lower 24 bits of the SNM will be
> > > the lower 24 bits of the EUI64. Thus in many cases the SNMs will be
> > > cluster-unique..
> > 
> > It seems to depend on the low 24 bits of the IPv6 addresses in the
> > subnet being the same (as to whether there is more than 1 member of
> > these groups).
> 
> Correct. It is common practice for all IPv6 addresses to have the
> lower 64 bits be the EUI64 of the interface. The administrator can
> assign a different address, but that could be discouraged for
> scalability reasons.
> 
> > > Here is another thought.. Is there anything in the spec that says a
> > > MGID must map to a MLID?
> > 
> > Yes. Here's the first one:
> > p.149 line 3-8
> 
> Hmm. That's a shame.

I think there are other issues with this and haven't thought about it
enough. What happens if a second node joins that group (as the low 24
bits match) ? How would the LID be revoked and changed to an MLID ?
There's more spec checking to do here...

>  It is a conformance statement too :< At least the
> acceptance statements in C9 page 279+ don't specify to check that a
> MGID is matched with a MLID

I would say that's a hole in the spec right now...

>  so at least it should work with current
> hardware.

I would use the word might rather than should in that last sentence.

-- Hal

> Jason


From todd.rimmer at qlogic.com  Fri Dec  1 11:42:09 2006
From: todd.rimmer at qlogic.com (Todd Rimmer)
Date: Fri, 1 Dec 2006 13:42:09 -0600
Subject: [openib-general] IPv6 and IPoIB scalability issue
Message-ID: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>

> From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com]
> Sent: Friday, December 01, 2006 1:37 PM
> To: Hal Rosenstock
> Cc: Todd Rimmer; openib-general at openib.org
> Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> 
> 
> Here is another thought.. Is there anything in the spec that says a
> MGID must map to a MLID? If there is a single subscription why not
> just do away with the MLID and return a unicast LID of the only
> subscriber? That would probably solve 90% of the IPv6 issue Todd
> pointed out. MGID compression would take care of the rest..
> 

Summary of alternatives and trade-offs.  Let's assume a 2000 node cluster
for analysis.

Option 1: use All-Nodes Multicast
  Non-standard for IPoIB
  small change to IPoIB code only
  Works with all existing SMs
  total of 5 MGIDs in cluster
  5 multicast subscriptions per node
  total of 10,000 multicast member records in SA for fabric

Option 2: compress MGID to MLID mapping
  Standard for IPoIB
  modification of SMs required, significant change
  configuration of MGID space in SM to consider for compression may be
    required
  total of 2005 MGIDs in cluster
  up to 2005 multicast subscriptions per node (sender only for
    Solicited Node initiators)
  total of 2000*2005 (4,010,000) multicast member records in SA for
    fabric

Option 3: compress MGID to MLID mapping, use Unicast for Solicited Node
MGIDs
  Standard for IPoIB
  not clear if standard for IB
  modification of SMs required, significant change
  configuration of MGID space in SM to consider for compression may be
    required
  configuration of MGID space in SM to use for unicast may be required
  total of 2005 MGIDs in cluster
  up to 2005 multicast subscriptions per node (sender only for
    Solicited Node initiators)
  total of 2000*2005 (4,010,000) multicast member records in SA for
    fabric

Thus far, option 2 is the most standard, option 3 may be standard, and
option 1 has the best scalability for the SM.

It seems worthwhile to implement option 1 (which should be approx 10-20
lines of code in IPoIB) and continue to pursue option 2 and 3 as SM
features.  Then customers can choose which option works best for them.
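
The member record totals above are just nodes times subscriptions per
node. A minimal sketch of that arithmetic, using a function name made
up for this example:

```c
#include <stdint.h>

/*
 * Worst-case number of multicast member records the SA holds when each
 * of `nodes` nodes subscribes to `groups_per_node` groups.
 */
uint64_t sa_member_records(uint32_t nodes, uint32_t groups_per_node)
{
    return (uint64_t)nodes * groups_per_node;
}
```

For the 2000 node cluster assumed here, sa_member_records(2000, 5)
gives 10,000 for option 1 and sa_member_records(2000, 2005) gives
4,010,000 for options 2 and 3.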

Todd Rimmer


From jgunthorpe at obsidianresearch.com  Fri Dec  1 11:46:21 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 12:46:21 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165001303.11808.187609.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E256@EPEXCH2.qlogic.org>
	<1164925747.11808.144971.camel@hal.voltaire.com>
	<20061130230136.GB32366@obsidianresearch.com>
	<1164989940.11808.181179.camel@hal.voltaire.com>
	<20061201183717.GC32366@obsidianresearch.com>
	<1164999211.11808.186439.camel@hal.voltaire.com>
	<20061201192412.GD32366@obsidianresearch.com>
	<1165001303.11808.187609.camel@hal.voltaire.com>
Message-ID: <20061201194621.GE32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 02:28:55PM -0500, Hal Rosenstock wrote:

> I think there are other issues with this and haven't thought about it
> enough. What happens if a second node joins that group (as the low 24
> bits match) ? How would the LID be revoked and changed to an MLID ?
> There's more spec checking to do here...

Oh, right, yeah revoking is pretty serious! Oh well.

Jason


From halr at voltaire.com  Fri Dec  1 12:07:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 15:07:23 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
Message-ID: <1165003608.11808.188882.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 14:42, Todd Rimmer wrote:
> > From: Jason Gunthorpe [mailto:jgunthorpe at obsidianresearch.com]
> > Sent: Friday, December 01, 2006 1:37 PM
> > To: Hal Rosenstock
> > Cc: Todd Rimmer; openib-general at openib.org
> > Subject: Re: [openib-general] IPv6 and IPoIB scalability issue
> > 
> > 
> > Here is another thought.. Is there anything in the spec that says a
> > MGID must map to a MLID? If there is a single subscription why not
> > just do away with the MLID and return a unicast LID of the only
> > subscriber? That would probably solve 90% of the IPv6 issue Todd
> > pointed out. MGID compression would take care of the rest..
> > 
> 
> Summary of alternatives and trade-offs.  Let's assume a 2000 node cluster
> for analysis.
> 
> Option 1 use ALL Nodes Multicast
> Non standard for IPoIB
> small change to IPoIB code only
> Works with all existing SMs
> total of 5 MGIDs in cluster
> 5 Multicast subscriptions per node
> total of 10,000 multicast member records in SA for fabric

IMO if you want to go down this direction, the place to discuss it is on
the ipoib IETF mailing list. It is still active although dormant or very
sleepy.

> Option 2 compress MGID to MLID mapping
> Standard for IPoIB
> modification of SMs required, significant change

Significant in what respect ? The code changes are reasonably simple, I
think. Is it from the perspective of upgrading SMs in the field for
this ? I think it is a feature for better IPv6 support.

> configuration of MGID space in SM to consider for compression may be
> required
> total of 2005 MGIDs in cluster
> up to 2005 multicast subscriptions per node (sender only for Solicited
> Node initiators)

Does the node subscribe to every IPv6 SN group ?

> total of 2000*2005 (4,010,000) multicast member records in SA for fabric

This is based on the above (which I'm not sure about) and is the worst
theoretical case, not the practical case.

> Option 3 compress MGID to MLID mapping, use Unicast for Solicited Node
> MGIDs
> Standard for IPoIB
> not clear if standard for IB

More issues than this

> modification of SMs required, significant change

At first glance, there are more issues here than option 2 in terms of SM
(and client operation).

> configuration of MGID space in SM to consider for compression may be
> required
> configuration of MGID space in SM to use for unicast may be required
> total of 2005 MGIDs in cluster
> up to 2005 multicast subscriptions per node (sender only for Solicited
> Node initiators)
> total of 2000*2005 (4,010,000) multicast member records in SA for fabric
> 
> Thus far, option 2 is the most standard, option 3 may be standard, and
> option 1 has the best scalability for the SM.
> 
> It seems worthwhile to implement option 1 (which should be approx 10-20
> lines of code in IPoIB) and continue to pursue option 2 and 3 as SM
> features.  Then customers can choose which option works best for them.

I think before pursuing option 1 there needs to be a discussion with the
IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).

-- Hal

> Todd Rimmer


From halr at voltaire.com  Fri Dec  1 12:32:09 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 15:32:09 -0500
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
Message-ID: <1165005117.11808.189660.camel@hal.voltaire.com>

OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate
unneeded lock acquisition

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index f7f879b..d6c6968 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1459,6 +1459,8 @@ __osm_mcmr_rcv_leave_mgrp(
           new_join_state | (p_mcm_port->scope_state & 0xf0);
 
         mcmember_rec.scope_state = p_mcm_port->scope_state;
+
+        CL_PLOCK_RELEASE( p_rcv->p_lock );
       }
       else
       {
@@ -1475,10 +1477,6 @@ __osm_mcmr_rcv_leave_mgrp(
                    "__osm_mcmr_rcv_leave_mgrp: ERR 1B09: "
                    "osm_sm_mcgrp_leave failed\n" );
         }
-
-        CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock);
-        /* Note: The deletion of the mgrp itself will be done in the callback
-           for the multicast tree updating (osm_mcast_mgr_process_mgrp_cb) */
       }
     }
     else
@@ -1511,8 +1509,6 @@ __osm_mcmr_rcv_leave_mgrp(
     goto Exit;
   }
 
-  CL_PLOCK_RELEASE( p_rcv->p_lock );
-
   /* Send an SA response */
   __osm_mcmr_rcv_respond( p_rcv, p_madw, &mcmember_rec );
 

From ralph.campbel at qlogic.com  Fri Dec  1 12:39:16 2006
From: ralph.campbel at qlogic.com (Ralph Campbell)
Date: Fri, 1 Dec 2006 12:39:16 -0800 (PST)
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
Message-ID: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>

> On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
>> > So what did you change since v1?  How do you deal with fitting 64-bit
>> > addresses into an sg list entry that has a 32-bit dma_addr_t?
>
>> The ipath_map_sg() handler for ib_dma_map_sg() doesn't store
>> anything in the struct scatterlist.  The translation is
>> done when ipath_sg_dma_address() is called which now
>> returns u64 instead of dma_addr_t thus avoiding the truncation
>> problem.
>
> And there is this open/TODO of calling kmap(page) at dma mapping time
> (or when ipath_sg_dma_address is called) and kunmap(page) at dma
> unmapping time, where you must store the kvaddr between the two calls
> and the sg does not have room for it where dma_addr_t is u32 and
> kvaddr is u64 ....

Although the driver compiles on 32-bit kernels, it is unsupported
there and has never been tested. All known 64-bit systems don't define
CONFIG_HIGHMEM.  In spite of previous emails suggesting that
page_address() can return NULL without CONFIG_HIGHMEM defined,
the code in include/linux/mm.h doesn't allow it (assuming the
page pointer is valid and not some random address).
I verified this with Andrew Morton.

I don't see value in adding code which will be unsupported
and untested.

>> All of the callers to ib_dma_map_single(), ib_dma_map_page(),
>> and ib_sg_dma_address() have been modified to save the address
>> in a u64 instead of a dma_addr_t.  This actually wasn't much
>> of a change since the address was being cast to u64 anyway
>> when assigned to struct sge.addr.
>
> It fixes a bug, so it actually is a significant change. Without it,
> on an arch as mentioned above, ipath_dma_map_single would return only a
> u32 portion of the kvaddr and later the ulp code would place this
> chopped address in sge.addr and the ipath driver would use the wrong
> address.
>
> Or.

I only meant that the change was minor compared to the previous
patches sent.  Of course, fixing a bug is important and not minor.
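
To make the truncation being discussed concrete, here is a tiny
user-space sketch. It is not driver code: dma_addr32_t is a stand-in
typedef for a 32-bit dma_addr_t and both function names are made up for
the example.

```c
#include <stdint.h>

/* Stand-in for dma_addr_t on a configuration where it is 32 bits wide. */
typedef uint32_t dma_addr32_t;

/* Store a 64-bit kernel virtual address in the 32-bit type and read it back. */
uint64_t roundtrip_via_dma32(uint64_t kvaddr)
{
    dma_addr32_t d = (dma_addr32_t)kvaddr;  /* upper 32 bits are dropped here */

    return d;
}

/* Nonzero only if the address survives the round trip unchanged. */
int survives_dma32(uint64_t kvaddr)
{
    return roundtrip_via_dma32(kvaddr) == kvaddr;
}
```

An address such as 0xFFFF810012345678 comes back as 0x12345678, which
is the "chopped address" case; carrying the value in a u64 end to end
avoids it.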


From elsen_david at yahoo.com  Fri Dec  1 12:50:15 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 12:50:15 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <754FC8FE0A97A94B906344259F447D4A0413F825@ES23SNLNT.srn.sandia.gov>
Message-ID: <248325.81711.qm@web58001.mail.re3.yahoo.com>

Steve,
   
  Is this  https://openfabrics.org/svn/gen2/branches/iwarp/  the iWARP stable branch? 
   
  I do not see some of the libraries (such as librdmacm) that mvapich2-0.9.8 needs being created on the Fedora 6 distribution with the 2.6.17.13 kernel.

  David
  
"Chen, Helen Y" <hycsw at sandia.gov> wrote:
        Thanks,
   
  Helen

  
---------------------------------
  From: Steve Wise [mailto:swise at opengridcomputing.com]
Sent: Fri 12/1/2006 7:35 AM
To: Chen, Helen Y
Cc: Jeff Squyres; openib-general at openib.org; Leininger, Matthew L
Subject: RE: [openib-general] openMPI for 2.6.17.10 kernel


    On Thu, 2006-11-30 at 16:24 -0700, Chen, Helen Y wrote:
> Steve,
>
> As you know, I have my rnfs kernel running the stable iwarp-stack on
> my cluster now.  But how do I compile the userspace packages from that
> stack?
>
You build and install the userspace libraries from the iwarp stable
branch.  This will install all the needed header files to build other
packages that depend on them.  Like mvapich2-0.9.8, for instance.

If rping is working for you, then you've already done this.  The user
libs and header files are all installed in /usr/local by default.  If
you have /usr/local/include/rdma/rdma_cma.h, for instance, you've
probably already installed the userspace stuff from the iwarp stable
branch.

To build and install the user libs from the iwarp branch, please see the
wiki howto.  There is a section describing installing the userspace
libraries.

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Hope this helps...


Steve.



From swise at opengridcomputing.com  Fri Dec  1 12:54:20 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 01 Dec 2006 14:54:20 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <248325.81711.qm@web58001.mail.re3.yahoo.com>
References: <248325.81711.qm@web58001.mail.re3.yahoo.com>
Message-ID: <1165006460.6872.59.camel@stevo-desktop>


On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote:
> Steve,
>  
> Is this  https://openfabrics.org/svn/gen2/branches/iwarp/  the iWARP
> stable branch? 
>  
> I do not see some of the libraries (such as librdmacm) that
> mvapich2-0.9.8 needs being created on the Fedora 6 distribution with
> the 2.6.17.13 kernel.
> 
> David
> 

The stable release of the iWARP branch is here:

https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable


Instructions on setting this up with Chelsio's T3 device are here:

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Steve.


From elsen_david at yahoo.com  Fri Dec  1 13:14:52 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 13:14:52 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165006460.6872.59.camel@stevo-desktop>
Message-ID: <20061201211452.6831.qmail@web58004.mail.re3.yahoo.com>

thanks

Steve Wise <swise at opengridcomputing.com> wrote:  

On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote:
> Steve,
> 
> Is this https://openfabrics.org/svn/gen2/branches/iwarp/ the iWARP
> stable branch? 
> 
> I do not see some of the libraries (such as librdmacm) that
> mvapich2-0.9.8 needs being created on the Fedora 6 distribution with
> the 2.6.17.13 kernel.
> 
> David
> 

The stable release of the iWARP branch is here:

https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable


Instructions on setting this up with Chelsio's T3 device are here:

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Steve.



From jgunthorpe at obsidianresearch.com  Fri Dec  1 13:47:15 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 14:47:15 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165003608.11808.188882.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
Message-ID: <20061201214715.GF32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:

> > configuration of MGID space in SM to consider for compression may
> > be required total of 2005 MGIDs in cluster up to 2005 multicast
> > subscriptions per node (sender only for Solicited Node initiators)
> 
> Does the node subscribe to every IPv6 SN group ?

A node will only use another node's SN group in a send-only fashion and
only when it is doing neighbour discovery for that node.

So at the worst case you potentially have N^2 send-only subscriptions,
N normal subscriptions and N groups.

If IPv6 SN multicast MLIDs are always routed in the fabric so that all
IPv6 nodes can be send-only then the send-only subscriptions don't
need to be considered. Presumably because of this send-only join and
unjoin can result in no data structure in the SM..

> I think before pursuing option 1 there needs to be a discussion with the
> IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).

Option 1 sounds difficult to me. It would be hard to have interop
between nodes using this optimization and nodes that don't..

Another approach would be to manipulate the IPv6 address of the node
so that the lower 24 bits are the same. That gets the same effect, but
I'm not sure how you'd go about doing it :>

Jason


From David.Costa at Sun.COM  Fri Dec  1 14:20:31 2006
From: David.Costa at Sun.COM (David Costa)
Date: Fri, 01 Dec 2006 17:20:31 -0500
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
Message-ID: <4570AAAF.8070701@Sun.Com>

Hello all,

I am running the HPCC benchmark on a Sun Blade 8000 blade server. I have 
two blades running RHEL4U3 and SLESSP3 respectively with 32 GBytes of 
memory each. The HPCC benchmark is running on a Sun-developed IB module 
that uses the Mellanox 25204 chips. When it gets to the MPIRandomAccess 
test, it immediately fails and I see the following messages listed below.

Does anyone know what the messages mean, and a possible  underlying 
cause?  Please reply to me directly as I am not subscribed to this list.

Thank you,

Dave Costa
david.costa at sun.com


[root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile 
/usr/local/bin/hpcc
24 - MPI_CANCEL : Internal MPI error!
[24] [] Aborting Program!
mpirun_rsh: Abort signaled from [24]
26 - MPI_CANCEL : Internal MPI error!
[26] [] Aborting Program!
15 - MPI_CANCEL : Internal MPI error!
[15] [] Aborting Program!
18 - MPI_CANCEL : Internal MPI error!
[18] [] Aborting Program!
22 - MPI_CANCEL : Internal MPI error!
[22] [] Aborting Program!
4 - MPI_CANCEL : Internal MPI error!
[4] [] Aborting Program!
13 - MPI_CANCEL : Internal MPI error!
[13] [] Aborting Program!
11 - MPI_CANCEL : Internal MPI error!
16 - MPI_CANCEL : Internal MPI error!
[16] [] Aborting Program!
[11] [] Aborting Program!
28 - MPI_CANCEL : Internal MPI error!
[28] [] Aborting Program!
[19] Abort: [an1-bl1:19] Got completion with error, code=12
 at line 2365 in file viacheck.c
[23] Abort: [an1-bl1:23] Got completion with error, code=12
 at line 2365 in file viacheck.c
[17] Abort: [an1-bl1:17] Got completion with error, code=12
 at line 2365 in file viacheck.c
done.

From rdreier at cisco.com  Fri Dec  1 14:26:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 14:26:12 -0800
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061201214715.GF32366@obsidianresearch.com> (Jason
	Gunthorpe's message of "Fri, 1 Dec 2006 14:47:15 -0700")
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201214715.GF32366@obsidianresearch.com>
Message-ID: <adahcwfmewr.fsf@cisco.com>

 > Option 1 sounds difficult to me. It would be hard to have interop
 > between nodes using this optimization and nodes that don't.

Yes, that is a major problem.

One intermediate thing we could do is to have nodes join their own
solicited-node group as a full member, but have other nodes send ND
messages to the all-nodes group.  Then the SM would only have O(N)
MCG memberships to maintain.  But it still requires the SM to be smart
about mapping multiple MCGs to a single MLID.

And even if that works, I'm not sure it's compliant with all the
relevant RFCs, and it might break in some strange situations...

(To be honest though, I think that the SM for a subnet with N nodes
should really be beefy enough to handle N^2 multicast memberships.
Even 10K nodes leads to only 100M group memberships, which shouldn't
be _that_ expensive with the right data structures)

 - R.


From boris at mellanox.com  Fri Dec  1 14:29:42 2006
From: boris at mellanox.com (Boris Shpolyansky)
Date: Fri, 1 Dec 2006 14:29:42 -0800
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
Message-ID: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>

Hi David,
 
If you are using the OFED-1.1 stack, with the OSU MVAPICH provided in the
OFED-1.1 package as your MPI layer,
the attached patch should solve your problem.
 
Please let me know if that helps.
 
Regards,
 
Boris Shpolyansky
Application Engineer
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smpi_cancel.patch
Type: application/octet-stream
Size: 1116 bytes
Desc: smpi_cancel.patch
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061201/28eca796/attachment.obj>

From rdreier at cisco.com  Fri Dec  1 14:28:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 14:28:01 -0800
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
In-Reply-To: <4570AAAF.8070701@Sun.Com> (David Costa's message of
	"Fri, 01 Dec 2006 17:20:31 -0500")
References: <4570AAAF.8070701@Sun.Com>
Message-ID: <adaac27metq.fsf@cisco.com>

 > 24 - MPI_CANCEL : Internal MPI error!

It might be useful to know what MPI implementation you're using...
(Also, knowing where you got your IB drivers and what version they are
wouldn't hurt either)

 - R.


From elsen_david at yahoo.com  Fri Dec  1 14:30:28 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 14:30:28 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165006460.6872.59.camel@stevo-desktop>
Message-ID: <190551.24739.qm@web58010.mail.re3.yahoo.com>

Hi Steve,
I am trying to use https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable
for the Ammasso card.

While compiling the libamso library, I got the following error:
make  all-am
make[1]: Entering directory `/usr/src/gen2/branches/iwarp/userspace/libamso'
if /bin/sh ./libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I.    -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF ".deps/src_amso_la-cq.Tpo" -c -o src_amso_la-cq.lo `test -f 'src/cq.c' || echo './'`src/cq.c; \
        then mv -f ".deps/src_amso_la-cq.Tpo" ".deps/src_amso_la-cq.Plo"; else rm -f ".deps/src_amso_la-cq.Tpo"; exit 1; fi
mkdir .libs
 gcc -DHAVE_CONFIG_H -I. -I. -I. -g -Wall -D_GNU_SOURCE -g -O2 -MT src_amso_la-cq.lo -MD -MP -MF .deps/src_amso_la-cq.Tpo -c src/cq.c  -fPIC -DPIC -o .libs/src_amso_la-cq.o
In file included from src/cq.c:42:
src/amso.h: In function 'to_amso_dev':
src/amso.h:83: warning: implicit declaration of function 'offsetof'
src/amso.h:83: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_ctx':
src/amso.h:88: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_pd':
src/amso.h:93: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_cq':
src/amso.h:98: error: expected expression before 'struct'
src/amso.h: In function 'to_amso_qp':
src/amso.h:103: error: expected expression before 'struct'
make[1]: *** [src_amso_la-cq.lo] Error 1
make[1]: Leaving directory `/usr/src/gen2/branches/iwarp/userspace/libamso'
make: *** [all] Error 2

which seems to be complaining about something in the amso.h file, in the following lines:

#define to_amso_xxx(xxx, type)                                          \
        ((struct amso_##type *)                                 \
         ((void *) ib##xxx - offsetof(struct amso_##type, ibv_##xxx)))

Can you let me know if I am missing something?
Thanks,
David

Steve Wise <swise at opengridcomputing.com> wrote: 

On Fri, 2006-12-01 at 12:50 -0800, david elsen wrote:
> Steve,
>  
> Is this  https://openfabrics.org/svn/gen2/branches/iwarp/  the iWARP
> stable branch? 
>  
> Some of the libraries (librdmacm) used by mvapich2-0.9.8 do not get
> built on the Fedora 6 distribution with the 2.6.17.13 kernel.
> 
> David
> 

The stable release of the iWARP branch is here:

https://openib.org/svn/gen2/branches/iwarp/tags/iwarp-2.6.17-stable


Instructions on setting this up with Chelsio's T3 device are here:

https://openib.org/tiki/tiki-index.php?page=Install+OpenIB+for+Chelsio+T3


Steve.



From elsen_david at yahoo.com  Fri Dec  1 14:40:04 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 14:40:04 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <190551.24739.qm@web58010.mail.re3.yahoo.com>
Message-ID: <142979.41727.qm@web58003.mail.re3.yahoo.com>

Steve,

I added 

#include <stddef.h>

to the amso.h file, and now it compiles.

David



From maya986 at 012.net.il  Fri Dec  1 13:54:52 2006
From: maya986 at 012.net.il (=?windows-1255?Q?=F7=EC=E9=F4=E9=ED_=EC=E0=E9=F8=E5=F2=E9=ED?=)
Date: Fri, 1 Dec 2006 23:54:52 +0200
Subject: [openib-general] =?windows-1255?b?6fkg7Oog8Onx6eXvIOHo7O749+jp?=
	=?windows-1255?b?8OI/ICAg7vLsIDExLDAwMCD5Iucg4efl4/kg7O764Onu6e0u?=
Message-ID: <4ac3c53551a1a31224e6d220001ab284@012.net.il>

Hello, and sorry for the interruption!

"שיר בעצמך" is looking for telephone sales representatives for marketing, admin and sales work.
 
* Work schedule: Sunday to Thursday, 9:00-18:00
* A young, dynamic, high-quality and supportive work environment.
* Salary: base plus commissions, over 11,000 NIS for suitable candidates.
 
Requirements:
* At least two years' tenure at a previous workplace.
* Experience in a telemarketing call center
* Willingness to work under pressure and at crazy hours.
* Strong persuasion skills
* Willingness to help people.
* A desire to succeed in a big way.

Job location: Tel Aviv
 
If you meet the requirements, send a detailed CV (listed by year) by return email.

With thanks,
Sigal A.
HR Manager,
"שיר בעצמך"
shir4u.co.il

   
From elsen_david at yahoo.com  Fri Dec  1 14:58:06 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 14:58:06 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <142979.41727.qm@web58003.mail.re3.yahoo.com>
Message-ID: <602111.87729.qm@web58007.mail.re3.yahoo.com>

Steve,

I can run rping, rdma_lat, etc. on the Ammasso card, but when I try to run mvapich2 (0.9.8-Release), I get a missing librdmacm.so error.

./mpdboot -n 1
debug: starting
/root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
running mpdallexit on ammasso1
LAUNCHED mpd on ammasso1 via  
debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
debug: mpd on ammasso1 on port 35352
RUNNING: mpd on ammasso1
debug: info for running mpd: {'ncpus': 1, 'list_port': 35352, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Thanks,
David


From elsen_david at yahoo.com  Fri Dec  1 14:58:12 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 14:58:12 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <142979.41727.qm@web58003.mail.re3.yahoo.com>
Message-ID: <803619.69421.qm@web58009.mail.re3.yahoo.com>

Steve,

I can run rping, rdma_lat, etc. on the Ammasso card, but when I try to run mvapich2 (0.9.8-Release), I get a missing librdmacm.so error.

./mpdboot -n 1
debug: starting
/root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries: librdmacm.so: cannot open shared object file: No such file or directory
running mpdallexit on ammasso1
LAUNCHED mpd on ammasso1 via  
debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
debug: mpd on ammasso1 on port 35352
RUNNING: mpd on ammasso1
debug: info for running mpd: {'ncpus': 1, 'list_port': 35352, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Thanks,
David


From surs at cse.ohio-state.edu  Fri Dec  1 14:57:14 2006
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Fri, 1 Dec 2006 17:57:14 -0500
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
Message-ID: <20061201225713.GA7343@cse.ohio-state.edu>

Hi Boris,

Thanks for forwarding the patch to the list. This patch was also added
to the MVAPICH svn repository (both trunk and 0.9.8 bugfix branches)
a few days back.

David: If you are using MVAPICH, you can check out from the SVN 0.9.8
bugfix branch too.

Thanks,
Sayantan.

* On Dec,3 Boris Shpolyansky<boris at mellanox.com> wrote :
> Hi David,
>  
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the OFED-1.1
> package as your MPI layer,
> the attached patch should solve your problem.
>  
> Please, let me know if that helped.
>  
> Regards,
>  
> Boris Shpolyansky
> Application Engineer
> Mellanox Technologies Inc.
> 2900 Stender Way
> Santa Clara, CA 95054
> Tel.: (408) 916 0014
> Fax: (408) 970 3403
> Cell: (408) 834 9365
> www.mellanox.com
-- 
http://www.cse.ohio-state.edu/~surs


From Thomas.Talpey at netapp.com  Fri Dec  1 14:57:21 2006
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 01 Dec 2006 17:57:21 -0500
Subject: [openib-general] NFS/RDMA for Linux: client and server update
	release 7
Message-ID: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com>

Network Appliance is pleased to announce release 7 of the NFS/RDMA
client and server for Linux 2.6.18. This update to the August release
fixes known issues, improves usability and server stability, and supports
NFSv4. The code supports both InfiniBand and iWARP transports over
the standard OpenFabrics Linux facility.

<http://sourceforge.net/projects/nfs-rdma/>

<http://sourceforge.net/project/showfiles.php?group_id=97628&package_id=213593>

This code is functionally similar to the previous RC6 release, with many
bugfixes and performance improvements applied. The client and server
now use port 2050 (instead of overloading the standard NFS/TCP 2049),
pending further discussion and official assignment as proposed in the most
recent IETF working group meeting. An alignment issue leading to performance
impact on IA64 architectures has been corrected in the server. Extensive
further testing on InfiniBand and iWARP was performed and (for example)
this NFS/RDMA code was demonstrated running Oracle 10g Real Application
Clusters at SuperComputing 2006 last month.

A full list of bugs resolved is available at the project's tracking page:

<http://sourceforge.net/tracker/?group_id=97628&atid=618583>

We welcome protocol comments, implementation comments and user
experience, directly or on any of the above mailing lists.

Tom Talpey, for the NFS/RDMA project.


From rdreier at cisco.com  Fri Dec  1 15:12:41 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 15:12:41 -0800
Subject: [openib-general] NFS/RDMA for Linux: client and server update
 release 7
In-Reply-To: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com> (
	Thomas Talpey's message of "Fri, 01 Dec 2006 17:57:21 -0500")
References: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com>
Message-ID: <aday7prky6u.fsf@cisco.com>

What is the status of moving this code towards merging to the upstream kernel?

Thanks,
  Roland


From swise at opengridcomputing.com  Fri Dec  1 15:14:31 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 01 Dec 2006 17:14:31 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <602111.87729.qm@web58007.mail.re3.yahoo.com>
References: <602111.87729.qm@web58007.mail.re3.yahoo.com>
Message-ID: <1165014871.6872.85.camel@stevo-desktop>

I haven't tested mvapich2 with Ammasso, but OSU has. I'm CCing their
dev team so maybe they can help.

Steve.


On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
> Steve,
> 
> I can run rping, rdma_lat etc on the Ammasso card but when I try to
> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
> 
> ./mpdboot -n 1
> debug: starting
> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
> librdmacm.so: cannot open shared object file: No such file or
> directory
> running mpdallexit on ammasso1
> LAUNCHED mpd on ammasso1 via  
> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
> debug: mpd on ammasso1 on port 35352
> RUNNING: mpd on ammasso1
> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}
> 
> Thanks,
> David
> 


From rdreier at cisco.com  Fri Dec  1 15:14:34 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 15:14:34 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	(Ralph Campbell's message of "Fri, 1 Dec 2006 12:39:16 -0800 (PST)")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
Message-ID: <adau00fky3p.fsf@cisco.com>

 > Although the driver compiles on 32-bit kernels, it is unsupported
 > and has never been tested. All known 64-bit systems don't define
 > CONFIG_HIGHMEM.  In spite of previous emails suggesting that
 > page_address() can return NULL without CONFIG_HIGHMEM defined,
 > the code in include/linux/mm.h doesn't allow it (assuming the
 > page pointer is valid and not some random address).
 > I verified this with Andrew Morton.

Hmm, is there no way to make this work on 32-bit kernels?  I don't
want to do something that we'll have to change again if we want to
make things work on 32-bits.

(And I know that qlogic has no intention of supporting the driver on
32-bit kernels, but we shouldn't make it impossible for someone else
to fix it)


From rdreier at cisco.com  Fri Dec  1 15:15:44 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 15:15:44 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1164918691.14800.101.camel@brick.pathscale.com> (Ralph
	Campbell's message of "Thu, 30 Nov 2006 12:31:31 -0800")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
Message-ID: <adapsb3ky1r.fsf@cisco.com>

Oh yeah, one other thing...

could you respin this so that all the new dma_xxx wrappers go into a
new file like <rdma/ib_dma_mapping.h> (and include that from
<rdma/ib_verbs.h>)?  ib_verbs.h is already too big I think.
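
For anyone following along without the patch open, the kind of wrapper
being discussed can be pictured roughly like this. This is only a
userspace mock of the interposing pattern, not the code from the patch:
every name below (mock_ib_device, mock_dma_ops, mock_ib_dma_map_single,
counting_map_single) is made up for illustration, and default_map_single
is just a stand-in for whatever the normal DMA mapping call would be.

```c
#include <stddef.h>
#include <stdint.h>

struct mock_ib_device;

/* Optional per-device hooks (hypothetical). A driver that does not use
 * real DMA could fill these in to interpose on mapping requests. */
struct mock_dma_ops {
	uint64_t (*map_single)(struct mock_ib_device *dev,
			       void *cpu_addr, size_t size);
};

struct mock_ib_device {
	const struct mock_dma_ops *dma_ops;	/* NULL: use the default path */
};

/* Stand-in for the platform's normal mapping call (assumption). */
static uint64_t default_map_single(void *cpu_addr, size_t size)
{
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr;
}

/* Wrapper callers would use: dispatch to the driver hook if one is set,
 * otherwise fall back to the default mapping. */
static inline uint64_t mock_ib_dma_map_single(struct mock_ib_device *dev,
					      void *cpu_addr, size_t size)
{
	if (dev->dma_ops && dev->dma_ops->map_single)
		return dev->dma_ops->map_single(dev, cpu_addr, size);
	return default_map_single(cpu_addr, size);
}

/* Example hook: counts how often the driver path is taken. */
static unsigned int hook_calls;

static uint64_t counting_map_single(struct mock_ib_device *dev,
				    void *cpu_addr, size_t size)
{
	(void)dev;
	(void)size;
	hook_calls++;
	return (uint64_t)(uintptr_t)cpu_addr;
}
```

Collecting wrappers of this shape in one small header keeps the dispatch
logic in a single place, which is the point of the request above.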


From jgunthorpe at obsidianresearch.com  Fri Dec  1 15:17:53 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 16:17:53 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165003608.11808.188882.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
Message-ID: <20061201231753.GG32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:

> > total of 2000*2005 (4,010,000) multicast member records in SA for fabric
> 
> This is based on the above (which I'm not sure about) and is the worst
> theoretical case, not the practical case.

It isn't in the IB spec, but what would really help here is to be able
to join a multicast prefix (more than 1 group with a single entry).

Todd's option 1 optimization is then easily realized by having all
IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries) as
full members. This provides interoperability between stacks with
this feature and stacks without it.

Option 2 works better as well because all the nodes join
FF02::1:FF00:0/104 as a send-only member on startup and then you only
get N*2 multicast records to maintain.
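
To put rough numbers on the difference, here is a small sketch. It
assumes the worst case quoted above is simply nodes times MGIDs
(2000*2005), and it reads "N*2" as one full-member join for a node's own
SN group plus one send-only prefix join per node; that reading and the
function names are assumptions for illustration only.

```c
#include <stdint.h>

/* Worst case from the thread: every node holds a member record for
 * every MGID in the cluster. */
static uint64_t worst_case_records(uint64_t nodes, uint64_t mgids)
{
	return nodes * mgids;
}

/* With a send-only prefix join (assumed: one full-member join for the
 * node's own SN group plus one prefix join), two records per node. */
static uint64_t prefix_join_records(uint64_t nodes)
{
	return nodes * 2;
}
```

For the 2000-node example, worst_case_records(2000, 2005) gives
4,010,000 records, while prefix_join_records(2000) gives 4,000.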

This also would improve the performance of IPv6 ND by not having to
join/leave the SN groups for each ND query.

IBA would have to be changed to support a prefix bits field in the
MCMemberRecord structure though..

Jason


From David.Costa at Sun.COM  Fri Dec  1 15:22:51 2006
From: David.Costa at Sun.COM (David Costa)
Date: Fri, 01 Dec 2006 18:22:51 -0500
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
Message-ID: <4570B94B.6030202@Sun.Com>

My apologies to everyone who replied. I am indeed using OFED 1.1 and the 
included OSU MVAPICH. I will try your patch on Monday, Boris, and reply to 
the list about how I made out.

Best Regards,

Dave Costa

Boris Shpolyansky wrote:
> Hi David,
>  
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the 
> OFED-1.1 package as your MPI layer,
> the attached patch should solve your problem.
>  
> Please, let me know if that helped.
>  
> Regards,
>  
> Boris Shpolyansky
> Application Engineer
> Mellanox Technologies Inc.
> 2900 Stender Way
> Santa Clara, CA 95054
> Tel.: (408) 916 0014
> Fax: (408) 970 3403
> Cell: (408) 834 9365
> www.mellanox.com
>
> ------------------------------------------------------------------------
> *From:* openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] *On Behalf Of *David Costa
> *Sent:* Friday, December 01, 2006 2:21 PM
> *To:* openib-general at openib.org; David.Costa at Sun.COM; Robert Houk; 
> Anthony Vinciguerra; Thomas Babbit
> *Subject:* [openib-general] HPCC benchmark aborts at MPIRandomAccess test
>
> Hello all,
>
> I am running the HPCC benchmark on a Sun Blade 8000 blade server. I 
> have two blades running RHEL4U3 and SLESSP3 respectively with 32 
> GBytes of memory each. The HPCC benchmark is running on a 
> Sun-developed IB module that uses the Mellanox 25204 chips. When it gets 
> to the MPIRandomAccess test, it immediately fails and I see the 
> following messages listed below.
>
> Does anyone know what the messages mean, and a possible  underlying 
> cause?  Please reply to me directly as I am not subscribed to this list.
>
> Thank you,
>
> Dave Costa
> david.costa at sun.com
>
>
> [root at an1-bl0 ~]# mpirun_rsh -rsh -np 32 -hostfile /root/hostfile 
> /usr/local/bin/hpcc
> 24 - MPI_CANCEL : Internal MPI error!
> [24] [] Aborting Program!
> mpirun_rsh: Abort signaled from [24]
> 26 - MPI_CANCEL : Internal MPI error!
> [26] [] Aborting Program!
> 15 - MPI_CANCEL : Internal MPI error!
> [15] [] Aborting Program!
> 18 - MPI_CANCEL : Internal MPI error!
> [18] [] Aborting Program!
> 22 - MPI_CANCEL : Internal MPI error!
> [22] [] Aborting Program!
> 4 - MPI_CANCEL : Internal MPI error!
> [4] [] Aborting Program!
> 13 - MPI_CANCEL : Internal MPI error!
> [13] [] Aborting Program!
> 11 - MPI_CANCEL : Internal MPI error!
> 16 - MPI_CANCEL : Internal MPI error!
> [16] [] Aborting Program!
> [11] [] Aborting Program!
> 28 - MPI_CANCEL : Internal MPI error!
> [28] [] Aborting Program!
> [19] Abort: [an1-bl1:19] Got completion with error, code=12
>  at line 2365 in file viacheck.c
> [23] Abort: [an1-bl1:23] Got completion with error, code=12
>  at line 2365 in file viacheck.c
> [17] Abort: [an1-bl1:17] Got completion with error, code=12
>  at line 2365 in file viacheck.c
> done. 


From halr at voltaire.com  Fri Dec  1 15:25:35 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 18:25:35 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061201214715.GF32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201214715.GF32366@obsidianresearch.com>
Message-ID: <1165015489.11808.195631.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 16:47, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:
> 
> > > configuration of MGID space in SM to consider for compression may
> > > be required total of 2005 MGIDs in cluster up to 2005 multicast
> > > subscriptions per node (sender only for Solicited Node initiators)
> > 
> > Does the node subscribe to every IPv6 SN group ?
> 
> A node will only use another node's SN group in a send-only fashion and
> only when it is doing neighbour discovery for that node.
> 
> So at the worst case you potentially have N^2 send-only subscriptions,
> N normal subscriptions and N groups.

Send only subscriptions are largely the same (in terms of SM/SA) as full
subscriptions except in a couple of details.

> If IPv6 SN multicast MLIDs are always routed in the fabric so that all
> IPv6 nodes can be send-only then the send-only subscriptions don't
> need to be considered. Presumably because of this send-only join and
> unjoin can result in no data structure in the SM..

There is a data structure associated with these memberships.

-- Hal

> > I think before pursuing option 1 there needs to be a discussion with the
> > IETF WG involving the RFC authors (Vivek Kashyap, Jerry Chu).
> 
> Option 1 sounds difficult to me. It would be hard to have interop
> between nodes using this optimization and nodes that don't..
> 
> Another approach would be to manipulate the IPv6 address of the node
> so that the lower 24 bits are the same. That gets the same effect, but
> I'm not sure how you'd go about doing it :>
> 
> Jason


From halr at voltaire.com  Fri Dec  1 15:32:34 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 18:32:34 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <adahcwfmewr.fsf@cisco.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201214715.GF32366@obsidianresearch.com>
	<adahcwfmewr.fsf@cisco.com>
Message-ID: <1165015925.11808.195836.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 17:26, Roland Dreier wrote:
>  > Option 1 sounds difficult to me. It would be hard to have interop
>  > between nodes using this optimization and nodes that don't..
> 
> Yes, that is a major problem.
> 
> One intermediate thing we could do is to have nodes join their own
> solicited-node group as a full member, but have other nodes send ND
> messages to the all-nodes group.  Then the SM would only have O(N)
> MCG memberships to maintain.  But it still requires the SM to be smart
> about mapping multiple MCGs to a single MLID.
> 
> And even if that works, I'm not sure it's compliant with all the
> relevant RFCs, and it might break in some strange situations...
> 
> (To be honest though, I think that the SM for a subnet with N nodes
> should really be beefy enough to handle N^2 multicast memberships.
> Even 10K nodes leads to only 100M group memberships, which shouldn't
> be _that_ expensive with the right data structures)

The data structures are one concern. The others would be routing N large
(multicast) trees and also the SA transaction rate this causes (similar
to the large path record request case).

-- Hal

>  - R.


From halr at voltaire.com  Fri Dec  1 15:47:38 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 01 Dec 2006 18:47:38 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061201231753.GG32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201231753.GG32366@obsidianresearch.com>
Message-ID: <1165016801.11808.196277.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 18:17, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 03:07:23PM -0500, Hal Rosenstock wrote:
> 
> > > total of 2000*2005 (4,010,000) multicast member records in SA for fabric
> > 
> > This is based on the above (which I'm not sure about) and is the worst
> > theoretical case, not the practical case.
> 
> It isn't in the IB spec, but what would really help here is to be able
> to join a multicast prefix (more than 1 group with a single entry).
> 
> Todd's option 1 optimization is then easily realized by having all
> IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)

These are IPmc groups not IB mc groups though. I suppose you are asking
for the equivalent function in IB. When that subscribe is done, would it
automatically collapse to 1 MLID ? If that's what you mean, a spec
extension for this could be proposed and carried forward at the (IBTA)
MgtWG. Is there a special value of those 24 bits which is not used (and
could be used to indicate subscribe all) ? Or do you see another way to
indicate this ? There are some reserved bits at the end of
MCMemberRecord which could also be used to indicate this. That's
probably better.

>  as
> full members. This provides interoperability between stacks with
> this feature and stacks without it.
> 
> Option 2 works better as well because all the nodes join
> FF02::1:FF00:0/104 as a send-only member on startup and then you only
> get N*2 multicast records to maintain.
> 
> This also would improve the performance of IPv6 ND by not having to
> join/leave the SN groups for each ND query.
> 
> IBA would have to be changed to support a prefix bits field in the
> MCMemberRecord structure though..

Is a full prefix needed or only 1 bit indicating join all ? If a prefix
is needed, it sounds like it is 24 bits in width. (That appears more
than what is available but I'll look more).

-- Hal

> Jason


From ralph.campbel at qlogic.com  Fri Dec  1 16:27:59 2006
From: ralph.campbel at qlogic.com (Ralph Campbell)
Date: Fri, 1 Dec 2006 16:27:59 -0800 (PST)
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adau00fky3p.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
Message-ID: <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>

>  > Although the driver compiles on 32-bit kernels, it is unsupported
>  > and has never been tested. All known 64-bit systems don't define
>  > CONFIG_HIGHMEM.  In spite of previous emails suggesting that
>  > page_address() can return NULL without CONFIG_HIGHMEM defined,
>  > the code in include/linux/mm.h doesn't allow it (assuming the
>  > page pointer is valid and not some random address).
>  > I verified this with Andrew Morton.
>
> Hmm, is there no way to make this work on 32-bit kernels?  I don't
> want to do something that we'll have to change again if we want to
> make things work on 32-bits.
>
> (And I know that qlogic has no intention of supporting the driver on
> 32-bit kernels, but we shouldn't make it impossible for someone else
> to fix it)

I don't think this is impossible to implement.  I just wanted
to avoid the work unless you and others thought it was really
worth it given the reality that we already have a large
test matrix of platforms, distros, and kernel versions and
it probably won't get much testing.  It is possible that
at some point 32-bit kernels will become a priority
but I don't know when that might happen.


From ralph.campbel at qlogic.com  Fri Dec  1 16:28:37 2006
From: ralph.campbel at qlogic.com (Ralph Campbell)
Date: Fri, 1 Dec 2006 16:28:37 -0800 (PST)
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adapsb3ky1r.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<adapsb3ky1r.fsf@cisco.com>
Message-ID: <40471.71.131.5.186.1165019317.squirrel@rocky.pathscale.com>

> Oh yeah, one other thing...
>
> could you respin this so that all the new dma_xxx wrappers go into a
> new file like <rdma/ib_dma_mapping.h> (and include that from
> <rdma/ib_verbs.h>)?  ib_verbs.h is already too big I think.

Sure, no problem.


From rowland at cse.ohio-state.edu  Fri Dec  1 16:36:56 2006
From: rowland at cse.ohio-state.edu (Shaun Rowland)
Date: Fri, 01 Dec 2006 19:36:56 -0500
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165014871.6872.85.camel@stevo-desktop>
References: <602111.87729.qm@web58007.mail.re3.yahoo.com>
	<1165014871.6872.85.camel@stevo-desktop>
Message-ID: <4570CAA8.5080806@cse.ohio-state.edu>

Steve Wise wrote:
> I haven't tested mvapich2 with ammasso.  But OSU has. I'm CCing their
> dev team so maybe they can help.
> 
> Steve.
> 
> 
> 
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>>
>> I can run rping, rdma_lat etc on the Ammasso card but when I try to
>> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via  
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in
your last email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the
Fedora Core 6 system, or are you using libraries that were built on
SuSE? I assume they were built on Fedora Core 6. Were you trying to run
this as root or as a regular user? I am not sure exactly how this might
affect shared library loading, but it is possible there is a difference.

In our previous discussion, your library path did indeed have a
librdmacm.so file, though it could not be loaded for an unknown reason.
It is unclear to me if this email thread indicates that you have tried
to rebuild that and are experiencing the same issue. Were you able to
try running that test shared library example I gave and did it work? Did
it work as the same user you are trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on
your particular configuration there. That is what cannot find the
library. It is similar to the libtest code I provided as an example:

[rowland at e14-oib libtest]$ ls
Makefile  test.c  test.h  test-program.c

[rowland at e14-oib libtest]$ make normal
gcc -c -fPIC test.c
gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o
ln -s libtest.so.1.0 libtest.so.1
ln -s libtest.so.1 libtest.so
gcc    -c -o test-program.o test-program.c
gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest

[rowland at e14-oib libtest]$ ldd test-program
         libtest.so.1 => not found
         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
./test-program: error while loading shared libraries: libtest.so.1: 
cannot open shared object file: No such file or directory

[rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD

[rowland at e14-oib libtest]$ ldd test-program
         libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 
(0x00002abbf9aee000)
         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
In shared library function...

In previous email your ldd output showed the library was being found:

Please see the output of ldd /usr/local/mvapich2/bin/mpdroot :
[root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot
         linux-gate.so.1 =>  (0xffffe000)
         librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000)
         libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000)
         libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000)
         libc.so.6 => /lib/libc.so.6 (0x00ca7000)
         libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000)
         libdl.so.2 => /lib/libdl.so.2 (0x00de6000)
         libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000)
         /lib/ld-linux.so.2 (0x002d8000)

But that path is different than the one you are quoting above. Does an
ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the
same user you are trying to run it as?

I have one more idea for you to try here. You can do the following:

export LD_DEBUG=all
/root/0.9.8-RELEASE/bin/mpdroot >&output
unset LD_DEBUG

Then take a look at the output file to see if there are any relevant
error messages. Don't forget to unset LD_DEBUG before doing anything else.
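
If reading the LD_DEBUG output is awkward, a tiny test program can ask
the dynamic loader directly and print its reason for failing. This is
only a sketch: try_load is a name made up here, and dlopen() resolves
names in much the same way as program startup does but not identically,
so treat the result as a hint rather than proof.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Try to load a shared library by name using the runtime loader's
 * search rules (LD_LIBRARY_PATH, the ld.so cache, default directories).
 * Returns 0 on success, -1 on failure after printing the loader's
 * own error string. */
static int try_load(const char *soname)
{
	void *handle;

	dlerror();	/* clear any stale error state */
	handle = dlopen(soname, RTLD_NOW);
	if (!handle) {
		const char *err = dlerror();

		fprintf(stderr, "dlopen(%s) failed: %s\n", soname,
			err ? err : "unknown error");
		return -1;
	}
	printf("dlopen(%s) succeeded\n", soname);
	dlclose(handle);
	return 0;
}
```

Calling try_load("librdmacm.so") from a small main(), as the same user
and in the same environment that runs mpdroot, should show whether the
loader can find and open the library. On older glibc this needs to be
linked with -ldl.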

Also, just to be sure, if you run "file <path to librdmacm.so>" what
does it say? It should indicate that it is a shared library, similar to:

[rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so*
/usr/local/ofed/lib64/librdmacm.so:       symbolic link to 
`librdmacm.so.0.9.0'
/usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, 
AMD x86-64, version 1 (SYSV), not stripped

Unfortunately, we do not have any Fedora Core 6 systems to investigate
this problem on at this time, and I don't know anything about what might
be there that would cause a problem. As far as I know, there shouldn't
be. However, it seems there is some runtime issue on your Fedora Core 6
machine or with how this is being run there. If it is in fact working on
another distribution as you indicated in your previous response to us,
then that also strongly points in this direction.
-- 
Shaun Rowland	rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/


From rdreier at cisco.com  Fri Dec  1 17:09:40 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 01 Dec 2006 17:09:40 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	(Ralph Campbell's message of "Fri, 1 Dec 2006 16:27:59 -0800 (PST)")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
Message-ID: <aday7prje7f.fsf@cisco.com>

 > I don't think this is impossible to implement.  I just wanted
 > to avoid the work unless you and others thought it was really
 > worth it given the reality that we already have a large
 > test matrix of platforms, distros, and kernel versions and
 > it probably won't get much testing.  It is possible that
 > at some point 32-bit kernels will become a priority
 > but I don't know when that might happen.

So you think you could do the ib_dma_xxx stuff for ipath without
affecting anything outside of ipath?  (Assuming this is merged of course)
What would be the rough outline of how that would work?


From jgunthorpe at obsidianresearch.com  Fri Dec  1 17:20:54 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Fri, 1 Dec 2006 18:20:54 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165016801.11808.196277.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201231753.GG32366@obsidianresearch.com>
	<1165016801.11808.196277.camel@hal.voltaire.com>
Message-ID: <20061202012054.GH32366@obsidianresearch.com>

On Fri, Dec 01, 2006 at 06:47:38PM -0500, Hal Rosenstock wrote:

> > It isn't in the IB spec, but what would really help here is to be able
> > to join a multicast prefix (more than 1 group with a single entry).
> > 
> > Todd's option 1 optimization is then easily realized by having all
> > IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)
> 
> These are IPmc groups not IB mc groups though. I suppose you are asking
> for the equivalent function in IB. When that subscribe is done,
> would it

Right, it looks like the MGID would be close to
FF1E:601B:xxxx::1FF00:0/104 for IPv6 SN multicast (RFC4391).

My thinking was to add a new 8 bit field to MCMemberRecord called
prefixLen. Broadly (without considering how to manage compatibility)
joins like we have today would set prefixLen to 128. To do this
suggestion we'd set prefixLen to 104.

prefixLen of 104 means 2**24 MGID addresses are matched by the join
and the node is subscribed to them all. The existing MGID field is
used to encode the prefix bits, only the first 104 bits are used.
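
As a sketch of what the matching rule would look like (nothing like
this exists in the IB spec today; the function name and the 16-byte
array layout are assumptions for illustration):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GID_BYTES 16

/* Return true if the first prefix_len bits of gid equal those of
 * prefix. prefix_len is in bits, 0..128; 128 is an exact-match join,
 * which is what joins amount to today. */
static bool mgid_prefix_match(const uint8_t gid[GID_BYTES],
			      const uint8_t prefix[GID_BYTES],
			      unsigned int prefix_len)
{
	unsigned int full_bytes, rem_bits;
	uint8_t mask;

	if (prefix_len > 128)
		return false;

	full_bytes = prefix_len / 8;
	rem_bits = prefix_len % 8;

	/* Compare the whole bytes covered by the prefix. */
	if (memcmp(gid, prefix, full_bytes) != 0)
		return false;
	if (rem_bits == 0)
		return true;

	/* Compare the remaining high-order bits of the next byte. */
	mask = (uint8_t)(0xFF << (8 - rem_bits));
	return (gid[full_bytes] & mask) == (prefix[full_bytes] & mask);
}
```

With prefix_len set to 104 only the first 13 bytes are compared, so all
2**24 values of the low 24 bits match; with 128 it degenerates to a
plain 16-byte comparison.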

Off hand I don't see a way to indicate the length using the existing
record fields.

Call an MCMemberRecord with prefixLen != 128 a prefix join.

MLID mapping is a little tricky in this scheme, since once a prefix
join is registered you have to start unioning membership lists with
other joins to get the right spans (i.e. joins to FF1E::1000/120 and
FF1E::1001/128 may have different MLIDs, but both would reach a
unioned membership).

That would mean that a send-only prefix join MLID would effectively be
a broadcast MLID so if you use it with option 2 you reduce the SM
query rate and subscription load, but you are just broadcasting ND
packets. I guess the sensible use would be with option 1 where both
the send-only and full-membership joins are a /104 prefix join.
[Basically, it ends up using broadcasting like Todd suggested, but in
 a way where the SM can properly integrate IPoIB stacks that
 don't use prefix joins.]

I don't know if this is worth pursuing.. Certainly if the main issue
is just MLID usage then option 2 is much simpler. Something like this
might be part of improving IPv6 ND scalability but that is a different
problem entirely (and does anyone care?)..

Jason


From Thomas.Talpey at netapp.com  Fri Dec  1 18:00:30 2006
From: Thomas.Talpey at netapp.com (Talpey, Thomas)
Date: Fri, 01 Dec 2006 21:00:30 -0500
Subject: [openib-general] NFS/RDMA for Linux: client and server update
 release 7
In-Reply-To: <aday7prky6u.fsf@cisco.com>
References: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com>
	<aday7prky6u.fsf@cisco.com>
Message-ID: <EXNANE01tSgcOrBLSAQ00000180@exnane01.hq.netapp.com>

At 06:12 PM 12/1/2006, Roland Dreier wrote:
>What is the status of moving this code towards merging to the upstream kernel?

For the client there are two main prerequisites, both in the RPC layer
and both in progress. One is the completion of the RPC transport switch
merge, mainly the ability to load as modules. The second is a new mount
syscall api, to allow transport-specific arguments to be passed in. We
have a temporary solution for that at the moment. When these two are
in place, the client is ready to consider merging.

The server actually doesn't have these dependencies, but it does need
to be updated to match the new code in 2.6.19 which raises the maximum
rpc payload size, and some additional hardening/improvements which we
found in code review. We're waiting to complete this work, which hopefully
will be this month.

Bottom line, we can put it on the table soon.

Tom.


From ralph.campbel at qlogic.com  Fri Dec  1 18:08:42 2006
From: ralph.campbel at qlogic.com (Ralph Campbell)
Date: Fri, 1 Dec 2006 18:08:42 -0800 (PST)
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <aday7prje7f.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
Message-ID: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>

>  > I don't think this is impossible to implement.  I just wanted
>  > to avoid the work unless you and others thought it was really
>  > worth it given the reality that we already have a large
>  > test matrix of platforms, distros, and kernel versions and
>  > it probably won't get much testing.  It is possible that
>  > at some point 32-bit kernels will become a priority
>  > but I don't know when that might happen.
>
> So you think you could do the ib_dma_xxx stuff for ipath without
> affecting anything outside of ipath?  (Assuming this is merged of course)
> What would be the rough outline of how that would work?

Basically, use a hash table to store the kmap result.
See attached for 90% of the code.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ipath_dma.c
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061201/d9b23b92/attachment.c>
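
For readers who do not fetch the attachment, the general idea of
remembering a mapping in a hash table so it can be found again at unmap
time can be sketched in plain userspace C. This is not the attached
ipath_dma.c and does not call kmap; the names (map_table, map_store,
map_take), the bucket count, and the choice of key are all assumptions
for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAP_BUCKETS 64

struct map_entry {
	uintptr_t key;		/* handle the unmap path will be given */
	void *mapping;		/* stored result of the map call */
	struct map_entry *next;
};

struct map_table {
	struct map_entry *bucket[MAP_BUCKETS];
};

static unsigned int map_hash(uintptr_t key)
{
	/* Drop low bits that are usually zero for aligned addresses. */
	return (unsigned int)((key >> 4) % MAP_BUCKETS);
}

/* Record a mapping under key. Returns 0 on success, -1 if out of memory. */
static int map_store(struct map_table *t, uintptr_t key, void *mapping)
{
	struct map_entry *e = malloc(sizeof(*e));
	unsigned int h = map_hash(key);

	if (!e)
		return -1;
	e->key = key;
	e->mapping = mapping;
	e->next = t->bucket[h];
	t->bucket[h] = e;
	return 0;
}

/* Remove the entry for key and return its stored mapping,
 * or NULL if no entry exists. */
static void *map_take(struct map_table *t, uintptr_t key)
{
	struct map_entry **pp = &t->bucket[map_hash(key)];

	while (*pp) {
		struct map_entry *e = *pp;

		if (e->key == key) {
			void *mapping = e->mapping;

			*pp = e->next;
			free(e);
			return mapping;
		}
		pp = &e->next;
	}
	return NULL;
}
```

A map path would call map_store() after mapping a page, and the
matching unmap path would call map_take() to recover what has to be
released. A kernel version would also need locking, which is left out
here.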

From elsen_david at yahoo.com  Fri Dec  1 19:07:24 2006
From: elsen_david at yahoo.com (david elsen)
Date: Fri, 1 Dec 2006 19:07:24 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <4570CAA8.5080806@cse.ohio-state.edu>
Message-ID: <837388.34727.qm@web58012.mail.re3.yahoo.com>

Shaun,
   
  It was working on one of my Fedora systems. I tried to do the same installation on my other system, which has SuSE 9.3, and it is not working there.
   
  So I am not sure what is going on with this.
   
  Thanks,
  David
  

Shaun Rowland <rowland at cse.ohio-state.edu> wrote:
  Steve Wise wrote:
> I haven't tested mvapich2 with ammasso. But OSU has. I'm CCing their
> dev team so maybe they can help.
> 
> Steve.
> 
> 
> 
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>>
>> I can run rping, rdma_lat etc on the Ammasso card but when I try to
>> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via 
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in
your last email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the
Fedora Core 6 system, or are you using libraries that were built on
SuSE? I assume they were built on Fedora Core 6. Were you trying to run
this as root or as a regular user? I am not sure exactly how this might
affect shared library loading, but it is possible there is a difference.

In our previous discussion, your library path did indeed have a
librdmacm.so file, though it could not be loaded for an unknown reason.
It is unclear to me if this email thread indicates that you have tried
to rebuild that and are experiencing the same issue. Were you able to
try running that test shared library example I gave and did it work? Did
it work as the same user you are trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on
your particular configuration there. That is what cannot find the
library. It is similar to the libtest code I provided as an example:

[rowland at e14-oib libtest]$ ls
Makefile test.c test.h test-program.c

[rowland at e14-oib libtest]$ make normal
gcc -c -fPIC test.c
gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o
ln -s libtest.so.1.0 libtest.so.1
ln -s libtest.so.1 libtest.so
gcc -c -o test-program.o test-program.c
gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest

[rowland at e14-oib libtest]$ ldd test-program
libtest.so.1 => not found
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
./test-program: error while loading shared libraries: libtest.so.1: 
cannot open shared object file: No such file or directory

[rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD

[rowland at e14-oib libtest]$ ldd test-program
libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 
(0x00002abbf9aee000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
/lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
In shared library function...
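
The effect of LD_LIBRARY_PATH in the session above can be sketched in a
few lines of C. This is only an illustration of the directory search
step, not the real loader: the function name find_in_path_list is made
up for this example, it only checks that a file exists, and it ignores
RPATH, the ld.so cache, the default directories and ELF class checks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: try `soname` in each directory of a
 * colon-separated list, in order, the way LD_LIBRARY_PATH entries are
 * tried. On success the full path is written to `out` and 1 is
 * returned; otherwise `out` is emptied and 0 is returned.
 * Empty entries are skipped here (the real loader treats them as the
 * current directory).
 */
int find_in_path_list(const char *dirs, const char *soname,
                      char *out, size_t outlen)
{
    assert(soname != NULL && out != NULL && outlen > 0);

    while (dirs != NULL && *dirs != '\0') {
        size_t len = strcspn(dirs, ":");   /* length of this entry */

        if (len > 0) {
            int n = snprintf(out, outlen, "%.*s/%s",
                             (int)len, dirs, soname);
            /* Accept only untruncated paths that actually exist. */
            if (n > 0 && (size_t)n < outlen && access(out, F_OK) == 0)
                return 1;
        }
        dirs += len;
        if (*dirs == ':')
            dirs++;
    }
    out[0] = '\0';
    return 0;
}
```

Calling it with getenv("LD_LIBRARY_PATH") and "libtest.so.1" before and
after the export would mirror the two ldd results: no match while the
variable is unset, and a match once it points at the build directory.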

In a previous email, your ldd output showed the library was being found:

Please see the output of ldd /usr/local/mvapich2/bin/mpdroot :
[root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot
linux-gate.so.1 => (0xffffe000)
librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000)
libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000)
libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000)
libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000)
libc.so.6 => /lib/libc.so.6 (0x00ca7000)
libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000)
libdl.so.2 => /lib/libdl.so.2 (0x00de6000)
libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000)
/lib/ld-linux.so.2 (0x002d8000)

But that path is different than the one you are quoting above. Does an
ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the
same user you are trying to run it as?

I have one more idea for you to try here. You can do the following:

export LD_DEBUG=all
/root/0.9.8-RELEASE/bin/mpdroot >&output
unset LD_DEBUG

Then take a look at the output file to see if there are any relevant
error messages. Don't forget to unset LD_DEBUG before doing anything else.

Also, just to be sure, if you run "file" on your librdmacm.so, what
does it say? It should indicate that it is a shared library, similar to:

[rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so*
/usr/local/ofed/lib64/librdmacm.so: symbolic link to 
`librdmacm.so.0.9.0'
/usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, 
AMD x86-64, version 1 (SYSV), not stripped

Unfortunately, we do not have any Fedora Core 6 systems to investigate
this problem on at this time, and I don't know anything about what might
be there that would cause a problem. As far as I know, there shouldn't
be. However, it seems there is some runtime issue on your Fedora Core 6
machine or with how this is being run there. If it is in fact working on
another distribution as you indicated in your previous response to us,
then that also strongly points in this direction.
-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/



From halr at voltaire.com  Sat Dec  2 04:27:37 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Dec 2006 07:27:37 -0500
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <20061202012054.GH32366@obsidianresearch.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201231753.GG32366@obsidianresearch.com>
	<1165016801.11808.196277.camel@hal.voltaire.com>
	<20061202012054.GH32366@obsidianresearch.com>
Message-ID: <1165062394.11808.222794.camel@hal.voltaire.com>

On Fri, 2006-12-01 at 20:20, Jason Gunthorpe wrote:
> On Fri, Dec 01, 2006 at 06:47:38PM -0500, Hal Rosenstock wrote:
> 
> > > It isn't in the IB spec, but what would really help here is to be able
> > > to join a multicast prefix (more than 1 group with a single entry).
> > > 
> > > Todd's option 1 optimization is then easily realized by having all
> > > IPv6 nodes join FF02::1:FF00:0/104 (all 2**24 multicast entries)
> > 
> > These are IPmc groups not IB mc groups though. I suppose you are asking
> > for the equivalent function in IB. When that subscribe is done,
> > would it
> 
> Right, it looks like the MGID would be close to
> FF1E:601B:xxxx::1FF00:0/104 for IPv6 SN multicast (RFC4391).
> 
> My thinking was to add a new 8 bit field to MCMemberRecord called
> prefixLen. Broadly (without considering how to manage compatibility)
> joins like we have today would set prefixLen to 128. To do this
> suggestion we'd set prefixLen to 104.
> 
> prefixLen of 104 means 2**24 MGID addresses are matched by the join
> and the node is subscribed to them all. The existing MGID field is
> used to encode the prefix bits, only the first 104 bits are used.
> 
> Off hand I don't see a way to indicate the length using the existing
> record fields.

Another 8 bit field could do this if this were needed.

> If we call a MCMemberRecord with a prefixLen != 128 a prefix join..

And hence the backward compatibility issue. One way to handle this would
be an exception: if the component mask does not specify PrefixLength,
then rather than being wildcarded, it is assumed to be 128. There may be
other ways.
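
A minimal sketch of the matching rule being discussed, under stated
assumptions: prefixLen is only a proposal in this thread, so the struct
mc_join layout, the has_prefix_len flag (standing in for a component
mask bit) and both function names are hypothetical and are not part of
any SA record or OpenSM API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MGID_BITS 128

/* Hypothetical record: the existing 128-bit MGID plus the proposed field. */
struct mc_join {
    uint8_t mgid[16];
    uint8_t prefix_len;      /* 128 = exact join, as today */
    int     has_prefix_len;  /* stands in for a component-mask bit */
};

/* Joins that do not specify a prefix length are treated as /128. */
unsigned effective_prefix_len(const struct mc_join *j)
{
    return j->has_prefix_len ? j->prefix_len : MGID_BITS;
}

/* Return 1 if the leading prefix bits of mgid equal those of the join. */
int join_matches(const struct mc_join *j, const uint8_t mgid[16])
{
    unsigned bits = effective_prefix_len(j);
    unsigned full = bits / 8;   /* whole bytes covered by the prefix */
    unsigned rem = bits % 8;    /* leftover bits in the next byte */

    assert(bits <= MGID_BITS);
    if (memcmp(j->mgid, mgid, full) != 0)
        return 0;
    if (rem == 0)
        return 1;

    uint8_t mask = (uint8_t)(0xFF << (8 - rem));
    return (j->mgid[full] & mask) == (mgid[full] & mask);
}
```

With prefix_len set to 104, only the first 13 bytes are compared, so one
join covers all 2**24 MGIDs that share them. An old-style join that
never sets the field still behaves as an exact 128-bit match.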

> MLID mapping is a little tricky in this scheme since once a prefix
> join is registered you have to start unioning membership lists with
> other joins to get the right spans. (ie joins to FF1E::1000/120 and
> FF1E::1001/128 may have different MLIDs but they would both reach a
> unioned membership)
> 
> That would mean that a send-only prefix join MLID would effectively be
> a broadcast MLID so if you use it with option 2 you reduce the SM
> query rate and subscription load, but you are just broadcasting ND
> packets. I guess the sensible use would be with option 1 where both
> the send-only and full-membership join are a /104 prefix join.
> [Basically, it ends up using broadcasting like Todd suggested, but in
>  a way where the SM can properly integrate IPoIB stacks that
>  don't use prefix joins.]
> 
> I don't know if this is worth pursuing.. Certainly if the main issue
> is just MLID usage then option 2 is much simpler. Something like this
> might be part of improving IPv6 ND scalability but that is a different
> problem entirely (and does anyone care?)..

Is that only an IPoIB issue though or is it more generic and apply to
other networks ?

-- Hal

> Jason


From eitan at mellanox.co.il  Sat Dec  2 07:51:53 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 02 Dec 2006 17:51:53 +0200
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <1165005117.11808.189660.camel@hal.voltaire.com>
References: <1165005117.11808.189660.camel@hal.voltaire.com>
Message-ID: <4571A119.9040408@mellanox.co.il>

Hi Hal,

I see you are doing some work on optimizing the locking scheme in the
multicast registration flow (join and leave).

What kind of testing do you do?
In the simulated environment we do not currently have a test that will
fire pairs of join/leave or join/leave/join and verify correctness.
Maybe we should have one written.

Eitan

Hal Rosenstock wrote:
> OpenSM/osm_sa_mcmember_record.c: In __osm_mcmr_rcv_leave_mgrp, eliminate
> unneeded lock acquisition
>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
>
> diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
> index f7f879b..d6c6968 100644
> --- a/osm/opensm/osm_sa_mcmember_record.c
> +++ b/osm/opensm/osm_sa_mcmember_record.c
> @@ -1459,6 +1459,8 @@ __osm_mcmr_rcv_leave_mgrp(
>            new_join_state | (p_mcm_port->scope_state & 0xf0);
>  
>          mcmember_rec.scope_state = p_mcm_port->scope_state;
> +
> +        CL_PLOCK_RELEASE( p_rcv->p_lock );
>        }
>        else
>        {
> @@ -1475,10 +1477,6 @@ __osm_mcmr_rcv_leave_mgrp(
>                     "__osm_mcmr_rcv_leave_mgrp: ERR 1B09: "
>                     "osm_sm_mcgrp_leave failed\n" );
>          }
> -
> -        CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock);
> -        /* Note: The deletion of the mgrp itself will be done in the callback
> -           for the multicast tree updating (osm_mcast_mgr_process_mgrp_cb) */
>        }
>      }
>      else
> @@ -1511,8 +1509,6 @@ __osm_mcmr_rcv_leave_mgrp(
>      goto Exit;
>    }
>  
> -  CL_PLOCK_RELEASE( p_rcv->p_lock );
> -
>    /* Send an SA response */
>    __osm_mcmr_rcv_respond( p_rcv, p_madw, &mcmember_rec );
>  
>
>
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From halr at voltaire.com  Sat Dec  2 08:01:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Dec 2006 11:01:23 -0500
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <4571A119.9040408@mellanox.co.il>
References: <1165005117.11808.189660.camel@hal.voltaire.com>
	<4571A119.9040408@mellanox.co.il>
Message-ID: <1165075257.11808.230252.camel@hal.voltaire.com>

Hi Eitan,

On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote:
> Hi Hal,
> 
> I see you are doing some work on optimizing the locking scheme in the 
> multicast registration flow
>  (join and leave).

Yes and it goes further than this. Additional patch(es) will be coming.

So does this look OK to you ?

> What kind of testing do you do?

Two fold:
1. Tested in another simulated environment
2. Tested in a large cluster where there is a larger join/leave race
issue which started us down the road looking at these code paths more

> In the simulated environment we do not have currently a test that will 
> fire pairs of join/leave or
> join/leave/join and verify correctness.Maybe we should have one written.

Sure; you are welcome to add one. I don't have time to do this now.

-- Hal

> Eitan


From eitan at mellanox.co.il  Sat Dec  2 08:13:27 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 02 Dec 2006 18:13:27 +0200
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <1165075257.11808.230252.camel@hal.voltaire.com>
References: <1165005117.11808.189660.camel@hal.voltaire.com>
	<4571A119.9040408@mellanox.co.il>
	<1165075257.11808.230252.camel@hal.voltaire.com>
Message-ID: <4571A627.8090501@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote:
>   
>> Hi Hal,
>>
>> I see you are doing some work on optimizing the locking scheme in the 
>> multicast registration flow
>>  (join and leave).
>>     
>
> Yes and it goes further than this. Additional patch(es) will be coming.
>
> So does this look OK to you ?
>
>

I hope Yevgeny will be able to review the entire flow next week.

>> What kind of testing do you do?
>>     
>
> Two fold:
> 1. Tested in another simulated environment
> 2. Tested in a large cluster where there is a larger join/leave race
> issue which started us down the road looking at these code paths more
>
>   
>> In the simulated environment we do not have currently a test that will 
>> fire pairs of join/leave or
>> join/leave/join and verify correctness.Maybe we should have one written.
>>     
>
> Sure; you are welcome to add one. I don't have time to do this now.
>

I will try and get to that next week. I will let you know when it is
available.

> -- Hal
>
>   
>> Eitan


From halr at voltaire.com  Sat Dec  2 11:29:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 02 Dec 2006 14:29:57 -0500
Subject: [openib-general] OpenSM/osm_sa_mcmember_record.c: In
 __osm_mcmr_rcv_leave_mgrp, eliminate unneeded lock acquisition
In-Reply-To: <4571A627.8090501@mellanox.co.il>
References: <1165005117.11808.189660.camel@hal.voltaire.com>
	<4571A119.9040408@mellanox.co.il>
	<1165075257.11808.230252.camel@hal.voltaire.com>
	<4571A627.8090501@mellanox.co.il>
Message-ID: <1165087712.11808.237780.camel@hal.voltaire.com>

On Sat, 2006-12-02 at 11:13, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sat, 2006-12-02 at 10:51, Eitan Zahavi wrote:
> >   
> >> Hi Hal,
> >>
> >> I see you are doing some work on optimizing the locking scheme in the 
> >> multicast registration flow
> >>  (join and leave).
> >>     
> >
> > Yes and it goes further than this. Additional patch(es) will be coming.
> >
> > So does this look OK to you ?
> >
> >   
> I hope Yevgeny will be able to review the entire flow next week.
> >> What kind of testing do you do?
> >>     
> >
> > Two fold:
> > 1. Tested in another simulated environment
> > 2. Tested in a large cluster where there is a larger join/leave race
> > issue which started us down the road looking at these code paths more

Also, the multicast flows in osmtest. (I forgot to mention those).

> >   
> >> In the simulated environment we do not have currently a test that will 
> >> fire pairs of join/leave or
> >> join/leave/join and verify correctness.Maybe we should have one written.
> >>     
> >
> > Sure; you are welcome to add one. I don't have time to do this now.
> >   
> I will try and get to that next week. I will let you know when it is 
> available.

Great; Thanks.

-- Hal

> > -- Hal
> >
> >   
> >> Eitan


From jgunthorpe at obsidianresearch.com  Sat Dec  2 11:57:45 2006
From: jgunthorpe at obsidianresearch.com (Jason Gunthorpe)
Date: Sat, 2 Dec 2006 12:57:45 -0700
Subject: [openib-general] IPv6 and IPoIB scalability issue
In-Reply-To: <1165062394.11808.222794.camel@hal.voltaire.com>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E2F2@EPEXCH2.qlogic.org>
	<1165003608.11808.188882.camel@hal.voltaire.com>
	<20061201231753.GG32366@obsidianresearch.com>
	<1165016801.11808.196277.camel@hal.voltaire.com>
	<20061202012054.GH32366@obsidianresearch.com>
	<1165062394.11808.222794.camel@hal.voltaire.com>
Message-ID: <20061202195745.GA19174@obsidianresearch.com>

On Sat, Dec 02, 2006 at 07:27:37AM -0500, Hal Rosenstock wrote:

> > I don't know if this is worth pursuing.. Certainly if the main issue
> > is just MLID usage then option 2 is much simpler. Something like this
> > might be part of improving IPv6 ND scalability but that is a different
> > problem entirely (and does anyone care?)..
> 
> Is that only an IPoIB issue though or is it more generic and apply to
> other networks ?

If the IB router spec goes down the path of pushing a lot of the
responsibility onto the routers then routers will have similar
problems with joining/tracking a large number of groups. A prefix join
concept might be part of improving that.

I'm not sure what other protocols use extensive multicast like IPv6,
but multicast use on local segments is definitely becoming more common.

Is anyone worried about IPv4 broadcast ARP scalability? With RDMA CM
and MPI all-to-all is that going to be a problem? IPv6 SN is a
solution to that ..

Jason


From surs at cse.ohio-state.edu  Sat Dec  2 13:34:56 2006
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sat, 2 Dec 2006 16:34:56 -0500
Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in
	"send_lat"
Message-ID: <20061202213454.GB31661@cse.ohio-state.edu>

Hi,

I have a question about the "status" field for a completion which is due
to RNR retry exceeded error. I trivially modified the `send_lat' program
(from the Gen2 perftest directory) to use SRQ and not post receives
after some specified time. Since the "rnr_retry" attribute of the QP is
not 7 (infinite retry), I'm expecting the sender to get an error
completion with IBV_WC_RNR_RETRY_EXC_ERR.

So far so good ... however, the completion I pull out of the send_cq,
lists the opcode of the completion to be IBV_WC_RECV! Is this expected?

I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs
(two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant
Update 3), with kernel version 2.6.17.7.

If someone could explain this behavior, or suggest a workaround, it'd be
great.

TIA,
Sayantan.

=======


<--Print out at client-->
Send Completion wth error at client:
wc.status 13, IBV_WC_RNR_RETRY_EXC_ERR 13, wc.opcode 128
Failed status 13: wr_id 1
scnt=26, rcnt=25, ccnt=0
<--Print out-->


<--Poll CQ snippet-->
            /* poll on scq */
            do {
                ne = ibv_poll_cq(ctx->scq, 1, &wc);
            } while (!user_param->use_event && ne < 1);

            if (ne < 0) {
                fprintf(stderr, "poll SCQ failed %d\n", ne);
                return 12;
            }
            if (wc.status != IBV_WC_SUCCESS) {
                fprintf(stderr, "Send Completion wth error at %s:\n",
                    user_param->servername ? "client" : "server");
                fprintf(stderr, "wc.status %d, IBV_WC_RNR_RETRY_EXC_ERR %d, "
                        "wc.opcode %d\n",
                        wc.status, IBV_WC_RNR_RETRY_EXC_ERR, wc.opcode);
                fprintf(stderr, "Failed status %d: wr_id %d\n",
                    wc.status, (int) wc.wr_id);
                fprintf(stderr, "scnt=%d, rcnt=%d, ccnt=%d\n",
                    scnt, rcnt, ccnt);

                {
                   ...
<--Poll CQ snippet-->


-- 
http://www.cse.ohio-state.edu/~surs
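
A possible reading of the output above, offered as an assumption rather
than a confirmed diagnosis for this HCA: the current ibv_poll_cq(3) man
page says that when wc.status is not IBV_WC_SUCCESS, only wr_id, status,
qp_num and vendor_err are valid, so the opcode of a failed completion
should not be interpreted. The sketch below shows that handling with a
trimmed-down stand-in for the completion. struct wc_view, the enum names
and describe_wc are hypothetical; the numeric values 13 and 128 are the
ones printed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the libibverbs enums, using the values printed above. */
enum { WC_SUCCESS = 0, WC_RNR_RETRY_EXC_ERR = 13 };
enum { WC_OPCODE_RECV = 128 };

/* Hypothetical, trimmed-down view of a work completion. */
struct wc_view {
    int status;
    int opcode;
    unsigned long long wr_id;
};

/* Format a completion, reporting the opcode only when it is meaningful. */
int describe_wc(const struct wc_view *wc, char *buf, size_t len)
{
    assert(wc != NULL && buf != NULL && len > 0);

    if (wc->status == WC_SUCCESS)
        return snprintf(buf, len, "wr_id %llu ok, opcode %d",
                        wc->wr_id, wc->opcode);

    /* On error, identify the request by wr_id and ignore the opcode. */
    return snprintf(buf, len,
                    "wr_id %llu failed, status %d (opcode not valid)",
                    wc->wr_id, wc->status);
}
```

If that reading applies here, a workaround is to encode whatever the
application needs (send versus receive, buffer index) in wr_id when
posting, and to look only at wr_id and status on the error path.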


From swise at opengridcomputing.com  Sat Dec  2 14:49:17 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:17 -0600
Subject: [openib-general] [PATCH  v2 00/13] 2.6.20 Chelsio T3 RDMA Driver
Message-ID: <20061202224917.27014.15424.stgit@dell3.ogc.int>


Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
Ethernet Driver which is also under review now for 2.6.20. See:

http://www.mail-archive.com/netdev at vger.kernel.org/msg26619.html

The patches are against 2.6.19.

This patch series can also be pulled from:

	http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v2.tar.bz2

The Chelsio T3 Ethernet Driver patch can be pulled from:

	http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

	git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.


From swise at opengridcomputing.com  Sat Dec  2 14:49:27 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:27 -0600
Subject: [openib-general] [PATCH  v2 01/13] Linux RDMA Core Changes
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202224927.27014.24669.stgit@dell3.ogc.int>


Support provider-specific data in ib_uverbs_req_notify_cq().
The Chelsio iWARP provider library needs to pass information to the
kernel verb for re-arming the CQ.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/core/uverbs_cmd.c      |    9 +++++++--
 drivers/infiniband/hw/amso1100/c2.h       |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c    |    3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c    |    3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c    |    4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c    |    6 ++++--
 drivers/infiniband/hw/mthca/mthca_dev.h   |    4 ++--
 include/rdma/ib_verbs.h                   |    5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 				int out_len)
 {
 	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_udata		      udata;
 	struct ib_cq                  *cq;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 	if (!cq)
 		return -EINVAL;
 
-	ib_req_notify_cq(cq, cmd.solicited_only ?
-			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+	INIT_UDATA(&udata, buf + sizeof cmd, 0,
+		   in_len - sizeof cmd, 0); 
+
+	cq->device->req_notify_cq(cq, cmd.solicited_only ?
+				  IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+				  &udata);
 
 	put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 1b17dcd..716f9dc 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+	      struct ib_udata *udata)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: user data 
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 8039f6e..0d39960 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 149b369..ec7bb79 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -723,7 +723,8 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
 {
 	__be32 doorbell[2];
 
@@ -740,7 +741,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index fe5cecf..6b9ccf6 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8eacc35..e3e1a2c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -941,7 +941,8 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify cq_notify,
+						    struct ib_udata *udata);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, cq_notify, NULL);
 }
 
 /**
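
The calling convention this patch introduces can be illustrated outside the kernel. The sketch below is a user-space approximation, not kernel code: struct udata, struct device_ops, demo_arm_cq() and demo_ops are hypothetical stand-ins for ib_udata, the ib_device method table and a provider's arm function. It assumes only what the hunks above show: the device method gains a third argument, and the in-kernel wrapper keeps its two-argument form and passes NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct udata { unsigned int rptr; };
struct cq;

struct device_ops {
	int (*req_notify_cq)(struct cq *cq, int notify, struct udata *udata);
};

struct cq {
	const struct device_ops *ops;
	unsigned int rptr;
};

/* Provider method: uses udata only when a user-space caller supplied it. */
static int demo_arm_cq(struct cq *cq, int notify, struct udata *udata)
{
	(void)notify;
	if (udata)
		cq->rptr = udata->rptr;
	return 0;
}

/* Kernel-consumer wrapper: signature unchanged, NULL means "no user data". */
static int req_notify_cq(struct cq *cq, int notify)
{
	assert(cq->ops && cq->ops->req_notify_cq);
	return cq->ops->req_notify_cq(cq, notify, NULL);
}

static const struct device_ops demo_ops = { .req_notify_cq = demo_arm_cq };
```

Existing kernel callers of the wrapper need no change under this scheme; only a path that has user data to pass would call the three-argument method directly.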


From swise at opengridcomputing.com  Sat Dec  2 14:49:37 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:37 -0600
Subject: [openib-general] [PATCH v2 02/13] Device Discovery and ULLD Linkage
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202224937.27014.951.stgit@dell3.ogc.int>


Code to discover all the T3 devices and register them 
with the T3 RDMA Core and the Linux RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch.c |  189 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +++++++++++++++++++++++++++++++++
 2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 0000000..acbe449
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+#include "iwch_user.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+
+#define DRV_VERSION "1.1"
+
+MODULE_AUTHOR("Boyd Faulkner, Steve Wise");
+MODULE_DESCRIPTION("Chelsio T3 RDMA Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+
+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+
+static void open_rnic_dev(struct t3cdev *);
+static void close_rnic_dev(struct t3cdev *);
+
+struct cxgb3_client t3c_client = {
+	.name = "iw_cxgb3",
+	.add = open_rnic_dev,
+	.remove = close_rnic_dev,
+	.handlers = t3c_handlers,
+	.redirect = iwch_ep_redirect
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_mutex);
+
+static void rnic_init(struct iwch_dev *rnicp)
+{
+	PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
+	idr_init(&rnicp->cqidr);
+	idr_init(&rnicp->qpidr);
+	idr_init(&rnicp->mmidr);
+	spin_lock_init(&rnicp->lock);
+
+	rnicp->attr.vendor_id = 0x168;
+	rnicp->attr.vendor_part_id = 7;
+	rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
+	rnicp->attr.max_wrs = (1UL << 24) - 1;
+	rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
+	rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
+	rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
+	rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
+	rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev);
+	rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
+	rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
+	rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;	/* 4KB-128MB */
+	rnicp->attr.can_resize_wq = 0;
+	rnicp->attr.max_rdma_reads_per_qp = 8;
+	rnicp->attr.max_rdma_read_resources =
+	    rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
+	rnicp->attr.max_rdma_read_qp_depth = 8;	/* IRD */
+	rnicp->attr.max_rdma_read_depth =
+	    rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
+	rnicp->attr.rq_overflow_handled = 0;
+	rnicp->attr.can_modify_ird = 0;
+	rnicp->attr.can_modify_ord = 0;
+	rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
+	rnicp->attr.stag0_value = 1;
+	rnicp->attr.zbva_support = 1;
+	rnicp->attr.local_invalidate_fence = 1;
+	rnicp->attr.cq_overflow_detection = 1;
+	return;
+}
+
+static void open_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *rnicp;
+	static int vers_printed;
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	if (!vers_printed++) 
+		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+		       DRV_VERSION);
+	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
+	if (!rnicp) {
+		printk(KERN_ERR MOD "Cannot allocate ib device\n");
+		return;
+	}
+	rnicp->rdev.ulp = rnicp;
+	rnicp->rdev.t3cdev_p = tdev;
+
+	if (cxio_rdev_open(&rnicp->rdev)) {
+		printk(KERN_ERR MOD "Unable to open CXIO rdev\n");
+		ib_dealloc_device(&rnicp->ibdev);
+		return;
+	}
+
+	rnic_init(rnicp);
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&rnicp->entry, &dev_list);
+	mutex_unlock(&dev_mutex);
+
+	if (iwch_register_device(rnicp)) {
+		printk(KERN_ERR MOD "Unable to register device\n");
+		close_rnic_dev(tdev);
+		return;
+	}
+	printk(KERN_INFO MOD "Initialized device %s\n",
+	       pci_name(rnicp->rdev.rnic_info.pdev));
+}
+
+static void close_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *dev, *tmp;
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	mutex_lock(&dev_mutex);
+	list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
+		if (dev->rdev.t3cdev_p == tdev) {
+			list_del(&dev->entry);
+			iwch_unregister_device(dev);
+			cxio_rdev_close(&dev->rdev);
+			idr_destroy(&dev->cqidr);
+			idr_destroy(&dev->qpidr);
+			idr_destroy(&dev->mmidr);
+			ib_dealloc_device(&dev->ibdev);
+			break;
+		}
+	}
+	mutex_unlock(&dev_mutex);
+}
+
+extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb);
+
+static int __init iwch_init_module(void)
+{
+	int err;
+
+	err = cxio_hal_init();
+	if (err) 
+		return err;
+	err = iwch_cm_init();
+	if (err) 
+		return err;
+	cxio_register_ev_cb(iwch_ev_dispatch);
+	cxgb3_register_client(&t3c_client);
+	return 0;
+}
+
+static void __exit iwch_exit_module(void)
+{
+	cxgb3_unregister_client(&t3c_client);
+	cxio_unregister_ev_cb(iwch_ev_dispatch);
+	iwch_cm_term();
+	cxio_hal_exit();
+}
+
+module_init(iwch_init_module);
+module_exit(iwch_exit_module);
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
new file mode 100644
index 0000000..411bfcd
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -0,0 +1,175 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_H__
+#define __IWCH_H__
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/idr.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+
+struct iwch_pd;
+struct iwch_cq;
+struct iwch_qp;
+struct iwch_mr;
+
+struct iwch_rnic_attributes {
+	u32 vendor_id;
+	u32 vendor_part_id;
+	u32 max_qps;
+	u32 max_wrs;				/* Max for any SQ/RQ */
+	u32 max_sge_per_wr;
+	u32 max_sge_per_rdma_write_wr;	/* for RDMA Write WR */
+	u32 max_cqs;
+	u32 max_cqes_per_cq;
+	u32 max_mem_regs;
+	u32 max_phys_buf_entries;		/* for phys buf list */
+	u32 max_pds;
+
+	/*
+	 * The memory page sizes supported by this RNIC.
+	 * Bit position i in the bitmask indicates support for pages of
+	 * size 4KB << i.  Phys block list mode unsupported.
+	 */
+	u32 mem_pgsizes_bitmask;
+	u8 can_resize_wq;
+
+	/*
+	 * The maximum number of RDMA Reads that can be outstanding 
+	 * per QP with this RNIC as the target. 
+	 */
+	u32 max_rdma_reads_per_qp;
+
+	/*
+	 * The maximum number of resources used for RDMA Reads
+	 * by this RNIC with this RNIC as the target. 
+	 */
+	u32 max_rdma_read_resources;
+
+	/*
+	 * The max depth per QP for initiation of RDMA Read
+	 * by this RNIC.  
+	 */
+	u32 max_rdma_read_qp_depth;
+
+	/*
+	 * The maximum depth for initiation of RDMA Read 
+	 * operations by this RNIC on all QPs 
+	 */
+	u32 max_rdma_read_depth;
+	u8 rq_overflow_handled;
+	u32 can_modify_ird;
+	u32 can_modify_ord;
+	u32 max_mem_windows;
+	u32 stag0_value;
+	u8 zbva_support;
+	u8 local_invalidate_fence;
+	u32 cq_overflow_detection;
+};
+
+struct iwch_dev {
+	struct ib_device ibdev;
+	struct cxio_rdev rdev;
+	u32 device_cap_flags;
+	struct iwch_rnic_attributes attr;
+	struct idr cqidr;
+	struct idr qpidr;
+	struct idr mmidr;
+	spinlock_t lock;
+	struct list_head entry;
+};
+
+static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct iwch_dev, ibdev);
+}
+
+static inline int t3b_device(struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3B);
+}
+
+static inline int t3a_device(struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3A);
+}
+
+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid)
+{
+	return idr_find(&rhp->cqidr, cqid);
+}
+
+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid)
+{
+	return idr_find(&rhp->qpidr, qpid);
+}
+
+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid)
+{
+	return idr_find(&rhp->mmidr, mmid);
+}
+
+static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, 
+				void *handle, u32 id)
+{
+	int ret;
+	u32 newid;
+
+	do {
+		if (!idr_pre_get(idr, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		spin_lock_irq(&rhp->lock);
+		ret = idr_get_new_above(idr, handle, id, &newid);
+		BUG_ON(!ret && newid != id);
+		spin_unlock_irq(&rhp->lock);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
+{
+	spin_lock_irq(&rhp->lock);
+	idr_remove(idr, id);
+	spin_unlock_irq(&rhp->lock);
+}
+
+extern struct cxgb3_client t3c_client;
+extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+#endif
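
The add/remove linkage in iwch.c above (one per-device structure created in the "add" callback, then located by its lower-level handle and torn down in the "remove" callback) can be sketched in user space. This is only an illustration under stated assumptions: struct lower_dev, struct rnic, demo_open(), demo_close() and demo_count() are hypothetical names, a plain singly linked list replaces the kernel list_head, and the mutex the driver holds around list updates is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the lower-level device handle (t3cdev). */
struct lower_dev { int id; };

/* Hypothetical per-device state, one per discovered lower device. */
struct rnic {
	struct lower_dev *lower;
	struct rnic *next;
};

static struct rnic *dev_list;	/* the driver guards its list with a mutex */

/* "add" callback: allocate state and track it. Returns 0, or -1 on OOM. */
static int demo_open(struct lower_dev *ldev)
{
	struct rnic *r;

	assert(ldev != NULL);
	r = calloc(1, sizeof(*r));
	if (!r)
		return -1;
	r->lower = ldev;
	r->next = dev_list;
	dev_list = r;
	return 0;
}

/* "remove" callback: find by handle, unlink, free. Returns 1 if found. */
static int demo_close(struct lower_dev *ldev)
{
	struct rnic **pp;

	for (pp = &dev_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->lower == ldev) {
			struct rnic *victim = *pp;

			*pp = victim->next;
			free(victim);
			return 1;
		}
	}
	return 0;
}

/* Number of devices currently tracked. */
static int demo_count(void)
{
	int n = 0;
	struct rnic *r;

	for (r = dev_list; r; r = r->next)
		n++;
	return n;
}
```

Once demo_close() has returned, the caller holds no valid pointer to the removed entry, so an error path that tears a device down has to return without touching it again.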


From swise at opengridcomputing.com  Sat Dec  2 14:49:47 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:47 -0600
Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data
	Structures
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202224947.27014.59189.stgit@dell3.ogc.int>


Provider methods to support the Linux RDMA verbs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c | 1170 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  362 ++++++++
 drivers/infiniband/hw/cxgb3/iwch_user.h     |   68 ++
 3 files changed, 1600 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
new file mode 100644
index 0000000..4bef081
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -0,0 +1,1170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ethtool.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include <cxio_hal.h>
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+#include "iwch_user.h"
+
+static int iwch_modify_port(struct ib_device *ibdev,
+			    u8 port, int port_modify_mask,
+			    struct ib_port_modify *props)
+{
+	return -ENOSYS;
+}
+
+static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
+				    struct ib_ah_attr *ah_attr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static int iwch_ah_destroy(struct ib_ah *ah)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_process_mad(struct ib_device *ibdev,
+			    int mad_flags,
+			    u8 port_num,
+			    struct ib_wc *in_wc,
+			    struct ib_grh *in_grh,
+			    struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	return -ENOSYS;
+}
+
+static int iwch_dealloc_ucontext(struct ib_ucontext *context)
+{
+	struct iwch_dev *rhp = to_iwch_dev(context->device);
+	struct iwch_ucontext *ucontext = to_iwch_ucontext(context);
+	PDBG("%s context %p\n", __FUNCTION__, context);
+	cxio_release_ucontext(&rhp->rdev, &ucontext->uctx);
+	kfree(ucontext);
+	return 0;
+}
+
+static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev,
+					struct ib_udata *udata)
+{
+	struct iwch_ucontext *context;
+	struct iwch_dev *rhp = to_iwch_dev(ibdev);
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+	cxio_init_ucontext(&rhp->rdev, &context->uctx);
+	INIT_LIST_HEAD(&context->mmaps);
+	return &context->ibucontext;
+}
+
+static int iwch_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct iwch_cq *chp;
+
+	PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq);
+	chp = to_iwch_cq(ib_cq);
+
+	remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid);
+	atomic_dec(&chp->refcnt);
+	wait_event(chp->wait, !atomic_read(&chp->refcnt));
+
+	cxio_destroy_cq(&chp->rhp->rdev, &chp->cq);
+	kfree(chp);
+	return 0;
+}
+
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+			     struct ib_ucontext *context,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	struct iwch_create_cq_resp uresp;
+
+	PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries);
+	rhp = to_iwch_dev(ibdev);
+	chp = kzalloc(sizeof(*chp), GFP_KERNEL);
+	if (!chp)
+		return ERR_PTR(-ENOMEM);
+
+	if (t3a_device(rhp)) {
+
+		/*
+		 * T3A: Add some headroom to handle extra CQEs inserted
+		 * for various errors.
+		 * Additional CQE possibilities:
+		 *      TERMINATE,
+		 *      incoming RDMA WRITE failures,
+		 *      incoming RDMA READ REQUEST failures.
+		 * NOTE: We cannot ensure the CQ won't overflow.
+		 */
+		entries += 16; 
+	}
+	entries = roundup_pow_of_two(entries);
+	chp->cq.size_log2 = long_log2(entries);
+
+	if (cxio_create_cq(&rhp->rdev, &chp->cq)) {
+		kfree(chp);
+		return ERR_PTR(-ENOMEM);
+	}
+	chp->rhp = rhp;
+	chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1;
+	spin_lock_init(&chp->lock);
+	atomic_set(&chp->refcnt, 1);
+	init_waitqueue_head(&chp->wait);
+	insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid);
+
+	if (context) {
+		struct iwch_mm_entry *mm;
+
+		mm = kmalloc(sizeof *mm, GFP_KERNEL);
+		if (!mm) {
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-ENOMEM);
+		}
+		uresp.cqid = chp->cq.cqid;
+		uresp.size_log2 = chp->cq.size_log2;
+		uresp.physaddr = virt_to_phys(chp->cq.queue);
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm);
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-EFAULT);
+		}
+		mm->addr = uresp.physaddr;
+		mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * 
+					     sizeof (struct t3_cqe));
+		insert_mmap(to_iwch_ucontext(context), mm);
+	}
+	PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n",
+	     chp->cq.cqid, chp, (1 << chp->cq.size_log2), 
+	     (u64)chp->cq.dma_addr);
+	return &chp->ibcq;
+}
+
+static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
+{
+	struct iwch_cq *chp = to_iwch_cq(cq);
+	struct t3_cq oldcq, newcq;
+	int ret;
+
+	PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe);
+
+	/* We don't downsize... */
+	if (cqe <= cq->cqe)
+		return 0;
+
+	/* create new t3_cq with new size */
+	cqe = roundup_pow_of_two(cqe+1);
+	newcq.size_log2 = long_log2(cqe);
+
+	/* Don't allow resize to less than the current CQE count */
+	if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) {
+		return -ENOMEM;
+	}
+
+	/* Quiesce all QPs using this CQ */
+	ret = iwch_quiesce_qps(chp);
+	if (ret) {
+		return ret;
+	}
+
+	ret = cxio_create_cq(&chp->rhp->rdev, &newcq);
+	if (ret) {
+		iwch_resume_qps(chp);
+		return ret;
+	}
+	
+	/* copy CQEs */
+	memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * 
+				        sizeof(struct t3_cqe));
+
+	/* old iwch_cq gets new t3_cq but keeps old cqid */
+	oldcq = chp->cq;
+	chp->cq = newcq;
+	chp->cq.cqid = oldcq.cqid;
+
+	/* resize new t3_cq to update the HW context */
+	ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq);
+	if (ret) {
+		chp->cq = oldcq;
+		return ret;
+	}
+	chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1;
+
+	/* destroy old t3_cq */
+	oldcq.cqid = newcq.cqid;
+	ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq);
+	if (ret) {
+		printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", 
+			__FUNCTION__, ret);
+	}
+	
+	/* add user hooks here */
+
+	/* resume qps */
+	ret = iwch_resume_qps(chp);
+	return ret;
+}
+
+static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	enum t3_cq_opcode cq_op;
+	int err;
+	unsigned long flag;
+	struct iwch_req_notify_cq ucmd;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+	if (notify == IB_CQ_SOLICITED)
+		cq_op = CQ_ARM_SE;
+	else
+		cq_op = CQ_ARM_AN;
+	if (udata && t3b_device(rhp)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+			return -EFAULT;
+		spin_lock_irqsave(&chp->lock, flag);
+		chp->cq.rptr = ucmd.rptr;
+	} else
+		spin_lock_irqsave(&chp->lock, flag);
+	PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr);
+	err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0);
+	spin_unlock_irqrestore(&chp->lock, flag);
+	if (err) 
+		printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, 
+		       chp->cq.cqid);
+	return err;
+}
+
+static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	int len = vma->vm_end - vma->vm_start;
+	u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT;
+	struct cxio_rdev *rdev_p;
+	int ret = 0;
+	struct iwch_mm_entry *mm;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, 
+	     pgaddr, len);
+
+	if (vma->vm_start & (PAGE_SIZE-1)) {
+		return -EINVAL;
+	}
+
+	rdev_p = &(to_iwch_dev(context->device)->rdev);
+	ucontext = to_iwch_ucontext(context);
+
+	mm = remove_mmap(ucontext, pgaddr, len);
+	if (!mm)
+		return -EINVAL;
+	kfree(mm);
+
+	if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && 
+	    (pgaddr < (rdev_p->rnic_info.udbell_physbase + 
+		       rdev_p->rnic_info.udbell_len))) {
+
+		/*
+		 * Map T3 DB register.
+		 */
+		if (vma->vm_flags & VM_READ) {
+			return -EPERM;
+		}
+
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+		vma->vm_flags &= ~VM_MAYREAD;
+		ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	} else {
+
+		/*
+		 * Map WQ or CQ contig dma memory...
+		 */
+		ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	}
+	
+	return ret;
+}
+
+static int iwch_deallocate_pd(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid);
+	cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid);
+	kfree(php);
+	return 0;
+}
+
+static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev,
+			       struct ib_ucontext *context,
+			       struct ib_udata *udata)
+{
+	struct iwch_pd *php;
+	u32 pdid;
+	struct iwch_dev *rhp;
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	rhp = to_iwch_dev(ibdev);
+	pdid = cxio_hal_get_pdid(rhp->rdev.rscp);
+	if (!pdid)
+		return ERR_PTR(-EINVAL);
+	php = kzalloc(sizeof(*php), GFP_KERNEL);
+	if (!php) {
+		cxio_hal_put_pdid(rhp->rdev.rscp, pdid);
+		return ERR_PTR(-ENOMEM);
+	}
+	php->pdid = pdid;
+	php->rhp = rhp;
+	if (context) {
+		if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) {
+			iwch_deallocate_pd(&php->ibpd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+	PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php);
+	return &php->ibpd;
+}
+ 
+static int iwch_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mr *mhp;
+	u32 mmid;
+
+	PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr);
+	/* There can be no memory windows */
+	if (atomic_read(&ib_mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(ib_mr);
+	rhp = mhp->rhp;
+	mmid = mhp->attr.stag >> 8;
+	cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, 
+		       mhp->attr.pbl_addr);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	if (mhp->kva)
+		kfree((void *) (unsigned long) mhp->kva);
+	PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd,
+					struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					int acc,
+					u64 *iova_start)
+{
+	__be64 *page_list;
+	int shift;
+	u64 total_size;
+	int npages;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	int ret;
+		
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+
+	acc = iwch_convert_access(acc);
+
+	
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start,
+			 	   &total_size, &npages, &shift, &page_list);
+	if (ret) 
+		goto err;
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+
+	/* NOTE: TPT perms are backwards from BIND WR perms! */
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+
+	mhp->attr.va_fbo = *iova_start;
+	mhp->attr.page_size = shift - 12;
+
+	mhp->attr.len = (u32) total_size;
+	mhp->attr.pbl_size = npages;
+	ret = iwch_register_mem(rhp, php, mhp, shift, page_list);
+	kfree(page_list);
+	if (ret) {
+		goto err;
+	}
+	return &mhp->ibmr;
+err:
+	kfree(mhp);
+	return ERR_PTR(ret);
+	
+}
+
+static int iwch_reregister_phys_mem(struct ib_mr *mr, 
+				     int mr_rereg_mask,
+				     struct ib_pd *pd,
+                                     struct ib_phys_buf *buffer_list,
+                                     int num_phys_buf,
+                                     int acc, u64 * iova_start)
+{
+
+	struct iwch_mr mh, *mhp;
+	struct iwch_pd *php;
+	struct iwch_dev *rhp;
+	int new_acc;
+	__be64 *page_list = NULL;
+	int shift = 0;
+	u64 total_size;
+	int npages = 0;
+	int ret;
+
+	PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd);
+
+	/* There can be no memory windows */
+	if (atomic_read(&mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(mr);
+	rhp = mhp->rhp;
+	php = to_iwch_pd(mr->pd);
+
+	/* make sure we are on the same adapter */
+	if (rhp != php->rhp)
+		return -EINVAL;
+
+	new_acc = mhp->attr.perms;
+
+	memcpy(&mh, mhp, sizeof *mhp);
+
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		php = to_iwch_pd(pd);
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mh.attr.perms = iwch_convert_access(acc);
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		ret = build_phys_page_list(buffer_list, num_phys_buf,
+					   iova_start, &total_size,
+					   &npages, &shift, &page_list);
+		if (ret)
+			return ret;
+	}
+	ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages);
+	kfree(page_list);
+	if (ret)
+		return ret;
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		mhp->attr.pdid = php->pdid;
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mhp->attr.perms = acc;
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		mhp->attr.zbva = 0;
+		mhp->attr.va_fbo = *iova_start;
+		mhp->attr.page_size = shift - 12;
+		mhp->attr.len = (u32) total_size;
+		mhp->attr.pbl_size = npages;
+	}
+
+	return 0;	
+}
+
+
+struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				      int acc, struct ib_udata *udata)
+{
+	__be64 *pages;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	struct iwch_reg_user_mr_resp uresp;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	shift = ffs(region->page_size) - 1;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	acc = iwch_convert_access(acc);
+
+	i = n = 0;
+
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] = cpu_to_be64(sg_dma_address(
+					&chunk->page_list[j]) +
+					region->page_size * k);
+			}
+		}
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+	mhp->attr.va_fbo = region->virt_base;
+	mhp->attr.page_size = shift - 12;
+	mhp->attr.len = (u32) region->length;
+	mhp->attr.pbl_size = i;
+	err = iwch_register_mem(rhp, php, mhp, shift, pages);
+	kfree(pages);
+	if (err)
+		goto err;
+
+	if (udata && t3b_device(rhp)) {
+		uresp.pbl_addr = (mhp->attr.pbl_addr -
+                                 rhp->rdev.rnic_info.pbl_base) >> 3;
+		PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, 
+		     uresp.pbl_addr);
+			
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			iwch_dereg_mr(&mhp->ibmr);
+			/* iwch_dereg_mr() already freed mhp; do not free twice */
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
+	return &mhp->ibmr;
+
+err:
+	kfree(mhp);
+	return ERR_PTR(err);
+}
+
+struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva;
+	struct ib_mr *ibmr;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+
+	/*
+	 * T3 only supports 32 bits of size.
+	 */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	kva = 0;
+	ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva);
+	return ibmr;
+}
+
+struct ib_mw *iwch_alloc_mw(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mw *mhp;
+	u32 mmid;
+	u32 stag = 0;
+	int ret;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+	ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid);
+	if (ret) {
+		kfree(mhp);
+		return ERR_PTR(ret);
+	}
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.type = TPT_MW;
+	mhp->attr.stag = stag;
+	mmid = (stag) >> 8;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag);
+	return &(mhp->ibmw);
+}
+
+int iwch_dealloc_mw(struct ib_mw *mw)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	u32 mmid;
+
+	mhp = to_iwch_mw(mw);
+	rhp = mhp->rhp;
+	mmid = (mw->rkey) >> 8;
+	cxio_deallocate_window(&rhp->rdev, mhp->attr.stag);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	kfree(mhp);
+	PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp);
+	return 0;
+}
+
+static int iwch_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_qp_attributes attrs;
+	struct iwch_ucontext *ucontext;
+
+	qhp = to_iwch_qp(ib_qp);
+	rhp = qhp->rhp;
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
+	}
+	wait_event(qhp->wait, !qhp->ep);
+
+	remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid);
+
+	atomic_dec(&qhp->refcnt);
+	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
+
+	ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context)
+				  : NULL;
+	cxio_destroy_qp(&rhp->rdev, &qhp->wq,
+			ucontext ? &ucontext->uctx : &rhp->rdev.uctx);
+
+	PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, 
+	     ib_qp, qhp->wq.qpid, qhp);
+	kfree(qhp);
+	return 0;
+}
+
+static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
+			     struct ib_qp_init_attr *attrs,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_pd *php;
+	struct iwch_cq *schp;
+	struct iwch_cq *rchp;
+	struct iwch_create_qp_resp uresp;
+	int wqsize, sqsize, rqsize;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	if (attrs->qp_type != IB_QPT_RC) 
+		return ERR_PTR(-EINVAL);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	schp = get_chp(rhp, to_iwch_cq(attrs->send_cq)->cq.cqid);
+	rchp = get_chp(rhp, to_iwch_cq(attrs->recv_cq)->cq.cqid);
+	if (!schp || !rchp)
+		return ERR_PTR(-EINVAL);
+
+	/* The RQT size must be # of entries + 1 rounded up to a power of two */
+	rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr);
+	if (rqsize == attrs->cap.max_recv_wr)
+		rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1);
+
+	/* T3 doesn't support RQT depth < 16 */
+	if (rqsize < 16)
+		rqsize = 16;
+
+	if (rqsize > T3_MAX_RQ_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * NOTE: The SQ and total WQ sizes don't need to be
+	 * a power of two.  However, all the code assumes
+	 * they are, e.g. Q_FREECNT() and friends.
+	 */
+	sqsize = roundup_pow_of_two(attrs->cap.max_send_wr);
+	wqsize = roundup_pow_of_two(rqsize + sqsize);
+	PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, 
+	     wqsize, sqsize, rqsize);
+	qhp = kzalloc(sizeof(*qhp), GFP_KERNEL);
+	if (!qhp)
+		return ERR_PTR(-ENOMEM);
+	qhp->wq.size_log2 = long_log2(wqsize);
+	qhp->wq.rq_size_log2 = long_log2(rqsize);
+	qhp->wq.sq_size_log2 = long_log2(sqsize);
+	ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL;
+	if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq,
+			   ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) {
+		kfree(qhp);
+		return ERR_PTR(-ENOMEM);
+	}
+	attrs->cap.max_recv_wr = rqsize - 1;
+	attrs->cap.max_send_wr = sqsize;
+	qhp->rhp = rhp;
+	qhp->attr.pd = php->pdid;
+	qhp->attr.scq = to_iwch_cq(attrs->send_cq)->cq.cqid;
+	qhp->attr.rcq = to_iwch_cq(attrs->recv_cq)->cq.cqid;
+	qhp->attr.sq_num_entries = attrs->cap.max_send_wr;
+	qhp->attr.rq_num_entries = attrs->cap.max_recv_wr;
+	qhp->attr.sq_max_sges = attrs->cap.max_send_sge;
+	qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge;
+	qhp->attr.rq_max_sges = attrs->cap.max_recv_sge;
+	qhp->attr.state = IWCH_QP_STATE_IDLE;
+	qhp->attr.next_state = IWCH_QP_STATE_IDLE;
+
+	/*
+	 * XXX - These don't get passed in from the openib user
+	 * at create time.  The CM sets them via a QP modify.
+	 * Need to fix...  I think the CM should
+	 */
+	qhp->attr.enable_rdma_read = 1;
+	qhp->attr.enable_rdma_write = 1;
+	qhp->attr.enable_bind = 1;
+	qhp->attr.max_ord = 1;
+	qhp->attr.max_ird = 1;
+
+	spin_lock_init(&qhp->lock);
+	init_waitqueue_head(&qhp->wait);
+	atomic_set(&qhp->refcnt, 1);
+	insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid);
+
+	if (udata) {
+
+		struct iwch_mm_entry *mm1, *mm2;
+
+		mm1 = kmalloc(sizeof *mm1, GFP_KERNEL);
+		if (!mm1) {
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		mm2 = kmalloc(sizeof *mm2, GFP_KERNEL);
+		if (!mm2) {
+			kfree(mm1);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		uresp.qpid = qhp->wq.qpid;
+		uresp.size_log2 = qhp->wq.size_log2;
+		uresp.sq_size_log2 = qhp->wq.sq_size_log2;
+		uresp.rq_size_log2 = qhp->wq.rq_size_log2;
+		uresp.physaddr = virt_to_phys(qhp->wq.queue);
+		uresp.doorbell = qhp->wq.udb;
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm1);
+			kfree(mm2);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-EFAULT);
+		}
+		mm1->addr = uresp.physaddr;
+		mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr));
+		insert_mmap(ucontext, mm1);
+		mm2->addr = uresp.doorbell & PAGE_MASK;
+		mm2->len = PAGE_SIZE;
+		insert_mmap(ucontext, mm2);
+	}
+	qhp->ibqp.qp_num = qhp->wq.qpid;
+	init_timer(&(qhp->timer));
+	PDBG("%s sq_num_entries %d, rq_num_entries %d "
+	     "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n",
+	     __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries,
+	     qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2);
+	return (&qhp->ibqp);
+}
+
+static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	enum iwch_qp_attr_mask mask = 0;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp);
+
+	/* iWARP does not support the RTR state */
+	if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR))
+		attr_mask &= ~IB_QP_STATE;
+
+	/* Make sure we still have something left to do */
+	if (!attr_mask)
+		return 0;
+
+	memset(&attrs, 0, sizeof attrs);
+	qhp = to_iwch_qp(ibqp);
+	rhp = qhp->rhp;
+
+	attrs.next_state = iwch_convert_state(attr->qp_state);
+	attrs.enable_rdma_read = (attr->qp_access_flags &
+			       IB_ACCESS_REMOTE_READ) ? 1 : 0;
+	attrs.enable_rdma_write = (attr->qp_access_flags &
+				IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
+	attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0;
+
+
+	mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0;
+	mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? 
+			(IWCH_QP_ATTR_ENABLE_RDMA_READ |
+			 IWCH_QP_ATTR_ENABLE_RDMA_WRITE | 
+			 IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0;
+
+	return iwch_modify_qp(rhp, qhp, mask, &attrs, 0);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	atomic_inc(&(to_iwch_qp(qp)->refcnt));
+}
+
+void iwch_qp_rem_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt)))
+		wake_up(&(to_iwch_qp(qp)->wait));
+}
+
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn)
+{
+	PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn);
+	return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn);
+}
+
+
+static int iwch_query_pkey(struct ib_device *ibdev,
+			   u8 port, u16 index, u16 * pkey)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	*pkey = 0;
+	return 0;
+}
+
+static int iwch_query_gid(struct ib_device *ibdev, u8 port,
+			  int index, union ib_gid *gid)
+{
+	struct iwch_dev *dev;
+
+	PDBG("%s ibdev %p, port %d, index %d, gid %p\n",
+	       __FUNCTION__, ibdev, port, index, gid);
+	dev = to_iwch_dev(ibdev);
+	BUG_ON(port == 0 || port > 2);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6);
+	return 0;
+}
+
+static int iwch_query_device(struct ib_device *ibdev,
+			     struct ib_device_attr *props)
+{
+
+	struct iwch_dev *dev;
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+
+	dev = to_iwch_dev(ibdev);
+	memset(props, 0, sizeof *props);
+	memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	props->device_cap_flags = dev->device_cap_flags;
+	props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor;
+	props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device;
+	props->max_mr_size = ~0ull;
+	props->max_qp = dev->attr.max_qps;
+	props->max_qp_wr = dev->attr.max_wrs;
+	props->max_sge = dev->attr.max_sge_per_wr;
+	props->max_sge_rd = 1;
+	props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp;
+	props->max_cq = dev->attr.max_cqs;
+	props->max_cqe = dev->attr.max_cqes_per_cq;
+	props->max_mr = dev->attr.max_mem_regs;
+	props->max_pd = dev->attr.max_pds;
+	props->local_ca_ack_delay = 0;
+
+	return 0;
+}
+
+static int iwch_query_port(struct ib_device *ibdev,
+			   u8 port, struct ib_port_attr *props)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_SNMP_TUNNEL_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_DEVICE_MGMT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 2;
+	props->active_speed = 2;
+	props->max_msg_sz = -1;
+
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.fw_version);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.driver);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, dev);
+	return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor,
+		       dev->rdev.rnic_info.pdev->device);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *iwch_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+int iwch_register_device(struct iwch_dev *dev)
+{
+	int ret;
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->device_cap_flags =
+	    (IB_DEVICE_ZERO_STAG |
+	     IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW);
+
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
+	dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
+	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.query_device = iwch_query_device;
+	dev->ibdev.query_port = iwch_query_port;
+	dev->ibdev.modify_port = iwch_modify_port;
+	dev->ibdev.query_pkey = iwch_query_pkey;
+	dev->ibdev.query_gid = iwch_query_gid;
+	dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext;
+	dev->ibdev.mmap = iwch_mmap;
+	dev->ibdev.alloc_pd = iwch_allocate_pd;
+	dev->ibdev.dealloc_pd = iwch_deallocate_pd;
+	dev->ibdev.create_ah = iwch_ah_create;
+	dev->ibdev.destroy_ah = iwch_ah_destroy;
+	dev->ibdev.create_qp = iwch_create_qp;
+	dev->ibdev.modify_qp = iwch_ib_modify_qp;
+	dev->ibdev.destroy_qp = iwch_destroy_qp;
+	dev->ibdev.create_cq = iwch_create_cq;
+	dev->ibdev.destroy_cq = iwch_destroy_cq;
+	dev->ibdev.resize_cq = iwch_resize_cq;
+	dev->ibdev.poll_cq = iwch_poll_cq;
+	dev->ibdev.get_dma_mr = iwch_get_dma_mr;
+	dev->ibdev.reg_phys_mr = iwch_register_phys_mem;
+	dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem;
+	dev->ibdev.reg_user_mr = iwch_reg_user_mr;
+	dev->ibdev.dereg_mr = iwch_dereg_mr;
+	dev->ibdev.alloc_mw = iwch_alloc_mw;
+	dev->ibdev.bind_mw = iwch_bind_mw;
+	dev->ibdev.dealloc_mw = iwch_dealloc_mw;
+
+	dev->ibdev.attach_mcast = iwch_multicast_attach;
+	dev->ibdev.detach_mcast = iwch_multicast_detach;
+	dev->ibdev.process_mad = iwch_process_mad;
+
+	dev->ibdev.req_notify_cq = iwch_arm_cq;
+	dev->ibdev.post_send = iwch_post_send;
+	dev->ibdev.post_recv = iwch_post_receive;
+
+
+	dev->ibdev.iwcm = kmalloc(sizeof(struct iw_cm_verbs), GFP_KERNEL);
+	if (!dev->ibdev.iwcm)
+		return -ENOMEM;
+	dev->ibdev.iwcm->connect = iwch_connect;
+	dev->ibdev.iwcm->accept = iwch_accept_cr;
+	dev->ibdev.iwcm->reject = iwch_reject_cr;
+	dev->ibdev.iwcm->create_listen = iwch_create_listen;
+	dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen;
+	dev->ibdev.iwcm->add_ref = iwch_qp_add_ref;
+	dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref;
+	dev->ibdev.iwcm->get_qp = iwch_get_qp;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		goto bail1;
+
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       iwch_class_attributes[i]);
+		if (ret) {
+			goto bail2;
+		}
+	}
+	return 0;
+bail2:
+	ib_unregister_device(&dev->ibdev);
+bail1:
+	return ret;
+}
+
+void iwch_unregister_device(struct iwch_dev *dev)
+{
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i)
+		class_device_remove_file(&dev->ibdev.class_dev,
+					 iwch_class_attributes[i]);
+	ib_unregister_device(&dev->ibdev);
+	return;
+}
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h
new file mode 100644
index 0000000..76616ac
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h
@@ -0,0 +1,362 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_PROVIDER_H__
+#define __IWCH_PROVIDER_H__
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <rdma/ib_verbs.h>
+#include <asm/types.h>
+#include "t3cdev.h"
+#include "iwch.h"
+#include "cxio_wr.h"
+#include "cxio_hal.h"
+
+struct iwch_pd {
+	struct ib_pd ibpd;
+	u32 pdid;
+	struct iwch_dev *rhp;
+};
+
+static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct iwch_pd, ibpd);
+}
+
+struct tpt_attributes {
+	u32 stag;
+	u32 state:1;
+	u32 type:2;
+	u32 rsvd:1;
+	enum tpt_mem_perm perms;
+	u32 remote_invaliate_disable:1;
+	u32 zbva:1;
+	u32 mw_bind_enable:1;
+	u32 page_size:5;
+
+	u32 pdid;
+	u32 qpid;
+	u32 pbl_addr;
+	u32 len;
+	u64 va_fbo;
+	u32 pbl_size;
+};
+
+struct iwch_mr {
+	struct ib_mr ibmr;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+typedef struct iwch_mw iwch_mw_handle;
+
+static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct iwch_mr, ibmr);
+}
+
+struct iwch_mw {
+	struct ib_mw ibmw;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw)
+{
+	return container_of(ibmw, struct iwch_mw, ibmw);
+}
+
+struct iwch_cq {
+	struct ib_cq ibcq;
+	struct iwch_dev *rhp;
+	struct t3_cq cq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+};
+
+static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct iwch_cq, ibcq);
+}
+
+enum IWCH_QP_FLAGS {
+	QP_QUIESCED = 0x01
+};
+
+struct iwch_mpa_attributes {
+	u8 recv_marker_enabled;
+	u8 xmit_marker_enabled;	/* enable MPA markers on transmit */
+	u8 crc_enabled;
+	u8 version;	/* 0 or 1 */
+};
+
+struct iwch_qp_attributes {
+	u32 scq;
+	u32 rcq;
+	u32 sq_num_entries;
+	u32 rq_num_entries;
+	u32 sq_max_sges;
+	u32 sq_max_sges_rdma_write;
+	u32 rq_max_sges;
+	u32 state;
+	u8 enable_rdma_read;
+	u8 enable_rdma_write;	/* enable inbound Read Resp. */
+	u8 enable_bind;
+	u8 enable_mmid0_fastreg;	/* Enable STAG0 + Fast-register */
+	/*
+	 * Next QP state. If the current state is specified, only the
+	 * QP attributes will be modified.
+	 */
+	u32 max_ord;
+	u32 max_ird;
+	u32 pd;	/* IN */
+	u32 next_state;
+	char terminate_buffer[52];
+	u32 terminate_msg_len;
+	u8 is_terminate_local;
+	struct iwch_mpa_attributes mpa_attr;	/* IN-OUT */
+	struct iwch_ep *llp_stream_handle;
+	char *stream_msg_buf;	/* Last stream msg. before Idle -> RTS */
+	u32 stream_msg_buf_len;	/* Only on Idle -> RTS */
+};
+
+struct iwch_qp {
+	struct ib_qp ibqp;
+	struct iwch_dev *rhp;
+	struct iwch_ep *ep;
+	struct iwch_qp_attributes attr;
+	struct t3_wq wq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+	enum IWCH_QP_FLAGS flags;
+	struct timer_list timer;
+};
+
+static inline int qp_quiesced(struct iwch_qp *qhp)
+{
+	return (qhp->flags & QP_QUIESCED);
+}
+
+static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct iwch_qp, ibqp);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp);
+void iwch_qp_rem_ref(struct ib_qp *qp);
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn);
+
+struct iwch_ucontext {
+	struct ib_ucontext ibucontext;
+	struct cxio_ucontext uctx;
+	struct list_head mmaps;
+};
+
+static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c)
+{
+	return container_of(c, struct iwch_ucontext, ibucontext);
+}
+
+struct iwch_mm_entry {
+	struct list_head entry;
+	u64 addr;
+	unsigned len;
+};
+
+static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext,
+						u64 addr, unsigned len)
+{
+	struct list_head *pos, *nxt;
+	struct iwch_mm_entry *mm;
+
+	mutex_lock(&ucontext->uctx.lock);
+	list_for_each_safe(pos, nxt, &ucontext->mmaps) {
+		
+		mm = list_entry(pos, struct iwch_mm_entry, entry);
+		if (mm->addr == addr && mm->len == len) {
+			list_del_init(&mm->entry);
+			mutex_unlock(&ucontext->uctx.lock);
+			PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, 
+			     mm->len);
+			return mm;
+		}
+	}
+	mutex_unlock(&ucontext->uctx.lock);
+	return NULL;
+}
+
+static inline void insert_mmap(struct iwch_ucontext *ucontext, 
+			       struct iwch_mm_entry *mm)
+{
+	mutex_lock(&ucontext->uctx.lock);
+	PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len);
+	list_add_tail(&mm->entry, &ucontext->mmaps);
+	mutex_unlock(&ucontext->uctx.lock);
+}
+
+enum iwch_qp_attr_mask {
+	IWCH_QP_ATTR_NEXT_STATE = 1 << 0,
+	IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7,
+	IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8,
+	IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9,
+	IWCH_QP_ATTR_MAX_ORD = 1 << 11,
+	IWCH_QP_ATTR_MAX_IRD = 1 << 12,
+	IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22,
+	IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23,
+	IWCH_QP_ATTR_MPA_ATTR = 1 << 24,
+	IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25,
+	IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ |
+				     IWCH_QP_ATTR_ENABLE_RDMA_WRITE |
+				     IWCH_QP_ATTR_MAX_ORD |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+				     IWCH_QP_ATTR_STREAM_MSG_BUFFER |
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE)
+};
+
+int iwch_modify_qp(struct iwch_dev *rhp,
+				struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal);
+
+enum iwch_qp_state {
+	IWCH_QP_STATE_IDLE,
+	IWCH_QP_STATE_RTS,
+	IWCH_QP_STATE_ERROR,
+	IWCH_QP_STATE_TERMINATE,
+	IWCH_QP_STATE_CLOSING,
+	IWCH_QP_STATE_TOT
+};
+
+static inline int iwch_convert_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+	case IB_QPS_INIT:
+		return IWCH_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return IWCH_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return IWCH_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return IWCH_QP_STATE_TERMINATE;
+	case IB_QPS_ERR:
+		return IWCH_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+enum iwch_mem_perms {
+	IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0,
+	IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1,
+	IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2,
+	IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3,
+	IWCH_MEM_ACCESS_ATOMICS = 1 << 4,
+	IWCH_MEM_ACCESS_BINDING = 1 << 5,
+	IWCH_MEM_ACCESS_LOCAL =
+	    (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE),
+	IWCH_MEM_ACCESS_REMOTE =
+	    (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ)
+	    /* cannot go beyond 1 << 31 */
+} __attribute__ ((packed));
+
+static inline u32 iwch_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0)
+	    | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) |
+	    (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) |
+	    IWCH_MEM_ACCESS_LOCAL_READ;
+}
+
+enum iwch_mmid_state {
+	IWCH_STAG_STATE_VALID,
+	IWCH_STAG_STATE_INVALID
+};
+
+enum iwch_qp_query_flags {
+	IWCH_QP_QUERY_CONTEXT_NONE = 0x0,	/* No ctx; Only attrs */
+	IWCH_QP_QUERY_CONTEXT_GET = 0x1,	/* Get ctx + attrs */
+	IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2,	/* Not Supported */
+
+	/*
+	 * Quiesce QP context; Consumer
+	 * will NOT replay outstanding WR
+	 */
+	IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4,
+	IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8,
+	IWCH_QP_QUERY_TEST_USERWRITE = 0x32	/* Test special */
+};
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr);
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr);
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind);
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list);
+
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif
diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h
new file mode 100644
index 0000000..4e4b9c9
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_user.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_USER_H__
+#define __IWCH_USER_H__
+
+#define IWCH_UVERBS_ABI_VERSION	1
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct iwch_create_cq_resp {
+	__u64 physaddr;		
+	__u32 cqid;
+	__u32 size_log2;
+};
+
+struct iwch_create_qp_resp {
+	__u64 physaddr;
+	__u64 doorbell;	
+	__u32 qpid;
+	__u32 size_log2;
+	__u32 sq_size_log2;
+	__u32 rq_size_log2;
+};
+
+struct iwch_reg_user_mr_resp {
+	__u32 pbl_addr;
+};
+
+struct iwch_req_notify_cq {
+	__u32 rptr;
+};
+#endif
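
An aside on the layout comment in iwch_user.h: the rule that these structs
must pack the same way on 32-bit and 64-bit builds can be checked at compile
time from userspace. What follows is only a sketch and not part of the patch.
The sketch_ names are made up for illustration, and the struct is assumed to
mirror iwch_create_qp_resp with uint64_t/uint32_t standing in for
__u64/__u32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct iwch_create_qp_resp. */
struct sketch_create_qp_resp {
	uint64_t physaddr;
	uint64_t doorbell;
	uint32_t qpid;
	uint32_t size_log2;
	uint32_t sq_size_log2;
	uint32_t rq_size_log2;
};

/*
 * Two 64-bit fields followed by four 32-bit fields: no padding is needed
 * whether 64-bit integers are aligned to 4 or to 8 bytes, so the size and
 * offsets are the same for 32-bit and 64-bit builds.
 */
static_assert(sizeof(struct sketch_create_qp_resp) == 32,
	      "response struct must be 32 bytes on every ABI");
static_assert(offsetof(struct sketch_create_qp_resp, qpid) == 16,
	      "qpid must follow the two 64-bit fields directly");

/* Runtime accessors so the same facts can be checked from a test. */
static inline size_t sketch_qp_resp_size(void)
{
	return sizeof(struct sketch_create_qp_resp);
}

static inline size_t sketch_qp_resp_qpid_offset(void)
{
	return offsetof(struct sketch_create_qp_resp, qpid);
}
```

Putting the 64-bit members first, as the patch does, is what keeps the layout
free of implicit padding. Building the sketch with both -m32 and -m64 should
pass the same assertions.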

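For anyone reviewing the RQ sizing in iwch_create_qp() from the provider
patch, the arithmetic can be tried outside the kernel. This is a sketch under
assumptions, not the driver code: sketch_roundup_pow2() is a stand-in for the
kernel's roundup_pow_of_two(), and max_rq is a parameter standing in for
T3_MAX_RQ_SIZE, whose value is not shown in this part of the series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's roundup_pow_of_two(). */
static inline uint32_t sketch_roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	/* Larger inputs would overflow the shift below. */
	assert(n <= 0x80000000u);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirrors the RQ sizing steps in iwch_create_qp(): round the requested
 * depth up to a power of two, take the next power of two if the request
 * was already one (so there is room for "entries + 1"), then apply the
 * floor of 16. Returns 0 where the driver would fail with -EINVAL.
 */
static inline uint32_t sketch_rq_size(uint32_t max_recv_wr, uint32_t max_rq)
{
	uint32_t rqsize = sketch_roundup_pow2(max_recv_wr);

	if (rqsize == max_recv_wr)
		rqsize = sketch_roundup_pow2(max_recv_wr + 1);
	if (rqsize < 16)
		rqsize = 16;
	return rqsize > max_rq ? 0 : rqsize;
}
```

With a limit of 2048, a request for 16 receive WRs gives a ring of 32 and a
request for 15 gives 16. The driver then reports rqsize - 1 back to the caller
in attrs->cap.max_recv_wr.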

From swise at opengridcomputing.com  Sat Dec  2 14:49:58 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:49:58 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202224958.27014.65970.stgit@dell3.ogc.int>


This code implements the iWARP CM provider methods for the Chelsio driver.
The Chelsio ULLD is used to set up and tear down TCP connections, and the
T3 RDMA Core is used to move the connections in and out of RDMA mode.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c | 2059 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  223 ++++
 2 files changed, 2282 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
new file mode 100644
index 0000000..5c59396
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -0,0 +1,2059 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/skbuff.h>
+#include <linux/timer.h>
+#include <linux/notifier.h>
+
+#include <net/neighbour.h>
+#include <net/netevent.h>
+#include <net/route.h>
+
+#include "tcb.h"
+#include "cxgb3_offload.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+
+char *states[] = {
+	"idle",
+	"listen",
+	"connecting",
+	"mpa_wait_req",
+	"mpa_req_sent",
+	"mpa_req_rcvd",
+	"mpa_rep_sent",
+	"fpdu_mode",
+	"aborting",
+	"closing",
+	"moribund",
+	"dead",
+	NULL,
+};
+
+static int ep_timeout_secs = 10;
+module_param(ep_timeout_secs, int, 0444);
+MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
+				   "in seconds (default=10)");
+
+static int mpa_rev = 1;
+module_param(mpa_rev, int, 0444);
+MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
+		 "1 is spec compliant. (default=1)");
+
+static int markers_enabled = 0;
+module_param(markers_enabled, int, 0444);
+MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
+
+static int crc_enabled = 1;
+module_param(crc_enabled, int, 0444);
+MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
+
+static int rcv_win = 512 * 1024;
+module_param(rcv_win, int, 0444);
+MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)");
+
+static int snd_win = 512 * 1024;
+module_param(snd_win, int, 0444);
+MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)");
+
+static unsigned int nocong = 1;
+module_param(nocong, uint, 0444);
+MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)");
+
+static void process_work(void *ctx);
+static struct workqueue_struct *workq;
+static DECLARE_WORK(skb_work, process_work, NULL);
+
+static struct sk_buff_head rxq;
+static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS];
+
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp);
+static void ep_timeout(unsigned long arg);
+static void connect_reply_upcall(struct iwch_ep *ep, int status);
+
+static void start_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	if (timer_pending(&ep->timer)) {
+		PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep);
+		del_timer_sync(&ep->timer);
+	} else
+		get_ep(&ep->com);
+	ep->timer.expires = jiffies + ep_timeout_secs * HZ;
+	ep->timer.data = (unsigned long)ep;
+	ep->timer.function = ep_timeout;
+	add_timer(&ep->timer);
+}
+
+static void stop_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	del_timer_sync(&ep->timer);
+	put_ep(&ep->com);
+}
+
+static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
+{
+	struct cpl_tid_release *req;
+
+	skb = get_skb(skb, sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return;
+	req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
+	skb->priority = CPL_PRIORITY_SETUP;
+	tdev->send(tdev, skb);
+	return;
+}
+
+int iwch_quiesce_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+int iwch_resume_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = 0;
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static void set_emss(struct iwch_ep *ep, u16 opt)
+{
+	PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt);
+	ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40;
+	if (G_TCPOPT_TSTAMP(opt))
+		ep->emss -= 12;
+	if (ep->emss < 128)
+		ep->emss = 128;
+	PDBG("emss=%d\n", ep->emss);
+}
+
+static int state_comp_exch(struct iwch_ep_common *epc,
+			   enum iwch_ep_state comp,
+			   enum iwch_ep_state exch)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	ret = (epc->state == comp);
+	if (ret)
+		epc->state = exch;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return ret;
+}
+
+static enum iwch_ep_state state_read(struct iwch_ep_common *epc)
+{
+	unsigned long flags;
+	enum iwch_ep_state state;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	state = epc->state;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return state;
+}
+
+static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], 
+		states[new]);
+	epc->state = new;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return;
+}
+
+static void *alloc_ep(int size, gfp_t gfp)
+{
+	struct iwch_ep_common *epc;
+
+	epc = kmalloc(size, gfp);
+	if (epc) {
+		memset(epc, 0, size);
+		kref_init(&epc->kref);
+		spin_lock_init(&epc->lock);
+		init_waitqueue_head(&epc->waitq);
+	}
+	PDBG("%s alloc ep %p\n", __FUNCTION__, epc);
+	return (void *) epc;
+}
+
+void __free_ep(struct kref *kref) 
+{
+	struct iwch_ep_common *epc;
+	epc = container_of(kref, struct iwch_ep_common, kref);
+	PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]);
+	kfree(epc);
+}
+
+static void release_ep_resources(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	state_set(&ep->com, DEAD);
+	cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, ep->hwtid, NULL);
+	put_ep(&ep->com);
+}
+
+static void process_work(void *ctx)
+{
+	struct sk_buff *skb = NULL;
+	void *ep;
+	struct t3cdev *tdev;
+	int ret;
+
+	while ((skb = skb_dequeue(&rxq))) {
+		ep = *((void **) (skb->cb));
+		tdev = *((struct t3cdev **) (skb->cb + sizeof(void *)));
+		ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep);
+		if (ret & CPL_RET_BUF_DONE)
+			kfree_skb(skb);
+
+		/* 
+		 * A reference on ep was taken in sched(); drop it here.
+		 */
+		put_ep((struct iwch_ep_common *)ep);
+	}
+}
+
+static int status2errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_NONE:
+		return 0;
+	case CPL_ERR_CONN_RESET:
+		return -ECONNRESET;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Try to reuse an skb that is already allocated; otherwise allocate one.
+ */
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp)
+{
+	if (skb) {
+		BUG_ON(skb_cloned(skb));
+		skb_trim(skb, 0);
+		skb_get(skb);
+	} else {
+		skb = alloc_skb(len, gfp);
+	}
+	return skb;
+}
+
+static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, 
+				 __be32 peer_ip, __be16 local_port,
+				 __be16 peer_port, u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = 0,
+		.nl_u = {
+			 .ip4_u = {
+				   .daddr = peer_ip,
+				   .saddr = local_ip,
+				   .tos = tos}
+			 },
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			  .ports = {
+				    .sport = local_port,
+				    .dport = peer_port}
+			  }
+	};
+
+	if (ip_route_output_flow(&rt, &fl, NULL, 0))
+		return NULL;
+	return rt;
+}
+
+static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu)
+		++i;
+	return i;
+}
+
+static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for an active open.   
+ */
+static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	printk(KERN_ERR MOD "ARP failure during connect\n");
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for a CPL_ABORT_REQ.  Change it into a no RST variant
+ * and send it along.
+ */
+static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	struct cpl_abort_req *req = cplhdr(skb);
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb3_ofld_send(dev, skb);
+}
+
+static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
+{
+	struct cpl_close_con_req *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
+{
+	struct cpl_abort_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(skb, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, abort_arp_failure);
+	req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
+	req->cmd = CPL_ABORT_SEND_RST;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_connect(struct iwch_ep *ep)
+{
+	struct cpl_act_open_req *req;
+	struct sk_buff *skb;
+	u32 opt0h, opt0l, opt2;
+	unsigned int mtu_idx;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+	skb->priority = CPL_PRIORITY_SETUP;
+	set_arp_failure_handler(skb, act_open_req_arp_failure);
+
+	req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->peer_port = ep->com.remote_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_ip = ep->com.remote_addr.sin_addr.s_addr;
+	req->opt0h = htonl(opt0h);
+	req->opt0l = htonl(opt0l);
+	req->params = 0;
+	req->opt2 = htonl(opt2);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+
+	PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen);
+
+	BUG_ON(skb_cloned(skb));
+
+	mpalen = sizeof(*mpa) + ep->plen;
+	if (skb->data + mpalen + sizeof(*req) > skb->end) {
+		kfree_skb(skb);
+		skb = alloc_skb(mpalen + sizeof(*req), GFP_KERNEL);
+		if (!skb) {
+			connect_reply_upcall(ep, -ENOMEM);
+			return;
+		}
+	}
+	skb_trim(skb, 0);
+	skb_reserve(skb, sizeof(*req));
+	skb_put(skb, mpalen);
+	skb->priority = CPL_PRIORITY_DATA;
+	mpa = (struct mpa_message *) skb->data;
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key));
+	mpa->flags = (crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->private_data_size = htons(ep->plen);
+	mpa->revision = mpa_rev;
+
+	if (ep->plen)
+		memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen);
+
+	/* 
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	start_ep_timer(ep);
+	state_set(&ep->com, MPA_REQ_SENT);
+	return;
+}
+
+static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = MPA_REJECT;
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/* 
+	 * Reference the mpa skb again.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(mpalen);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/* 
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	ep->mpa_skb = skb;
+	state_set(&ep->com, MPA_REP_SENT);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_establish *req = cplhdr(skb);
+	unsigned int tid = GET_TID(req);
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid);
+
+	dst_confirm(ep->dst);
+
+	/* setup the hwtid for this connection */
+	ep->hwtid = tid;
+	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
+
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	/* dealloc the atid */
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+
+	/* start MPA negotiation */
+	send_mpa_req(ep, skb);
+
+	return 0;
+}
+
+static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	state_set(&ep->com, ABORTING);
+	send_abort(ep, skb, GFP_KERNEL);
+}
+
+static void close_complete_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	if (ep->com.cm_id) {
+		PDBG("close complete delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void peer_close_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_DISCONNECT;
+	if (ep->com.cm_id) {
+		PDBG("peer close delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static void peer_abort_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	event.status = -ECONNRESET;
+	if (ep->com.cm_id) {
+		PDBG("abort delivered ep %p cm_id %p tid %d\n", ep,
+		     ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_reply_upcall(struct iwch_ep *ep, int status)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REPLY;
+	event.status = status;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+
+	if ((status == 0) || (status == -ECONNREFUSED)) {
+		event.private_data_len = ep->plen;
+		event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	}
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, 
+		     ep->hwtid, status);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+	if (status < 0 && ep->com.cm_id) {
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_request_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REQUEST;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+	event.private_data_len = ep->plen;
+	event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	event.provider_data = ep;
+	if (state_read(&ep->parent_ep->com) != DEAD)
+		ep->parent_ep->com.cm_id->event_handler(
+						ep->parent_ep->com.cm_id,
+						&event);
+	put_ep(&ep->parent_ep->com);
+	ep->parent_ep = NULL;
+}
+
+static void established_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_ESTABLISHED;
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static int update_rx_credits(struct iwch_ep *ep, u32 credits)
+{
+	struct cpl_rx_data_ack *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n");
+		return 0;
+	}
+
+	req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
+	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
+	skb->priority = CPL_PRIORITY_ACK;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return credits;
+}
+
+static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	int err;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/* 
+	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted 
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+	state_set(&ep->com, FPDU_MODE);
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/* 
+	 * if we don't even have the mpa message, then bail. 
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* Validate MPA header. */
+	if (mpa->revision != mpa_rev) {
+		err = -EPROTO;
+		goto err;
+	}
+	if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	/*
+	 * Fail if we received more data than the header plus plen allows.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	if (mpa->flags & MPA_REJECT) {
+		err = -ECONNREFUSED;
+		goto err;
+	}
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start reply message including private data. And
+	 * the MPA header is valid.
+	 */
+
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+	    IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR |
+	    IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD;
+
+	/* bind QP and TID with INIT_WR */
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+	if (!err)
+		goto out;
+err:
+	abort_connection(ep, skb);
+out:
+	connect_reply_upcall(ep, err);
+	return;
+}
+
+static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/* 
+	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted 
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+
+	/*
+	 * Copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/* 
+	 * If we don't even have the mpa message, then bail. 
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* 
+	 * Validate MPA Header.
+	 */
+	if (mpa->revision != mpa_rev) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	/*
+	 * Fail if we received more data than the header plus plen allows.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		abort_connection(ep, skb);
+		return;
+	}
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start request message including private data.
+	 */
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	state_set(&ep->com, MPA_REQ_RCVD);
+
+	/* drive upcall */
+	connect_request_upcall(ep);
+	return;
+}
+
+static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_rx_data *hdr = cplhdr(skb);
+	unsigned int dlen = ntohs(hdr->len);
+
+	PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen);
+
+	skb_pull(skb, sizeof(*hdr));
+	skb_trim(skb, dlen);
+
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_SENT:
+		process_mpa_reply(ep, skb);
+		break;
+	case MPA_REQ_WAIT:
+		process_mpa_request(ep, skb);
+		break;
+	case MPA_REP_SENT:
+		break;
+	default:
+		printk(KERN_ERR MOD "%s Unexpected streaming data."
+		       " ep %p state %d tid %d\n",
+		       __FUNCTION__, ep, state_read(&ep->com), ep->hwtid);
+
+		/* 
+		 * The ep will timeout and inform the ULP of the failure.
+		 * See ep_timeout().
+		 */
+		break;
+	}
+
+	/* update RX credits */
+	update_rx_credits(ep, dlen);
+
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Upcall from the adapter indicating data has been transmitted.
+ * For us it's just the single MPA request or reply.  We can now free
+ * the skb holding the mpa message.
+ */
+static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_wr_ack *hdr = cplhdr(skb);
+	unsigned int credits = ntohs(hdr->credits);
+	enum iwch_qp_attr_mask mask;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+
+	if (credits == 0)
+		return CPL_RET_BUF_DONE;
+	BUG_ON(credits != 1);
+	BUG_ON(ep->mpa_skb == NULL);
+	kfree_skb(ep->mpa_skb);
+	ep->mpa_skb = NULL;
+	dst_confirm(ep->dst);
+	if (state_read(&ep->com) == MPA_REP_SENT) {
+		struct iwch_qp_attributes attrs;
+
+		/* bind QP to EP and move to RTS */
+		attrs.mpa_attr = ep->mpa_attr;
+		attrs.max_ird = ep->ird;
+		attrs.max_ord = ep->ord;
+		attrs.llp_stream_handle = ep;
+		attrs.next_state = IWCH_QP_STATE_RTS;
+
+		/* bind QP and TID with INIT_WR */
+		mask = IWCH_QP_ATTR_NEXT_STATE |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE | 
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_MAX_ORD;
+
+		ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, mask, &attrs, 1);
+
+		if (!ep->com.rpl_err) {
+			state_set(&ep->com, FPDU_MODE);
+			established_upcall(ep);
+		}
+
+		ep->com.rpl_done = 1;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	close_complete_upcall(ep);
+	release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status,
+	     status2errno(rpl->status));
+	connect_reply_upcall(ep, status2errno(rpl->status));
+	state_set(&ep->com, DEAD);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, GET_TID(rpl), NULL);
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	put_ep(&ep->com);
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_start(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_pass_open_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+
+	req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_port = 0;
+	req->peer_ip = 0;
+	req->peer_netmask = 0;
+	req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS);
+	req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10));
+	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
+
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_pass_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, 
+	     rpl->status, status2errno(rpl->status));
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_stop(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_close_listserv_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
+			     void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_close_listserv_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+	return CPL_RET_BUF_DONE;
+}
+
+static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
+{
+	struct cpl_pass_accept_rpl *rpl;
+	unsigned int mtu_idx;
+	u32 opt0h, opt0l, opt2;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(*rpl));
+	skb_get(skb);
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+
+	rpl = cplhdr(skb);
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid));
+	rpl->peer_ip = peer_ip;
+	rpl->opt0h = htonl(opt0h);
+	rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT);
+	rpl->opt2 = htonl(opt2);
+	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
+	skb->priority = CPL_PRIORITY_SETUP;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+
+	return;
+}
+
+static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
+		      struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, 
+	     peer_ip);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(struct cpl_tid_release));
+	skb_get(skb);
+
+	if (tdev->type == T3B)
+		release_tid(tdev, hwtid, skb);
+	else {
+		struct cpl_pass_accept_rpl *rpl;
+
+		rpl = cplhdr(skb);
+		skb->priority = CPL_PRIORITY_SETUP;
+		rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+		OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, 
+						      hwtid));
+		rpl->peer_ip = peer_ip;
+		rpl->opt0h = htonl(F_TCAM_BYPASS);
+		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
+		rpl->opt2 = 0;
+		rpl->rsvd = rpl->opt2;
+		tdev->send(tdev, skb);
+	}
+}
+
+static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *child_ep, *parent_ep = ctx;
+	struct cpl_pass_accept_req *req = cplhdr(skb);
+	unsigned int hwtid = GET_TID(req);
+	struct dst_entry *dst;
+	struct l2t_entry *l2t;
+	struct rtable *rt;
+	struct iff_mac tim;
+
+	PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid);
+
+	if (state_read(&parent_ep->com) != LISTEN) {
+		printk(KERN_ERR MOD "%s - listening ep not in LISTEN\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+
+	/*
+	 * Find the netdev for this connection request.
+	 */
+	tim.mac_addr = req->dst_mac;
+	tim.vlan_tag = ntohs(req->vlan_tag);
+	if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) {
+		printk(KERN_ERR 
+			"%s bad dst mac %02x %02x %02x %02x %02x %02x\n",
+			__FUNCTION__,
+			req->dst_mac[0],
+			req->dst_mac[1],
+			req->dst_mac[2],
+			req->dst_mac[3],
+			req->dst_mac[4],
+			req->dst_mac[5]);
+		goto reject;
+	}
+
+	/* Find output route */
+	rt = find_route(tdev,
+			req->local_ip,
+			req->peer_ip,
+			req->local_port,
+			req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid)));
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - failed to find dst entry!\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+	dst = &rt->u.dst;
+	l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port);
+	if (!l2t) {
+		printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n",
+		       __FUNCTION__);
+		dst_release(dst);
+		goto reject;
+	}
+	child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL);
+	if (!child_ep) {
+		printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n",
+		       __FUNCTION__);
+		l2t_release(L2DATA(tdev), l2t);
+		dst_release(dst);
+		goto reject;
+	}
+	state_set(&child_ep->com, CONNECTING);
+	child_ep->com.tdev = tdev;
+	child_ep->com.cm_id = NULL;
+	child_ep->com.local_addr.sin_family = PF_INET;
+	child_ep->com.local_addr.sin_port = req->local_port;
+	child_ep->com.local_addr.sin_addr.s_addr = req->local_ip;
+	child_ep->com.remote_addr.sin_family = PF_INET;
+	child_ep->com.remote_addr.sin_port = req->peer_port;
+	child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip;
+	get_ep(&parent_ep->com);
+	child_ep->parent_ep = parent_ep;
+	child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid));
+	child_ep->l2t = l2t;
+	child_ep->dst = dst;
+	child_ep->hwtid = hwtid;
+	init_timer(&child_ep->timer);
+	cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid);
+	accept_cr(child_ep, req->peer_ip, skb);
+	goto out;
+reject:
+	reject_cr(tdev, hwtid, req->peer_ip, skb);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_pass_establish *req = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	dst_confirm(ep->dst);
+	state_set(&ep->com, MPA_REQ_WAIT);
+	start_ep_timer(ep);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int abort = 0;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	dst_confirm(ep->dst);
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_WAIT:
+		state_set(&ep->com, CLOSING);
+		break;
+	case MPA_REQ_SENT:
+		state_set(&ep->com, CLOSING);
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REQ_RCVD:
+
+		/* 
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		state_set(&ep->com, CLOSING);
+		get_ep(&ep->com);
+		break;
+	case MPA_REP_SENT:
+		state_set(&ep->com, CLOSING);
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case FPDU_MODE:
+		state_set(&ep->com, CLOSING);
+		peer_close_upcall(ep);
+		attrs.next_state = IWCH_QP_STATE_CLOSING;
+		ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		if (ret) {
+			printk(KERN_ERR MOD "%s - qp <- closing err!\n",
+			       __FUNCTION__);
+			abort = 1;
+		}
+		break;
+	case ABORTING:
+		goto out;
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		goto out;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+				       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				       &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		goto out;
+	case DEAD:
+		goto out;
+	default:
+		BUG_ON(1);
+	}
+	iwch_ep_disconnect(ep, abort, GFP_KERNEL);	
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+	       status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_abort_req_rss *req = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+	struct cpl_abort_rpl *rpl;
+	struct sk_buff *rpl_skb;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int state;
+
+	if (is_neg_adv_abort(req->status)) {
+		PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, 
+		     ep->hwtid);
+		t3_l2t_send_event(ep->com.tdev, ep->l2t);
+		return CPL_RET_BUF_DONE;
+	}
+
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state);
+	switch (state) {
+	case CONNECTING:
+		break;
+	case MPA_REQ_WAIT:
+		break;
+	case MPA_REQ_SENT:
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REP_SENT:
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case MPA_REQ_RCVD:
+	
+		/* 
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		get_ep(&ep->com);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);	/* fall through */
+	case FPDU_MODE:
+	case CLOSING:
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+			if (ret)
+				printk(KERN_ERR MOD 
+				       "%s - qp <- error failed!\n",
+				       __FUNCTION__);
+		}
+		peer_abort_upcall(ep);
+		break;
+	case ABORTING:
+		break;
+	case DEAD:
+		PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__);
+		return CPL_RET_BUF_DONE;
+	default:
+		BUG_ON(1);
+		break;
+	}
+	dst_confirm(ep->dst);
+	
+	rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL);
+	if (!rpl_skb) {
+		printk(KERN_ERR MOD "%s - cannot allocate skb!\n",
+		       __FUNCTION__);
+		dst_release(ep->dst);
+		l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+		put_ep(&ep->com);
+		return CPL_RET_BUF_DONE;
+	}
+	rpl_skb->priority = CPL_PRIORITY_DATA;
+	rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl));
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL));
+	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
+	rpl->cmd = CPL_ABORT_NO_RST;
+	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	if (state != ABORTING)
+		release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(!ep);
+
+	/* The cm_id may be null if we failed to connect */
+	switch (state_read(&ep->com)) {
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if ((ep->com.cm_id) && (ep->com.qp)) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+					     ep->com.qp, 
+					     IWCH_QP_ATTR_NEXT_STATE,
+					     &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		break;
+	case DEAD:
+	default:
+		BUG_ON(1);
+		break;
+	}
+	
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * T3A does 3 things when a TERM is received:
+ * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet
+ * 2) generate an async event on the QP with the TERMINATE opcode
+ * 3) post a TERMINATE opcode cqe into the associated CQ.
+ *
+ * For (1), we save the message in the qp so the consumer can read it later.
+ * For (2), we move the QP into TERMINATE, post a QP event and disconnect.
+ * For (3), we toss the CQE in cxio_poll_cq().
+ * 
+ * terminate() handles case (1)...
+ */
+static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb_pull(skb, sizeof(struct cpl_rdma_terminate));
+	PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len);
+	memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len);
+	ep->com.qp->attr.terminate_msg_len = skb->len;
+	ep->com.qp->attr.is_terminate_local = 0;
+	return CPL_RET_BUF_DONE;
+}
+
+static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_rdma_ec_status *rep = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, 
+	     rep->status);
+	if (rep->status) {
+		struct iwch_qp_attributes attrs;
+
+		printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n",
+		       __FUNCTION__, ep->hwtid);
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(ep->com.qp->rhp,
+			       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+			       &attrs, 1);
+		abort_connection(ep, NULL);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static void ep_timeout(unsigned long arg)
+{
+	struct iwch_ep *ep = (struct iwch_ep *)arg;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) {
+		struct sk_buff *skb;
+
+		connect_reply_upcall(ep, -ETIMEDOUT);
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) {
+		struct sk_buff *skb;
+
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) {
+		struct sk_buff *skb;
+
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		}
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	put_ep(&ep->com);
+}
+
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+	struct iwch_ep *ep = to_ep(cm_id);
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	state_set(&ep->com, CLOSING);
+	if (mpa_rev == 0)
+		abort_connection(ep, NULL);
+	else {
+		err = send_mpa_reject(ep, pdata, pdata_len);
+		err = send_halfclose(ep, GFP_KERNEL);
+	}
+	return 0;
+}
+
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	struct iwch_ep *ep = to_ep(cm_id);
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(!qp);
+
+	if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) ||
+	    (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) {
+		abort_connection(ep, NULL);
+		return -EINVAL;
+	}
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = qp;
+
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord);
+	get_ep(&ep->com);
+	err = send_mpa_reply(ep, conn_param->private_data, 
+			     conn_param->private_data_len);
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+		put_ep(&ep->com);
+		return err;
+	}
+	
+	/* bind QP to EP and move to RTS */
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	/* bind QP and TID with INIT_WR */
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+			     IWCH_QP_ATTR_LLP_STREAM_HANDLE | 
+			     IWCH_QP_ATTR_MPA_ATTR |
+			     IWCH_QP_ATTR_MAX_IRD |
+			     IWCH_QP_ATTR_MAX_ORD;
+
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+	} else {
+		state_set(&ep->com, FPDU_MODE);
+		established_upcall(ep);
+	}
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_ep *ep;
+	struct rtable *rt;
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto out;
+	}
+	init_timer(&ep->timer);
+	ep->plen = conn_param->private_data_len;
+	if (ep->plen)
+		memcpy(ep->mpa_pkt + sizeof(struct mpa_message), 
+		       conn_param->private_data, ep->plen);
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	ep->com.tdev = h->rdev.t3cdev_p;
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = get_qhp(h, conn_param->qpn);
+	BUG_ON(!ep->com.qp);
+	PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, 
+	     ep->com.qp, cm_id);
+
+	/* 
+	 * Allocate an active TID to initiate a TCP connection. 
+	 */
+	ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->atid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	/* find a route */
+	rt = find_route(h->rdev.t3cdev_p,
+			cm_id->local_addr.sin_addr.s_addr,
+			cm_id->remote_addr.sin_addr.s_addr,
+			cm_id->local_addr.sin_port,
+			cm_id->remote_addr.sin_port, IPTOS_LOWDELAY);
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__);
+		err = -EHOSTUNREACH;
+		goto fail3;
+	}
+	ep->dst = &rt->u.dst;
+
+	/* get a l2t entry */
+	ep->l2t = t3_l2t_get(ep->com.tdev,
+			     ep->dst->neighbour,
+			     ep->dst->neighbour->dev->if_port);
+	if (!ep->l2t) {
+		printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail4;
+	}
+
+	state_set(&ep->com, CONNECTING);
+	ep->tos = IPTOS_LOWDELAY;
+	ep->com.local_addr = cm_id->local_addr;
+	ep->com.remote_addr = cm_id->remote_addr;
+
+	/* send connect request to rnic */
+	err = send_connect(ep);
+	if (!err)
+		goto out;
+
+	l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t);
+fail4:
+	dst_release(ep->dst);
+fail3:
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+fail2:
+	put_ep(&ep->com);
+out:
+	return err;
+}
+
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_listen_ep *ep;
+
+
+	might_sleep();
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail1;
+	}
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.tdev = h->rdev.t3cdev_p;
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->backlog = backlog;
+	ep->com.local_addr = cm_id->local_addr;
+
+	/* 
+	 * Allocate a server TID.
+	 */
+	ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->stid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	state_set(&ep->com, LISTEN);
+	err = listen_start(ep);
+	if (err)
+		goto fail3;
+
+	/* wait for pass_open_rpl */
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	err = ep->com.rpl_err;
+	if (!err) {
+		cm_id->provider_data = ep;
+		goto out;
+	}
+fail3:
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+fail2:
+	put_ep(&ep->com);
+fail1:
+out:
+	return err;
+}
+
+int iwch_destroy_listen(struct iw_cm_id *cm_id)
+{
+	int err;
+	struct iwch_listen_ep *ep = to_listen_ep(cm_id);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	might_sleep();
+	state_set(&ep->com, DEAD);
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	err = listen_stop(ep);
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+	err = ep->com.rpl_err;
+	cm_id->rem_ref(cm_id);
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
+{
+	int ret=0;
+	int state;
+
+	
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, 
+	     states[state], abrupt);
+	if (state == DEAD) {
+		PDBG("%s already dead ep %p\n", __FUNCTION__, ep);
+		return 0;
+	}
+	if (abrupt) {
+		if (state != ABORTING) {
+			state_set(&ep->com, ABORTING);
+			ret = send_abort(ep, NULL, gfp);
+		}
+	} else {
+
+		if (state != CLOSING)
+			state_set(&ep->com, CLOSING);
+		else {
+			start_ep_timer(ep);
+			state_set(&ep->com, MORIBUND);
+		}
+
+		ret = send_halfclose(ep, gfp);
+	}
+	return ret;
+}
+
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, 
+		     struct l2t_entry *l2t)
+{
+	struct iwch_ep *ep = ctx;
+	
+	if (ep->dst != old)
+		return 0;
+
+	PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, 
+	     l2t);
+	dst_hold(new);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	ep->l2t = l2t;
+	dst_release(old);
+	ep->dst = new;
+	return 1;
+}
+
+/* 
+ * All the CM events are handled on a work queue to have a safe context.
+ */
+static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep_common *epc = ctx;
+
+	get_ep(epc);
+
+	/*
+	 * Save ctx and tdev in the skb->cb area.
+	 */
+	*((void **) skb->cb) = ctx;
+	*((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev;
+
+	/* 
+	 * Queue the skb and schedule the worker thread.
+	 */
+	skb_queue_tail(&rxq, skb);
+	queue_work(workq, &skb_work);
+	return 0;
+}
+
+int __init iwch_cm_init(void)
+{
+	skb_queue_head_init(&rxq);
+
+	workq = create_singlethread_workqueue("iw_cxgb3");
+	if (!workq)
+		return -ENOMEM;
+
+	/*
+	 * All upcalls from the T3 Core go to sched() to 
+	 * schedule the processing on a work queue.
+	 */
+	t3c_handlers[CPL_ACT_ESTABLISH] = sched;
+	t3c_handlers[CPL_ACT_OPEN_RPL] = sched;
+	t3c_handlers[CPL_RX_DATA] = sched;
+	t3c_handlers[CPL_TX_DMA_ACK] = sched;
+	t3c_handlers[CPL_ABORT_RPL_RSS] = sched;
+	t3c_handlers[CPL_ABORT_RPL] = sched;
+	t3c_handlers[CPL_PASS_OPEN_RPL] = sched;
+	t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched;
+	t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched;
+	t3c_handlers[CPL_PASS_ESTABLISH] = sched;
+	t3c_handlers[CPL_PEER_CLOSE] = sched;
+	t3c_handlers[CPL_CLOSE_CON_RPL] = sched;
+	t3c_handlers[CPL_ABORT_REQ_RSS] = sched;
+	t3c_handlers[CPL_RDMA_TERMINATE] = sched;
+	t3c_handlers[CPL_RDMA_EC_STATUS] = sched;
+
+	/*
+	 * These are the real handlers that are called from a 
+	 * work queue.
+	 */
+	work_handlers[CPL_ACT_ESTABLISH] = act_establish;
+	work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl;
+	work_handlers[CPL_RX_DATA] = rx_data;
+	work_handlers[CPL_TX_DMA_ACK] = tx_ack;
+	work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl;
+	work_handlers[CPL_ABORT_RPL] = abort_rpl;
+	work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl;
+	work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl;
+	work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req;
+	work_handlers[CPL_PASS_ESTABLISH] = pass_establish;
+	work_handlers[CPL_PEER_CLOSE] = peer_close;
+	work_handlers[CPL_ABORT_REQ_RSS] = peer_abort;
+	work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl;
+	work_handlers[CPL_RDMA_TERMINATE] = terminate;
+	work_handlers[CPL_RDMA_EC_STATUS] = ec_status;
+	return 0;
+}
+
+void __exit iwch_cm_term(void)
+{
+	flush_workqueue(workq);
+	destroy_workqueue(workq);
+}
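
A note on the dispatch set up in iwch_cm_init(): every CPL message first
lands in sched(), which parks the handler context and the t3cdev pointer in
skb->cb and defers the real handler to the work queue. The following is a
minimal userspace sketch of that stash-and-recover step only, not the driver
code. struct fake_skb, stash(), recover_ctx() and recover_tdev() are
hypothetical names, the 48-byte buffer size is an assumption, and memcpy()
stands in for the pointer casts used in the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for struct sk_buff; only the control buffer matters. */
struct fake_skb {
	char cb[48];	/* assumed size of the per-packet scratch area */
};

/* Save the handler context and the device pointer at the front of cb[]. */
static void stash(struct fake_skb *skb, void *ctx, void *tdev)
{
	memcpy(skb->cb, &ctx, sizeof(ctx));
	memcpy(skb->cb + sizeof(void *), &tdev, sizeof(tdev));
}

/* Read back the first pointer slot (the handler context). */
static void *recover_ctx(const struct fake_skb *skb)
{
	void *ctx;

	memcpy(&ctx, skb->cb, sizeof(ctx));
	return ctx;
}

/* Read back the second pointer slot (the device). */
static void *recover_tdev(const struct fake_skb *skb)
{
	void *tdev;

	memcpy(&tdev, skb->cb + sizeof(void *), sizeof(tdev));
	return tdev;
}
```

A worker would presumably call the two recover helpers on each dequeued
buffer and then index work_handlers[] by the CPL opcode. That part is not
shown in this hunk, so it is left out of the sketch.
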
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
new file mode 100644
index 0000000..893f9d0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _IWCH_CM_H_
+#define _IWCH_CM_H_
+
+#include <linux/inet.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/iw_cm.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+
+#define MPA_KEY_REQ "MPA ID Req Frame"
+#define MPA_KEY_REP "MPA ID Rep Frame"
+
+#define MPA_MAX_PRIVATE_DATA 	256
+#define MPA_REV 		0	/* XXX - amso1100 uses rev 0 ! */
+#define MPA_REJECT 		0x20
+#define MPA_CRC			0x40
+#define MPA_MARKERS		0x80
+#define MPA_FLAGS_MASK		0xE0
+
+#define put_ep(ep) do { \
+	PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__,  \
+	     ep, atomic_read(&((ep)->kref.refcount))); \
+	kref_put(&((ep)->kref), __free_ep); \
+} while (0)
+
+#define get_ep(ep) do { \
+	PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \
+	     ep, atomic_read(&((ep)->kref.refcount))); \
+	kref_get(&((ep)->kref));  \
+} while (0)
+
+struct mpa_message {
+	u8 key[16];
+	u8 flags;
+	u8 revision;
+	__be16 private_data_size;
+	u8 private_data[0];
+};
+
+struct terminate_message {
+	u8 layer_etype;
+	u8 ecode;
+	__be16 hdrct_rsvd;
+	u8 len_hdrs[0];
+};
+
+#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28)
+
+enum iwch_layers_types {
+	LAYER_RDMAP 		= 0x00,
+	LAYER_DDP		= 0x10,
+	LAYER_MPA		= 0x20,
+	RDMAP_LOCAL_CATA	= 0x00,
+	RDMAP_REMOTE_PROT	= 0x01,
+	RDMAP_REMOTE_OP		= 0x02,
+	DDP_LOCAL_CATA		= 0x00,
+	DDP_TAGGED_ERR		= 0x01,
+	DDP_UNTAGGED_ERR	= 0x02,
+	DDP_LLP			= 0x03
+};
+
+enum iwch_rdma_ecodes {
+	RDMAP_INV_STAG		= 0x00,
+	RDMAP_BASE_BOUNDS	= 0x01,
+	RDMAP_ACC_VIOL		= 0x02,
+	RDMAP_STAG_NOT_ASSOC	= 0x03,
+	RDMAP_TO_WRAP		= 0x04,
+	RDMAP_INV_VERS		= 0x05,
+	RDMAP_INV_OPCODE	= 0x06,
+	RDMAP_STREAM_CATA	= 0x07,
+	RDMAP_GLOBAL_CATA	= 0x08,
+	RDMAP_CANT_INV_STAG	= 0x09,
+	RDMAP_UNSPECIFIED	= 0xff	
+};
+
+enum iwch_ddp_ecodes {
+	DDPT_INV_STAG		= 0x00,
+	DDPT_BASE_BOUNDS	= 0x01,
+	DDPT_STAG_NOT_ASSOC	= 0x02,
+	DDPT_TO_WRAP		= 0x03,
+	DDPT_INV_VERS		= 0x04,
+	DDPU_INV_QN		= 0x01,
+	DDPU_INV_MSN_NOBUF	= 0x02,
+	DDPU_INV_MSN_RANGE	= 0x03,
+	DDPU_INV_MO		= 0x04,
+	DDPU_MSG_TOOBIG		= 0x05,
+	DDPU_INV_VERS		= 0x06
+};
+
+enum iwch_mpa_ecodes {
+	MPA_CRC_ERR		= 0x02,
+	MPA_MARKER_ERR		= 0x03
+};
+
+enum iwch_ep_state {
+	IDLE = 0,
+	LISTEN,	
+	CONNECTING,
+	MPA_REQ_WAIT,
+	MPA_REQ_SENT,
+	MPA_REQ_RCVD,
+	MPA_REP_SENT,
+	FPDU_MODE,
+	ABORTING,
+	CLOSING,
+	MORIBUND,
+	DEAD,
+};
+
+struct iwch_ep_common {
+	struct iw_cm_id *cm_id;
+	struct iwch_qp *qp;
+	struct t3cdev *tdev;
+	enum iwch_ep_state state;
+	struct kref kref;
+	spinlock_t lock;
+	struct sockaddr_in local_addr;
+	struct sockaddr_in remote_addr;
+	wait_queue_head_t waitq;
+	int rpl_done;
+	int rpl_err;
+};
+
+struct iwch_listen_ep {
+	struct iwch_ep_common com;
+	unsigned int stid;
+	int backlog;
+};
+
+struct iwch_ep {
+	struct iwch_ep_common com;
+	struct iwch_ep *parent_ep;
+	struct timer_list timer;
+	unsigned int atid;
+	u32 hwtid;
+	u32 snd_seq;
+	struct l2t_entry *l2t;
+	struct dst_entry *dst;
+	struct sk_buff *mpa_skb;
+	struct iwch_mpa_attributes mpa_attr;
+	unsigned int mpa_pkt_len;
+	u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA];
+	u8 tos;
+	u16 emss;
+	u16 plen;
+	u32 ird;
+	u32 ord;
+};
+
+static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_ep *)cm_id->provider_data;
+}
+
+static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_listen_ep *)cm_id->provider_data;
+}
+
+static inline int compute_wscale(int win)
+{
+	int wscale = 0;
+
+	while (wscale < 14 && (65535<<wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+/* CM prototypes */
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog);
+int iwch_destroy_listen(struct iw_cm_id *cm_id);
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp);
+int iwch_quiesce_tid(struct iwch_ep *ep);
+int iwch_resume_tid(struct iwch_ep *ep);
+void __free_ep(struct kref *kref);
+void iwch_rearp(struct iwch_ep *ep);
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t);
+
+int __init iwch_cm_init(void);
+void __exit iwch_cm_term(void);
+
+#endif				/* _IWCH_CM_H_ */
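
A note on compute_wscale() in the header above: it returns the smallest
shift, capped at 14 (the TCP window-scale limit), for which 65535 << wscale
is at least the requested receive window. The block below is a userspace
copy of the helper so the arithmetic can be checked outside the kernel. It
assumes a 32-bit or wider int and is meant as an illustration only.

```c
#include <assert.h>

/* Smallest shift s in [0, 14] such that (65535 << s) >= win. */
static int compute_wscale(int win)
{
	int wscale = 0;

	/* The shift never exceeds 13 inside the test, so it cannot overflow. */
	while (wscale < 14 && (65535 << wscale) < win)
		wscale++;
	return wscale;
}
```

For example, a 65535-byte window needs no scaling, 65536 bytes needs a
shift of 1, and 256 KB (262144 bytes) needs a shift of 3 because
65535 << 2 is 262140, four bytes short.
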


From swise at opengridcomputing.com  Sat Dec  2 14:50:08 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:08 -0600
Subject: [openib-general] [PATCH  v2 05/13] Queue Pairs
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225008.27014.4428.stgit@dell3.ogc.int>


Code to manipulate the QP.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
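
A note on the send and write builders in this patch: both sum the SGE
lengths into a u32 and return -EMSGSIZE when the sum wraps, and both set
the flit count to a fixed header size (4 for send, 5 for write) plus two
flits per SGE. The block below is a minimal userspace sketch of those two
calculations. sum_sge_lengths() and flit_count() are hypothetical names
that do not appear in the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum SGE lengths into *plen, mirroring the wrap check in the builders:
 * unsigned addition wraps modulo 2^32, so a sum smaller than the running
 * total means the payload no longer fits in 32 bits.
 */
static int sum_sge_lengths(const uint32_t *len, size_t n, uint32_t *plen)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if ((uint32_t)(total + len[i]) < total)
			return -EMSGSIZE;
		total += len[i];
	}
	*plen = total;
	return 0;
}

/* Header flits (4 for send, 5 for write in the patch) plus two per SGE. */
static unsigned int flit_count(unsigned int hdr_flits, unsigned int num_sge)
{
	return hdr_flits + (num_sge << 1);
}
```

Two flits per SGE matches a 16-byte entry (stag, len and a 64-bit address)
if a flit is 8 bytes, which the read builder's sizeof(...) >> 3 suggests.
The 8-byte flit size is an inference from the code, not something the patch
states.
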

 drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++
 1 files changed, 1007 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
new file mode 100644
index 0000000..9f6b251
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -0,0 +1,1007 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+
+#define NO_SUPPORT -1
+
+static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+
+	switch (wr->opcode) {
+	case IB_WR_SEND:
+	case IB_WR_SEND_WITH_IMM:
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			wqe->send.rdmaop = T3_SEND_WITH_SE;
+		else
+			wqe->send.rdmaop = T3_SEND;
+		wqe->send.rem_stag = 0;
+		break;
+#if 0				/* Not currently supported */
+	case TYPE_SEND_INVALIDATE:
+	case TYPE_SEND_INVALIDATE_IMMEDIATE:
+		wqe->send.rdmaop = T3_SEND_WITH_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+	case TYPE_SEND_SE_INVALIDATE:
+		wqe->send.rdmaop = T3_SEND_WITH_SE_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+#endif
+	default:
+		break;
+	}
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->send.reserved[0] = 0;
+	wqe->send.reserved[1] = 0;
+	wqe->send.reserved[2] = 0;
+	if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+		plen = 4;
+		wqe->send.sgl[0].stag = wr->imm_data;
+		wqe->send.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->send.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 5;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->send.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->send.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->send.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 4 + ((wr->num_sge) << 1);
+	}
+	wqe->send.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr,
+					u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->write.rdmaop = T3_RDMA_WRITE;
+	wqe->write.reserved[0] = 0;
+	wqe->write.reserved[1] = 0;
+	wqe->write.reserved[2] = 0;
+	wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+	if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+		plen = 4;
+		wqe->write.sgl[0].stag = wr->imm_data;
+		wqe->write.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->write.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 6;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->write.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->write.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->write.sgl[i].to =
+			    cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 5 + ((wr->num_sge) << 1);
+	}
+	wqe->write.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	if (wr->num_sge > 1)
+		return -EINVAL;
+	wqe->read.rdmaop = T3_READ_REQ;
+	wqe->read.reserved[0] = 0;
+	wqe->read.reserved[1] = 0;
+	wqe->read.reserved[2] = 0;
+	wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+	wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+	wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+	wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+	*flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+	return 0;
+}
+
+/* 
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */
+static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp,
+				   struct ib_sge *sg_list, u32 num_sgle,
+				   u32 * pbl_addr, u8 * page_size)
+{
+	int i;
+	struct iwch_mr *mhp;
+	u32 offset;
+	for (i = 0; i < num_sgle; i++) {
+
+		mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8);
+		if (!mhp) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (!mhp->attr.state) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (mhp->attr.zbva) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+
+		if (sg_list[i].addr < mhp->attr.va_fbo) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) <
+		    sg_list[i].addr) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) >
+		    mhp->attr.va_fbo + ((u64) mhp->attr.len)) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		offset = sg_list[i].addr - mhp->attr.va_fbo;
+		offset += ((u32) mhp->attr.va_fbo) %
+		          (1UL << (12 + mhp->attr.page_size));
+		pbl_addr[i] = ((mhp->attr.pbl_addr - 
+			        rhp->rdev.rnic_info.pbl_base) >> 3) +
+			      (offset >> (12 + mhp->attr.page_size));
+		page_size[i] = mhp->attr.page_size;
+	}
+	return 0;
+}
+
+static inline int iwch_build_rdma_recv(struct iwch_dev *rhp,
+						    union t3_wr *wqe,
+						    struct ib_recv_wr *wr)
+{
+	int i, err = 0;
+	u32 pbl_addr[4];
+	u8 page_size[4];
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, 
+			       page_size);
+	if (err)
+		return err;
+	wqe->recv.pagesz[0] = page_size[0];
+	wqe->recv.pagesz[1] = page_size[1];
+	wqe->recv.pagesz[2] = page_size[2];
+	wqe->recv.pagesz[3] = page_size[3];
+	wqe->recv.num_sgle = cpu_to_be32(wr->num_sge);
+	for (i = 0; i < wr->num_sge; i++) {
+		wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey);
+		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
+		
+		/* to in the WQE == the offset into the page */
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
+				(1UL << (12 + page_size[i])));
+
+		/* pbl_addr is the adapter's address in the PBL */
+		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
+	}
+	for (; i < T3_MAX_SGE; i++) {
+		wqe->recv.sgl[i].stag = 0;
+		wqe->recv.sgl[i].len = 0;
+		wqe->recv.sgl[i].to = 0;
+		wqe->recv.pbl_addr[i] = 0;
+	}
+	return 0;
+}
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr)
+{
+	int err = 0;
+	u8 t3_wr_flit_cnt;
+	enum t3_wr_opcode t3_wr_opcode = 0;
+	enum t3_wr_flags t3_wr_flags;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+		  qhp->wq.sq_size_log2);
+	if (num_wrs <= 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	while (wr) {
+		if (num_wrs == 0) {
+			err = -ENOMEM;
+			*bad_wr = wr;
+			break;
+		}
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		t3_wr_flags = 0;
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			t3_wr_flags |= T3_SOLICITED_EVENT_FLAG;
+		if (wr->send_flags & IB_SEND_FENCE)
+			t3_wr_flags |= T3_READ_FENCE_FLAG;
+		if (wr->send_flags & IB_SEND_SIGNALED)
+			t3_wr_flags |= T3_COMPLETION_FLAG;
+		sqp = qhp->wq.sq + 
+		      Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+		switch (wr->opcode) {
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_IMM:
+			t3_wr_opcode = T3_WR_SEND;
+			err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			t3_wr_opcode = T3_WR_WRITE;
+			err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_READ:
+			t3_wr_opcode = T3_WR_READ;
+			t3_wr_flags = 0; /* T3 reads are always signaled */
+			err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt);
+			if (err) 
+				break;
+			sqp->read_len = wqe->read.local_len;
+			if (!qhp->wq.oldest_read)
+				qhp->wq.oldest_read = sqp;
+			break;
+		default:
+			PDBG("%s post of type=%d TBD!\n", __FUNCTION__,
+			     wr->opcode);
+			err = -EINVAL;
+		}
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+		sqp->wr_id = wr->wr_id;
+		sqp->opcode = wr2opcode(t3_wr_opcode);
+		sqp->sq_wptr = qhp->wq.sq_wptr;
+		sqp->complete = 0;
+		sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+
+		build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, t3_wr_flit_cnt);
+		PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", 
+		     __FUNCTION__, wr->wr_id, idx, 
+		     Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2),
+		     sqp->opcode);
+		wr = wr->next;
+		num_wrs--;
+		++(qhp->wq.wptr);
+		++(qhp->wq.sq_wptr);
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr)
+{
+	int err = 0;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, 
+			    qhp->wq.rq_size_log2) - 1;
+	if (!wr) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	while (wr) {
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		if (num_wrs)
+			err = iwch_build_rdma_recv(qhp->rhp, wqe, wr);
+		else
+			err = -ENOMEM;
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = 
+			wr->wr_id;
+		build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, sizeof(struct t3_receive_wr) >> 3);
+		PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x "
+		     "wqe %p \n", __FUNCTION__, wr->wr_id, idx, 
+		     qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe);
+		++(qhp->wq.rq_wptr);
+		++(qhp->wq.wptr);
+		wr = wr->next;
+		num_wrs--;
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	struct iwch_qp *qhp;
+	union t3_wr *wqe;
+	u32 pbl_addr;
+	u8 page_size;
+	u32 num_wrs;
+	unsigned long flag;
+	struct ib_sge sgl;
+	int err=0;
+	enum t3_wr_flags t3_wr_flags;
+	u32 idx;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(qp);
+	mhp = to_iwch_mw(mw);
+	rhp = qhp->rhp;
+
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+			    qhp->wq.sq_size_log2);
+	if ((num_wrs) <= 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+	PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, 
+	     mw, mw_bind);
+	wqe = (union t3_wr *) (qhp->wq.queue + idx);
+
+	t3_wr_flags = 0;
+	if (mw_bind->send_flags & IB_SEND_SIGNALED)
+		t3_wr_flags = T3_COMPLETION_FLAG;
+
+	sgl.addr = mw_bind->addr;
+	sgl.lkey = mw_bind->mr->lkey;
+	sgl.length = mw_bind->length;
+	wqe->bind.reserved = 0;
+	wqe->bind.type = T3_VA_BASED_TO;
+
+	/* TBD: check perms */
+	wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags);
+	wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey);
+	wqe->bind.mw_stag = cpu_to_be32(mw->rkey);
+	wqe->bind.mw_len = cpu_to_be32(mw_bind->length);
+	wqe->bind.mw_va = cpu_to_be64(mw_bind->addr);
+	err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size);
+	if (err) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return err;
+	}
+	wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+	sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+	sqp->wr_id = mw_bind->wr_id;
+	sqp->opcode = T3_BIND_MW;
+	sqp->sq_wptr = qhp->wq.sq_wptr;
+	sqp->complete = 0;
+	sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED);
+	wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr);
+	wqe->bind.mr_pagesz = page_size;
+	wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id;
+	build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags,
+		       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, 
+			        sizeof(struct t3_bind_mw_wr) >> 3);
+	++(qhp->wq.wptr);
+	++(qhp->wq.sq_wptr);
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+
+	return err;
+}
+
+static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode,
+				    int tagged)
+{
+	switch (t3err) {
+	case TPT_ERR_STAG:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_STAG;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_INV_STAG;
+		}
+		break;
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_STAG_NOT_ASSOC;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_STAG_NOT_ASSOC;
+		}
+		break;
+	case TPT_ERR_WRAP:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+		*ecode = RDMAP_TO_WRAP;
+		break;
+	case TPT_ERR_BOUND:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_BASE_BOUNDS;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_BASE_BOUNDS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_MSG_TOOBIG;
+		}
+		break;
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_CANT_INV_STAG;
+		break;
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		*layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_OUT_OF_RQE:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_NOBUF;
+		break;
+	case TPT_ERR_PBL_ADDR_BOUND:
+		*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+		*ecode = DDPT_BASE_BOUNDS;
+		break;
+	case TPT_ERR_CRC:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_CRC_ERR;
+		break;
+	case TPT_ERR_MARKER:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_MARKER_ERR;
+		break;
+	case TPT_ERR_PDU_LEN_ERR:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_MSG_TOOBIG;
+		break;
+	case TPT_ERR_DDP_VERSION:
+		if (tagged) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_VERS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_INV_VERS;
+		}
+		break;
+	case TPT_ERR_RDMA_VERSION:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_VERS;
+		break;
+	case TPT_ERR_OPCODE:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_OPCODE;
+		break;
+	case TPT_ERR_DDP_QUEUE_NUM:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_QN;
+		break;
+	case TPT_ERR_MSN:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_IRD_OVERFLOW:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_RANGE;
+		break;
+	case TPT_ERR_TBIT:
+		*layer_type = LAYER_DDP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_MO:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MO;
+		break;
+	default: 
+		*layer_type = LAYER_RDMAP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	}
+}
+
+/*
+ * This posts a TERMINATE with layer=RDMA, type=catastrophic.
+ */
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
+{
+	union t3_wr *wqe;
+	struct terminate_message *term;
+	int status;
+	int tagged = 0;
+	struct sk_buff *skb;
+
+	PDBG("%s %d\n", __FUNCTION__, __LINE__);
+	skb = alloc_skb(40, GFP_ATOMIC);
+	if (!skb) {
+		printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (union t3_wr *)skb_put(skb, 40);
+	memset(wqe, 0, 40);
+	wqe->send.rdmaop = T3_TERMINATE;
+	
+	/* immediate data length */
+	wqe->send.plen = htonl(4);
+
+	/* immediate data starts here. */
+	term = (struct terminate_message *)wqe->send.sgl;
+	if (rsp_msg) {
+		status = CQE_STATUS(rsp_msg->cqe);
+		if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)
+			tagged = 1;
+		if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) ||
+		    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP))
+			tagged = 2;
+	} else {
+		status = TPT_ERR_INTERNAL_ERR;
+	}
+	build_term_codes(status, &term->layer_etype, &term->ecode, tagged);
+	build_fw_riwrh((void *)wqe, T3_WR_SEND, 
+		       T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, 
+		       qhp->ep->hwtid, 5);
+	skb->priority = CPL_PRIORITY_DATA;
+	return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb));
+}
+
+/*
+ * Assumes qhp lock is held.
+ */
+static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	struct iwch_cq *rchp, *schp;
+	int count;
+
+	rchp = get_chp(qhp->rhp, qhp->attr.rcq);
+	schp = get_chp(qhp->rhp, qhp->attr.scq);
+	
+	PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp);
+	/* take a ref on the qhp since we must release the lock */
+	atomic_inc(&qhp->refcnt);
+	spin_unlock_irqrestore(&qhp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&rchp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&rchp->cq);
+	cxio_count_rcqes(&rchp->cq, &qhp->wq, &count);
+	cxio_flush_rq(&qhp->wq, &rchp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&rchp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&schp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&schp->cq);
+	cxio_count_scqes(&schp->cq, &qhp->wq, &count);
+	cxio_flush_sq(&qhp->wq, &schp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&schp->lock, *flag);
+
+	/* deref */
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+
+	spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	if (t3b_device(qhp->rhp))
+		cxio_set_wq_in_error(&qhp->wq);
+	else
+		__flush_qp(qhp, flag);
+}
+
+
+/* 
+ * Return nonzero if at least one RECV was pre-posted.
+ */
+static inline int rqes_posted(struct iwch_qp *qhp)
+{ 
+	return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs)
+{
+	struct t3_rdma_init_attr init_attr;
+	int ret;
+
+	init_attr.tid = qhp->ep->hwtid;
+	init_attr.qpid = qhp->wq.qpid;
+	init_attr.pdid = qhp->attr.pd;
+	init_attr.scqid = qhp->attr.scq;
+	init_attr.rcqid = qhp->attr.rcq;
+	init_attr.rq_addr = qhp->wq.rq_addr;
+	init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+	init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | 
+		qhp->attr.mpa_attr.recv_marker_enabled |
+		(qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+		(qhp->attr.mpa_attr.crc_enabled << 2);
+
+	/* 
+	 * XXX - The IWCM doesn't quite handle getting these
+ 	 * attrs set before going into RTS.  For now, just turn 
+	 * them on always...
+	 */
+#if 0
+	init_attr.qpcaps = qhp->attr.enableRdmaRead |
+		(qhp->attr.enableRdmaWrite << 1) |
+		(qhp->attr.enableBind << 2) |
+		(qhp->attr.enable_stag0_fastreg << 3) |
+		(qhp->attr.enable_stag0_fastreg << 4);
+#else
+	init_attr.qpcaps = 0x1f;
+#endif
+	init_attr.tcp_emss = qhp->ep->emss;
+	init_attr.ord = qhp->attr.max_ord;
+	init_attr.ird = qhp->attr.max_ird;
+	init_attr.qp_dma_addr = qhp->wq.dma_addr;
+	init_attr.qp_dma_size = (1UL << qhp->wq.size_log2);
+	init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0;
+	PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d "
+	     "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, 
+	     init_attr.rq_addr, init_attr.rq_size, 
+	     init_attr.flags, init_attr.qpcaps);
+	ret = cxio_rdma_init(&rhp->rdev, &init_attr);
+	PDBG("%s ret %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal)
+{
+	int ret = 0;
+	struct iwch_qp_attributes newattr = qhp->attr;
+	unsigned long flag;
+	int disconnect = 0;
+	int terminate = 0;
+	int abort = 0;
+	int free = 0;
+	struct iwch_ep *ep = NULL;
+
+	PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, 
+	     qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, 
+	     (mask & IWCH_QP_ATTR_NEXT_STATE) ? attrs->next_state : -1);
+
+	spin_lock_irqsave(&qhp->lock, flag);
+
+	/* Process attr changes if in IDLE */
+	if (mask & IWCH_QP_ATTR_VALID_MODIFY) {
+		if (qhp->attr.state != IWCH_QP_STATE_IDLE) {
+			ret = -EIO;
+			goto out;
+		}
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ)
+			newattr.enable_rdma_read = attrs->enable_rdma_read;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE)
+			newattr.enable_rdma_write = attrs->enable_rdma_write;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND)
+			newattr.enable_bind = attrs->enable_bind;
+		if (mask & IWCH_QP_ATTR_MAX_ORD) {
+			if (attrs->max_ord > 
+			    rhp->attr.max_rdma_read_qp_depth) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ord = attrs->max_ord;
+		}
+		if (mask & IWCH_QP_ATTR_MAX_IRD) {
+			if (attrs->max_ird > 
+		  	    rhp->attr.max_rdma_reads_per_qp) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ird = attrs->max_ird;
+		}
+		qhp->attr = newattr;
+	}
+	
+	if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) 
+		goto out;
+	if (qhp->attr.state == attrs->next_state)
+		goto out;
+
+	switch (qhp->attr.state) {
+	case IWCH_QP_STATE_IDLE:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_RTS: 
+			if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			qhp->attr.mpa_attr = attrs->mpa_attr;
+			qhp->attr.llp_stream_handle = attrs->llp_stream_handle;
+			qhp->ep = qhp->attr.llp_stream_handle;
+			qhp->attr.state = IWCH_QP_STATE_RTS;
+
+			/*
+			 * Ref the endpoint here and deref when we
+	 		 * disassociate the endpoint from the QP.  This
+			 * happens in CLOSING->IDLE transition or *->ERROR
+			 * transition.
+			 */
+			get_ep(&qhp->ep->com);
+			spin_unlock_irqrestore(&qhp->lock, flag);
+			ret = rdma_init(rhp, qhp, mask, attrs);
+			spin_lock_irqsave(&qhp->lock, flag);
+			if (ret)
+				goto err;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			flush_qp(qhp, &flag);
+			break;
+		default:
+			ret = -EINVAL;	
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_RTS:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_CLOSING:
+			BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+			qhp->attr.state = IWCH_QP_STATE_CLOSING;
+			if (!internal) {
+				abort=0;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			break;
+		case IWCH_QP_STATE_TERMINATE:
+			qhp->attr.state = IWCH_QP_STATE_TERMINATE;
+			if (!internal) 
+				terminate = 1;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			if (!internal) {
+				abort=1;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			goto err;
+			break;
+		default:
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_CLOSING:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		switch (attrs->next_state) {
+			case IWCH_QP_STATE_IDLE:
+				qhp->attr.state = IWCH_QP_STATE_IDLE;
+				qhp->attr.llp_stream_handle = NULL;
+				put_ep(&qhp->ep->com);
+				qhp->ep = NULL;
+				wake_up(&qhp->wait);
+				break;
+			case IWCH_QP_STATE_ERROR:
+				goto err;
+			default:
+				ret = -EINVAL;
+				goto err;
+		}
+		break;
+	case IWCH_QP_STATE_ERROR:
+		if (attrs->next_state != IWCH_QP_STATE_IDLE) {
+			ret = -EINVAL;
+			goto out;
+		}
+		
+		if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || 
+		    !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		qhp->attr.state = IWCH_QP_STATE_IDLE;
+		memset(&qhp->attr, 0, sizeof(qhp->attr));
+		break;
+	case IWCH_QP_STATE_TERMINATE:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		goto err;
+		break;
+	default:
+		printk(KERN_ERR "%s in a bad state %d\n", 
+		       __FUNCTION__, qhp->attr.state);
+		ret = -EINVAL;
+		goto err;
+		break;
+	}
+	goto out;
+err:
+	PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, 
+	     qhp->wq.qpid);
+
+	/* disassociate the LLP connection */
+	qhp->attr.llp_stream_handle = NULL;
+	ep = qhp->ep;
+	qhp->ep = NULL;
+	qhp->attr.state = IWCH_QP_STATE_ERROR;
+	free=1;
+	wake_up(&qhp->wait);
+	BUG_ON(!ep);
+	flush_qp(qhp, &flag);
+out:
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	if (terminate)
+		iwch_post_terminate(qhp, NULL);
+
+	/*
+	 * If disconnect is 1, then we need to initiate a disconnect
+	 * on the EP.  This can be a normal close (RTS->CLOSING) or
+	 * an abnormal close (RTS/CLOSING->ERROR).
+	 */
+	if (disconnect)
+		iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+
+	/* 
+	 * If free is 1, then we've disassociated the EP from the QP 
+	 * and we need to dereference the EP.
+	 */
+	if (free)
+		put_ep(&ep->com);
+
+	PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state);
+	return ret;
+}
+
+static int quiesce_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_quiesce_tid(qhp->ep);
+	qhp->flags |= QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+static int resume_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_resume_tid(qhp->ep);
+	qhp->flags &= ~QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+int iwch_quiesce_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) {
+			quiesce_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) 
+			quiesce_qp(qhp);
+	}
+	return 0;
+}
+
+int iwch_resume_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) {
+			resume_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp))
+			resume_qp(qhp);
+	}
+	return 0;
+}
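
A note on iwch_sgl2pbl_map() and the recv "to" field in the patch above: the address arithmetic can be tried in userspace with the sketch below. It is only an illustration, not driver code. struct mr_sketch and both function names are made up for this note, and the reading of ">> 3" as "PBL entries are 8 bytes wide" is an assumption the patch does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the MR attributes the patch reads. */
struct mr_sketch {
	uint64_t va_fbo;	/* starting virtual address of the region */
	uint32_t pbl_addr;	/* adapter byte address of the region's PBL */
	uint8_t page_size;	/* page shift minus 12, so 0 means 4 KB pages */
};

/*
 * Mirror of the pbl_addr[i] computation: take the byte offset of the SGE
 * inside the region, add the region's own offset into its first page, and
 * turn the result into a page count on top of the region's PBL index.
 * The ">> 3" is assumed to convert a byte address into an entry index.
 */
static uint32_t sge_to_pbl_index(const struct mr_sketch *mr,
				 uint32_t pbl_base, uint64_t addr)
{
	unsigned int shift = 12 + mr->page_size;
	uint32_t offset;

	/* The patch rejects addresses below va_fbo with -EINVAL. */
	assert(addr >= mr->va_fbo);
	offset = (uint32_t)(addr - mr->va_fbo);
	offset += (uint32_t)mr->va_fbo % (1UL << shift);
	return ((mr->pbl_addr - pbl_base) >> 3) + (offset >> shift);
}

/* Mirror of the recv WQE "to" field: the offset of addr within its page. */
static uint64_t sge_page_offset(uint64_t addr, uint8_t page_size)
{
	return (uint32_t)addr % (1UL << (12 + page_size));
}
```

With va_fbo 0x10000200, a PBL at byte 0x2040 against a base of 0x2000, and 4 KB pages, the address 0x10002300 lands on PBL index 10 (8 for the region start plus 2 pages) with a page offset of 0x300.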

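A note on the length accounting in iwch_build_rdma_send(): the wrap check and the flit count can be exercised outside the kernel with a sketch like the one below. struct sge_sketch and sum_sge_lengths() are hypothetical names used only here; the real code works on struct ib_sge and writes into the WQE as it goes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ib_sge; only the length matters here. */
struct sge_sketch {
	uint32_t length;
};

/*
 * Sum the SGE lengths into *plen and reject a list whose total would wrap
 * a 32-bit counter. The flit count follows the send-path formula from the
 * patch: 4 flits of header plus two flits per SGE.
 */
static int sum_sge_lengths(const struct sge_sketch *sgl, int num_sge,
			   uint32_t *plen, uint8_t *flit_cnt)
{
	uint32_t total = 0;
	int i;

	assert(plen != NULL && flit_cnt != NULL);
	for (i = 0; i < num_sge; i++) {
		/* Unsigned addition wraps, so a smaller sum means overflow. */
		if ((uint32_t)(total + sgl[i].length) < total)
			return -EMSGSIZE;
		total += sgl[i].length;
	}
	*plen = total;
	*flit_cnt = (uint8_t)(4 + (num_sge << 1));
	return 0;
}
```

Two SGEs of 4096 and 512 bytes give plen 4608 and a flit count of 8. Two SGEs of 0xffffff00 and 0x200 bytes wrap the counter and return -EMSGSIZE. The write path in the patch uses the same loop with a base of 5 flits instead of 4.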

From swise at opengridcomputing.com  Sat Dec  2 14:50:18 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:18 -0600
Subject: [openib-general] [PATCH  v2 06/13] Completion Queues
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225018.27014.78386.stgit@dell3.ogc.int>


Functions to manipulate CQs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cq.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c
new file mode 100644
index 0000000..9d82df4
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+
+/*
+ * Get one cq entry from cxio and map it to openib.
+ *
+ * Returns:
+ * 	0 			EMPTY;
+ *	1			cqe returned
+ *	-EAGAIN 		caller must try again
+ * 	any other -errno	fatal error
+ */
+int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp,
+		     struct ib_wc *wc)
+{
+	struct iwch_qp *qhp = NULL;
+	struct t3_cqe cqe, *rd_cqe;
+	struct t3_wq *wq;
+	u32 credit = 0;
+	u8 cqe_flushed;
+	u64 cookie;
+	int ret = 1;
+
+	rd_cqe = cxio_next_cqe(&chp->cq);
+
+	if (!rd_cqe)
+		return 0;
+
+	qhp = get_qhp(rhp, CQE_QPID(*rd_cqe));
+	if (!qhp)
+		wq = NULL;
+	else {
+		spin_lock(&qhp->lock);
+		wq = &(qhp->wq);
+	}
+	ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie,
+				   &credit);
+	if (t3a_device(chp->rhp) && credit) {
+		PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, 
+		     credit, chp->cq.cqid);
+		cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit);
+	}
+
+	if (ret) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	ret = 1;
+
+	wc->wr_id = cookie;
+	wc->qp_num = qhp->wq.qpid;
+	wc->vendor_err = CQE_STATUS(cqe);
+
+	PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x "
+	     "lo 0x%x cookie 0x%llx\n", __FUNCTION__, 
+	     CQE_QPID(cqe), CQE_TYPE(cqe),
+	     CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe),
+	     CQE_WRID_LOW(cqe), cookie);
+
+	if (CQE_TYPE(cqe) == 0) {
+		if (!CQE_STATUS(cqe))
+			wc->byte_len = CQE_LEN(cqe);
+		else
+			wc->byte_len = 0;
+		wc->opcode = IB_WC_RECV;
+	} else {
+		switch (CQE_OPCODE(cqe)) {
+		case T3_RDMA_WRITE:
+			wc->opcode = IB_WC_RDMA_WRITE;
+			break;
+		case T3_READ_REQ:
+			wc->opcode = IB_WC_RDMA_READ;
+			wc->byte_len = CQE_LEN(cqe);
+			break;
+		case T3_SEND:
+		case T3_SEND_WITH_SE:
+			wc->opcode = IB_WC_SEND;
+			break;
+		case T3_BIND_MW:
+			wc->opcode = IB_WC_BIND_MW;
+			break;
+
+		/* these aren't supported yet */
+		case T3_SEND_WITH_INV:
+		case T3_SEND_WITH_SE_INV:
+		case T3_LOCAL_INV:
+		case T3_FAST_REGISTER:
+		default:
+			printk(KERN_ERR MOD "Unexpected opcode %d "
+			       "in the CQE received for QPID=0x%0x\n", 
+			       CQE_OPCODE(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (cqe_flushed)
+		wc->status = IB_WC_WR_FLUSH_ERR;
+	else {
+		
+		switch (CQE_STATUS(cqe)) {
+		case TPT_ERR_SUCCESS:
+			wc->status = IB_WC_SUCCESS;
+			break;
+		case TPT_ERR_STAG:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_PDID:
+			wc->status = IB_WC_LOC_PROT_ERR;
+			break;
+		case TPT_ERR_QPID:
+		case TPT_ERR_ACCESS:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_WRAP:
+			wc->status = IB_WC_GENERAL_ERR;
+			break;
+		case TPT_ERR_BOUND:
+			wc->status = IB_WC_LOC_LEN_ERR;
+			break;
+		case TPT_ERR_INVALIDATE_SHARED_MR:
+		case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+			wc->status = IB_WC_MW_BIND_ERR;
+			break;
+		case TPT_ERR_CRC:
+		case TPT_ERR_MARKER:
+		case TPT_ERR_PDU_LEN_ERR:
+		case TPT_ERR_OUT_OF_RQE:
+		case TPT_ERR_DDP_VERSION:
+		case TPT_ERR_RDMA_VERSION:
+		case TPT_ERR_DDP_QUEUE_NUM:
+		case TPT_ERR_MSN:
+		case TPT_ERR_TBIT:
+		case TPT_ERR_MO:
+		case TPT_ERR_MSN_RANGE:
+		case TPT_ERR_IRD_OVERFLOW:
+		case TPT_ERR_OPCODE:
+			wc->status = IB_WC_FATAL_ERR;
+			break;
+		case TPT_ERR_SWFLUSH:
+			wc->status = IB_WC_WR_FLUSH_ERR;
+			break;
+		default:
+			printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for "
+			       "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+		}
+	}
+out:
+	if (wq)
+		spin_unlock(&qhp->lock);
+	return ret;
+}
+
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	unsigned long flags;
+	int npolled;
+	int err = 0;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+
+	spin_lock_irqsave(&chp->lock, flags);
+	for (npolled = 0; npolled < num_entries; ++npolled) {
+#ifdef DEBUG
+		int i=0;
+#endif
+
+		/*
+	 	 * Because T3 can post CQEs that are _not_ associated
+	 	 * with a WR, we might have to poll again after removing
+	 	 * one of these.  
+		 */
+		do {
+			err = iwch_poll_cq_one(rhp, chp, wc + npolled);
+#ifdef DEBUG
+			BUG_ON(++i > 1000);
+#endif
+		} while (err == -EAGAIN);
+		if (err <= 0)
+			break;
+	}
+	spin_unlock_irqrestore(&chp->lock, flags);
+
+	if (err < 0)
+		return err;
+	else {
+		return npolled;
+	}
+}
+
+int iwch_modify_cq(struct ib_cq *cq, int cqe)
+{
+	PDBG("iwch_modify_cq: TBD\n");
+	return 0;
+}

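A note on the retry loop in iwch_poll_cq(): the return-value contract (0 for empty, 1 for a CQE, -EAGAIN for retry, any other negative value fatal) can be checked in isolation with the sketch below. poll_many(), struct script and scripted_poll_one() are hypothetical helpers for this note; the scripted source merely replays a list of return codes in place of iwch_poll_cq_one(), and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback standing in for iwch_poll_cq_one(). */
typedef int (*poll_one_fn)(void *ctx, int *slot);

/*
 * Same control flow as the loop in iwch_poll_cq(): retry a slot while the
 * source says -EAGAIN, stop on empty (0) or on a fatal error (< 0), and
 * return either the error or the number of slots filled.
 */
static int poll_many(poll_one_fn poll_one, void *ctx, int *slots,
		     int num_entries)
{
	int npolled;
	int err = 0;

	assert(poll_one != NULL);
	for (npolled = 0; npolled < num_entries; ++npolled) {
		do {
			err = poll_one(ctx, slots + npolled);
		} while (err == -EAGAIN);
		if (err <= 0)
			break;
	}
	return err < 0 ? err : npolled;
}

/* Replays a fixed list of return codes, then reports an empty queue. */
struct script {
	const int *results;
	size_t len;
	size_t pos;
};

static int scripted_poll_one(void *ctx, int *slot)
{
	struct script *s = ctx;
	int r;

	if (s->pos >= s->len)
		return 0;
	r = s->results[s->pos++];
	if (r == 1)
		*slot = (int)s->pos;	/* record which step produced it */
	return r;
}
```

Replaying 1, -EAGAIN, -EAGAIN, 1, 0 with room for eight entries yields 2 completions, and the second one is produced by the fourth call. Replaying 1, -EINVAL returns -EINVAL even though one entry had already been filled, which matches the "if (err < 0) return err" tail of the patch.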

From swise at opengridcomputing.com  Sat Dec  2 14:50:28 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:28 -0600
Subject: [openib-general] [PATCH  v2 07/13] Async Event Handler
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225028.27014.27124.stgit@dell3.ogc.int>


Code to handle async events coming from the T3 RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_ev.c |  228 +++++++++++++++++++++++++++++++++
 1 files changed, 228 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c
new file mode 100644
index 0000000..bf767b2
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -0,0 +1,228 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/slab.h>
+#include <linux/mman.h>
+#include <net/sock.h>
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp,
+			  struct respQ_msg_t *rsp_msg,
+			  enum ib_event_type ib_event, 
+			  int send_term)
+{
+	struct ib_event event;
+	struct iwch_qp_attributes attrs;
+	struct iwch_qp *qhp;
+
+	printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x "
+	       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+	       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+	       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+	       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+
+	spin_lock(&rnicp->lock);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+
+	if (!qhp) {
+		printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", 
+		       __FUNCTION__, CQE_STATUS(rsp_msg->cqe), 
+		       CQE_QPID(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	if ((qhp->attr.state == IWCH_QP_STATE_ERROR) ||
+	    (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) {
+		PDBG("%s AE received after RTS - "
+		     "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, 
+		     qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	atomic_inc(&qhp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	event.event = ib_event;
+	event.device = chp->ibcq.device;
+	if (ib_event == IB_EVENT_CQ_ERR)
+		event.element.cq = &chp->ibcq;
+	else 
+		event.element.qp = &qhp->ibqp;
+
+	if (qhp->ibqp.event_handler)
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+
+	attrs.next_state = IWCH_QP_STATE_TERMINATE;
+	if (send_term && (qhp->attr.state == IWCH_QP_STATE_RTS) && 
+	    !iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 1))
+		iwch_post_terminate(qhp, rsp_msg);
+
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+}
+
+void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
+{
+	struct iwch_dev *rnicp;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	struct iwch_cq *chp;
+	struct iwch_qp *qhp;
+	u32 cqid = RSPQ_CQID(rsp_msg);
+
+	rnicp = (struct iwch_dev *) rdev_p->ulp;
+	spin_lock(&rnicp->lock);
+	chp = get_chp(rnicp, cqid);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+	if (!chp || !qhp) {
+		printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d "
+		       "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", 
+		       cqid, CQE_QPID(rsp_msg->cqe), 
+		       CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+		       CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), 
+		       CQE_WRID_LOW(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		goto out;
+	}
+	iwch_qp_add_ref(&qhp->ibqp);
+	atomic_inc(&chp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	/*
+	 * A successful TERMINATE CQE is either (SQ) the completion of a
+	 * TERMINATE we sent, or (RQ) an incoming TERMINATE from the peer.
+	 */
+	if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && 
+	    (CQE_STATUS(rsp_msg->cqe) == 0)) {
+		if (SQ_TYPE(rsp_msg->cqe)) {
+			PDBG("%s QPID 0x%x ep %p disconnecting\n", 
+			     __FUNCTION__, qhp->wq.qpid, qhp->ep);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		} else {
+			PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, 
+			     qhp->wq.qpid);
+			post_qp_event(rnicp, chp, rsp_msg, 
+				      IB_EVENT_QP_REQ_ERR, 0);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		}
+		goto done;
+	}
+
+	/* Bad incoming Read request */
+	if (SQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	/* Bad incoming write */
+	if (RQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	switch (CQE_STATUS(rsp_msg->cqe)) {
+
+	/* Completion Events */
+	case TPT_ERR_SUCCESS:
+
+		/* 
+		 * Confirm the destination entry if this is a RECV completion.
+		 */
+		if (qhp->ep && SQ_TYPE(rsp_msg->cqe))
+			dst_confirm(qhp->ep->dst);
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		break;
+
+	case TPT_ERR_STAG:
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+	case TPT_ERR_WRAP:
+	case TPT_ERR_BOUND:
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x "
+		       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+		       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+		       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+		       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1);
+		break;
+
+	/* Device Fatal Errors */
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1);
+		break;
+	
+	/* QP Fatal Errors */
+	case TPT_ERR_OUT_OF_RQE:
+	case TPT_ERR_PBL_ADDR_BOUND:
+	case TPT_ERR_CRC:
+	case TPT_ERR_MARKER:
+	case TPT_ERR_PDU_LEN_ERR:
+	case TPT_ERR_DDP_VERSION:
+	case TPT_ERR_RDMA_VERSION:
+	case TPT_ERR_OPCODE:
+	case TPT_ERR_DDP_QUEUE_NUM:
+	case TPT_ERR_MSN:
+	case TPT_ERR_TBIT:
+	case TPT_ERR_MO:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_RQE_ADDR_BOUND:
+	case TPT_ERR_IRD_OVERFLOW:
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+
+	default:
+		printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", 
+		       CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+	}
+done:
+	if (atomic_dec_and_test(&chp->refcnt))
+		wake_up(&chp->wait);
+	iwch_qp_rem_ref(&qhp->ibqp);
+out:
+	dev_kfree_skb_irq(skb);
+}
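
To make the status switch at the end of iwch_ev_dispatch() easier to review, here is a small standalone sketch of the same grouping. The status values are copied from the TPT_ERR_* defines in cxio_wr.h (patch 09/13); enum toy_ae_class and toy_classify() are hypothetical names, and the earlier TERMINATE, bad-read and bad-write checks are deliberately not modelled.

```c
#include <assert.h>

/* Subset of the TPT_ERR_* status codes from cxio_wr.h. */
#define TPT_ERR_SUCCESS				0x0
#define TPT_ERR_STAG				0x1
#define TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND	0x8
#define TPT_ERR_ECC				0x9
#define TPT_ERR_ECC_PSTAG			0xA
#define TPT_ERR_INTERNAL_ERR			0x1F

/* Hypothetical event classes, one per group in the switch. */
enum toy_ae_class {
	TOY_AE_COMPLETION,
	TOY_AE_QP_ACCESS_ERR,
	TOY_AE_DEVICE_FATAL,
	TOY_AE_QP_FATAL
};

static enum toy_ae_class toy_classify(unsigned int status)
{
	if (status == TPT_ERR_SUCCESS)
		return TOY_AE_COMPLETION;
	/* 0x1..0x8 are the STag, PD, QP, access and bounds errors. */
	if (status >= TPT_ERR_STAG &&
	    status <= TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND)
		return TOY_AE_QP_ACCESS_ERR;
	if (status == TPT_ERR_ECC || status == TPT_ERR_ECC_PSTAG ||
	    status == TPT_ERR_INTERNAL_ERR)
		return TOY_AE_DEVICE_FATAL;
	/* Everything else, including codes the switch does not list,
	 * takes the QP-fatal path. */
	return TOY_AE_QP_FATAL;
}
```

One thing the sketch makes visible: TPT_ERR_SWFLUSH (0xC) has no case label in the switch, so it takes the default branch and is reported as QP fatal.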


From swise at opengridcomputing.com  Sat Dec  2 14:50:38 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:38 -0600
Subject: [openib-general] [PATCH  v2 08/13] Memory Registration
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225038.27014.90811.stgit@dell3.ogc.int>


Functions to register memory regions.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_mem.c |  170 ++++++++++++++++++++++++++++++++
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c
new file mode 100644
index 0000000..774d11e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	if (cxio_register_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	/* Reregistering with more pages than the PBL holds is unsupported. */
+	if (npages > mhp->attr.pbl_size)
+		return -ENOMEM;
+
+	stag = mhp->attr.stag;
+	if (cxio_reregister_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list)
+{
+	u64 mask;
+	int i, j, n;
+
+	mask = 0;
+	*total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
+			return -EINVAL;
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return -EINVAL;
+		*total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	if (*total_size > 0xFFFFFFFFULL)
+		return -ENOMEM;
+
+	/* Find largest page shift we can use to cover buffers */
+	for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift))
+		if (num_phys_buf > 1) {
+			if ((1ULL << *shift) & mask)
+				break;
+		} else 
+			if (1ULL << *shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << *shift) - 1)))
+				break;
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1);
+	buffer_list[0].addr &= ~0ull << *shift;
+
+	*npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		*npages += (buffer_list[i].size + 
+			(1ULL << *shift) - 1) >> *shift;
+
+	if (!*npages)
+		return -EINVAL;
+
+	*page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL);
+	if (!*page_list)
+		return -ENOMEM;
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift;
+		     ++j) 
+			(*page_list)[n++] = cpu_to_be64(buffer_list[i].addr +
+			    ((u64) j << *shift));
+
+	PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n",
+	     __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages);
+
+	return 0;
+
+}
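
The shift selection in build_phys_page_list() above is compact, so here is a userspace sketch of just that arithmetic. It assumes 4 KiB base pages (PAGE_SHIFT == 12); struct toy_buf, toy_page_shift() and toy_npages() are hypothetical names, and the alignment checks and the page-list allocation are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT	12	/* assumption: 4 KiB base pages */
#define TOY_MAX_SHIFT	27	/* upper bound used by the loop in the patch */

struct toy_buf {
	uint64_t addr;
	uint64_t size;
};

/* Mirrors the shift loop: with several buffers, stop at the lowest address
 * bit set in any buffer after the first; with one buffer, stop at the first
 * shift whose page covers the buffer plus its offset into that page. */
static int toy_page_shift(const struct toy_buf *b, int n)
{
	uint64_t mask = 0;
	int shift, i;

	for (i = 1; i < n; ++i)
		mask |= b[i].addr;

	for (shift = TOY_PAGE_SHIFT; shift < TOY_MAX_SHIFT; ++shift) {
		if (n > 1) {
			if ((1ULL << shift) & mask)
				break;
		} else if ((1ULL << shift) >=
			   b[0].size + (b[0].addr & ((1ULL << shift) - 1))) {
			break;
		}
	}
	return shift;
}

/* Counts pages of size (1 << shift), rounding every buffer up. */
static int toy_npages(const struct toy_buf *b, int n, int shift)
{
	int npages = 0;
	int i;

	for (i = 0; i < n; ++i) {
		uint64_t size = b[i].size;

		/* The first buffer is extended down to a page boundary. */
		if (i == 0)
			size += b[0].addr & ((1ULL << shift) - 1);
		npages += (int)((size + (1ULL << shift) - 1) >> shift);
	}
	return npages;
}
```

For a single 0x3000-byte buffer at 0x10000100 this yields shift 14 and one page; for three 16 KiB buffers at 0x100000, 0x204000 and 0x308000 it yields shift 14 and three pages.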


From swise at opengridcomputing.com  Sat Dec  2 14:50:48 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:48 -0600
Subject: [openib-general] [PATCH  v2 09/13] Core WQE/CQE Types
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225048.27014.69535.stgit@dell3.ogc.int>


T3 WQE and CQE structures, field-access macros, and inline queue helpers.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_wr.h |  685 ++++++++++++++++++++++++++++
 1 files changed, 685 insertions(+), 0 deletions(-)
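
Note for reviewers: the header in this patch describes each hardware field with a family of macros, S_ (shift), M_ (mask), V_ (place a value) and G_ (extract a value). A minimal sketch of how a CQE header word is put together and taken apart, using shift/mask pairs copied from the header; TOY_V(), TOY_G() and toy_cqe_header() are hypothetical, the sketch works on a host-endian word (the driver applies be32_to_cpu() first), and unlike the V_ macros in the header TOY_V() also masks its argument.

```c
#include <assert.h>
#include <stdint.h>

/* Shift/mask pairs copied from the CQE definitions in cxio_wr.h. */
#define S_CQE_QPID	12
#define M_CQE_QPID	0x7FFFF
#define S_CQE_STATUS	5
#define M_CQE_STATUS	0x1F
#define S_CQE_TYPE	4
#define M_CQE_TYPE	0x1
#define S_CQE_OPCODE	0
#define M_CQE_OPCODE	0xF

/* Generic pack/extract helpers (hypothetical; the header spells out one
 * V_/G_ macro per field instead of pasting the field name). */
#define TOY_V(field, x)	(((uint32_t)(x) & M_##field) << S_##field)
#define TOY_G(field, w)	(((uint32_t)(w) >> S_##field) & M_##field)

/* Builds a host-endian CQE header word from its individual fields. */
static uint32_t toy_cqe_header(uint32_t qpid, uint32_t status,
			       uint32_t type, uint32_t opcode)
{
	return TOY_V(CQE_QPID, qpid) | TOY_V(CQE_STATUS, status) |
	       TOY_V(CQE_TYPE, type) | TOY_V(CQE_OPCODE, opcode);
}
```

The CQE_QPID(), CQE_STATUS(), CQE_TYPE() and CQE_OPCODE() accessors in the header are the G_ direction of this with the byte swap folded in.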

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
new file mode 100644
index 0000000..45870be
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
@@ -0,0 +1,685 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_WR_H__
+#define __CXIO_WR_H__
+
+#include <asm/io.h>
+#include <linux/pci.h>
+#include <linux/timer.h>
+#include "firmware_exports.h"
+
+#define T3_MAX_SGE      4
+
+#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr))
+#define Q_FULL(rptr,wptr,size_log2)  ( (((wptr)-(rptr))>>(size_log2)) && \
+				       ((rptr)!=(wptr)) )
+#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>(size_log2))&0x1))
+#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<<(size_log2))-((wptr)-(rptr)))
+#define Q_COUNT(rptr,wptr) ((wptr)-(rptr))
+#define Q_PTR2IDX(ptr,size_log2) ((ptr) & ((1UL<<(size_log2))-1))
+
+static inline void ring_doorbell(void __iomem *doorbell, u32 qpid) 
+{
+	writel(((1U << 31) | qpid), doorbell);
+}
+
+#define SEQ32_GE(x,y) (!( (((u32) (x)) - ((u32) (y))) & 0x80000000 ))
+
+enum t3_wr_flags {
+	T3_COMPLETION_FLAG = 0x01,
+	T3_NOTIFY_FLAG = 0x02,
+	T3_SOLICITED_EVENT_FLAG = 0x04,
+	T3_READ_FENCE_FLAG = 0x08,
+	T3_LOCAL_FENCE_FLAG = 0x10
+} __attribute__ ((packed));
+
+enum t3_wr_opcode {
+	T3_WR_BP = FW_WROPCODE_RI_BYPASS,
+	T3_WR_SEND = FW_WROPCODE_RI_SEND,
+	T3_WR_WRITE = FW_WROPCODE_RI_RDMA_WRITE,
+	T3_WR_READ = FW_WROPCODE_RI_RDMA_READ,
+	T3_WR_INV_STAG = FW_WROPCODE_RI_LOCAL_INV,
+	T3_WR_BIND = FW_WROPCODE_RI_BIND_MW,
+	T3_WR_RCV = FW_WROPCODE_RI_RECEIVE,
+	T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT,
+	T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP
+} __attribute__ ((packed));
+
+enum t3_rdma_opcode {
+	T3_RDMA_WRITE,		/* IETF RDMAP v1.0 ... */
+	T3_READ_REQ,
+	T3_READ_RESP,
+	T3_SEND,
+	T3_SEND_WITH_INV,
+	T3_SEND_WITH_SE,
+	T3_SEND_WITH_SE_INV,
+	T3_TERMINATE,
+	T3_RDMA_INIT,		/* CHELSIO RI specific ... */
+	T3_BIND_MW,
+	T3_FAST_REGISTER,
+	T3_LOCAL_INV,
+	T3_QP_MOD,
+	T3_BYPASS
+} __attribute__ ((packed));
+
+static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop)
+{
+	switch (wrop) {
+		case T3_WR_BP: return T3_BYPASS;
+		case T3_WR_SEND: return T3_SEND;
+		case T3_WR_WRITE: return T3_RDMA_WRITE;
+		case T3_WR_READ: return T3_READ_REQ;
+		case T3_WR_INV_STAG: return T3_LOCAL_INV;
+		case T3_WR_BIND: return T3_BIND_MW;
+		case T3_WR_INIT: return T3_RDMA_INIT;
+		case T3_WR_QP_MOD: return T3_QP_MOD;
+		default: break;
+	}
+	return -1;
+}
+
+
+/* Work request id */
+union t3_wrid {
+	struct {
+		u32 hi;
+		u32 low;
+	} id0;
+	u64 id1;
+};
+
+#define WRID(wrid)      	(wrid.id1)
+#define WRID_GEN(wrid)		(wrid.id0.wr_gen)
+#define WRID_IDX(wrid)		(wrid.id0.wr_idx)
+#define WRID_LO(wrid)		(wrid.id0.wr_lo)
+
+struct fw_riwrh {
+	__be32 op_seop_flags;
+	__be32 gen_tid_len;
+};
+
+#define S_FW_RIWR_OP		24
+#define M_FW_RIWR_OP		0xff
+#define V_FW_RIWR_OP(x)		((x) << S_FW_RIWR_OP)
+#define G_FW_RIWR_OP(x)   	((((x) >> S_FW_RIWR_OP)) & M_FW_RIWR_OP)
+
+#define S_FW_RIWR_SOPEOP	22
+#define M_FW_RIWR_SOPEOP	0x3
+#define V_FW_RIWR_SOPEOP(x)	((x) << S_FW_RIWR_SOPEOP)
+
+#define S_FW_RIWR_FLAGS		8
+#define M_FW_RIWR_FLAGS		0x3fffff
+#define V_FW_RIWR_FLAGS(x)	((x) << S_FW_RIWR_FLAGS)
+#define G_FW_RIWR_FLAGS(x)   	((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS)
+
+#define S_FW_RIWR_TID		8
+#define V_FW_RIWR_TID(x)	((x) << S_FW_RIWR_TID)
+
+#define S_FW_RIWR_LEN		0
+#define V_FW_RIWR_LEN(x)	((x) << S_FW_RIWR_LEN)
+
+#define S_FW_RIWR_GEN           31
+#define V_FW_RIWR_GEN(x)        ((x)  << S_FW_RIWR_GEN)
+
+struct t3_sge {
+	__be32 stag;
+	__be32 len;
+	__be64 to;
+};
+
+/* If num_sgle is zero, flit 5+ contains immediate data. */
+struct t3_send_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;	
+	__be32 plen;		/* 3 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 4+ */
+};
+
+struct t3_local_inv_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 stag;		/* 2 */
+	__be32 reserved3;
+};
+
+struct t3_rdma_write_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 stag_sink;
+	__be64 to_sink;		/* 3 */
+	__be32 plen;		/* 4 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 5+ */
+};
+
+struct t3_rdma_read_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;
+	__be64 rem_to;		/* 3 */
+	__be32 local_stag;	/* 4 */
+	__be32 local_len;
+	__be64 local_to;	/* 5 */
+};
+
+enum t3_addr_type {
+	T3_VA_BASED_TO = 0x0,
+	T3_ZERO_BASED_TO = 0x1
+} __attribute__ ((packed));
+
+enum t3_mem_perms {
+	T3_MEM_ACCESS_LOCAL_READ = 0x1,
+	T3_MEM_ACCESS_LOCAL_WRITE = 0x2,
+	T3_MEM_ACCESS_REM_READ = 0x4,
+	T3_MEM_ACCESS_REM_WRITE = 0x8
+} __attribute__ ((packed));
+
+struct t3_bind_mw_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u16 reserved;		/* 2 */
+	u8 type;
+	u8 perms;
+	__be32 mr_stag;
+	__be32 mw_stag;		/* 3 */
+	__be32 mw_len;
+	__be64 mw_va;		/* 4 */
+	__be32 mr_pbl_addr;	/* 5 */
+	u8 reserved2[3];
+	u8 mr_pagesz;
+};
+
+struct t3_receive_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 pagesz[T3_MAX_SGE];
+	__be32 num_sgle;		/* 2 */
+	struct t3_sge sgl[T3_MAX_SGE];	/* 3+ */
+	__be32 pbl_addr[T3_MAX_SGE];
+};
+
+struct t3_bypass_wr {
+	struct fw_riwrh wrh;
+	union t3_wrid wrid;	/* 1 */
+};
+
+struct t3_modify_qp_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 flags;		/* 2 */
+	__be32 quiesce;		/* 2 */
+	__be32 max_ird;		/* 3 */
+	__be32 max_ord;		/* 3 */
+	__be64 sge_cmd;		/* 4 */
+	__be64 ctx1;		/* 5 */
+	__be64 ctx0;		/* 6 */
+};
+
+enum t3_modify_qp_flags {
+	MODQP_QUIESCE  = 0x01,
+	MODQP_MAX_IRD  = 0x02,
+	MODQP_MAX_ORD  = 0x04,
+	MODQP_WRITE_EC = 0x08,
+	MODQP_READ_EC  = 0x10,
+};
+	
+
+enum t3_mpa_attrs {
+	uP_RI_MPA_RX_MARKER_ENABLE = 0x1,
+	uP_RI_MPA_TX_MARKER_ENABLE = 0x2,
+	uP_RI_MPA_CRC_ENABLE = 0x4,
+	uP_RI_MPA_IETF_ENABLE = 0x8
+} __attribute__ ((packed));
+
+enum t3_qp_caps {
+	uP_RI_QP_RDMA_READ_ENABLE = 0x01,
+	uP_RI_QP_RDMA_WRITE_ENABLE = 0x02,
+	uP_RI_QP_BIND_ENABLE = 0x04,
+	uP_RI_QP_FAST_REGISTER_ENABLE = 0x08,
+	uP_RI_QP_STAG0_ENABLE = 0x10
+} __attribute__ ((packed));
+
+struct t3_rdma_init_attr {
+	u32 tid;
+	u32 qpid;
+	u32 pdid;
+	u32 scqid;
+	u32 rcqid;
+	u32 rq_addr;
+	u32 rq_size;
+	enum t3_mpa_attrs mpaattrs;
+	enum t3_qp_caps qpcaps;
+	u16 tcp_emss;
+	u32 ord;
+	u32 ird;
+	u64 qp_dma_addr;
+	u32 qp_dma_size;
+	u32 flags;
+};
+
+struct t3_rdma_init_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 qpid;		/* 2 */
+	__be32 pdid;
+	__be32 scqid;		/* 3 */
+	__be32 rcqid;
+	__be32 rq_addr;		/* 4 */
+	__be32 rq_size;
+	u8 mpaattrs;		/* 5 */
+	u8 qpcaps;
+	__be16 ulpdu_size;
+	__be32 flags;		/* bits 31-1 - reserved */
+				/* bit     0 - set if RECV posted */
+	__be32 ord;		/* 6 */
+	__be32 ird;
+	__be64 qp_dma_addr;	/* 7 */
+	__be32 qp_dma_size;	/* 8 */
+	u32 rsvd;
+};
+
+struct t3_genbit {
+	u64 flit[15];
+	__be64 genbit;
+};
+
+enum rdma_init_wr_flags {
+	RECVS_POSTED = 1,
+};
+
+union t3_wr {
+	struct t3_send_wr send;
+	struct t3_rdma_write_wr write;
+	struct t3_rdma_read_wr read;
+	struct t3_receive_wr recv;
+	struct t3_local_inv_wr local_inv;
+	struct t3_bind_mw_wr bind;
+	struct t3_bypass_wr bypass;
+	struct t3_rdma_init_wr init;
+	struct t3_modify_qp_wr qp_mod;
+	struct t3_genbit genbit;
+	u64 flit[16];
+};
+
+#define T3_SQ_CQE_FLIT 	  13
+#define T3_SQ_COOKIE_FLIT 14
+
+#define T3_RQ_COOKIE_FLIT 13
+#define T3_RQ_CQE_FLIT 	  14
+
+static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe)
+{
+	return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags));
+}
+
+static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op,
+				  enum t3_wr_flags flags, u8 genbit, u32 tid,
+				  u8 len)
+{
+	wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) |
+					 V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) |
+					 V_FW_RIWR_FLAGS(flags));
+	wmb();
+	wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) |
+				       V_FW_RIWR_TID(tid) |
+				       V_FW_RIWR_LEN(len));
+	/* 2nd gen bit... */
+	((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit);
+}
+
+/*
+ * T3 ULP2_TX commands
+ */
+enum t3_utx_mem_op {
+	T3_UTX_MEM_READ = 2,
+	T3_UTX_MEM_WRITE = 3
+};
+
+/* T3 MC7 RDMA TPT entry format */
+
+enum tpt_mem_type {
+	TPT_NON_SHARED_MR = 0x0,
+	TPT_SHARED_MR = 0x1,
+	TPT_MW = 0x2,
+	TPT_MW_RELAXED_PROTECTION = 0x3
+};
+
+enum tpt_addr_type {
+	TPT_ZBTO = 0,
+	TPT_VATO = 1
+};
+
+enum tpt_mem_perm {
+	TPT_LOCAL_READ = 0x8,
+	TPT_LOCAL_WRITE = 0x4,
+	TPT_REMOTE_READ = 0x2,
+	TPT_REMOTE_WRITE = 0x1
+};
+
+struct tpt_entry {
+	__be32 valid_stag_pdid;
+	__be32 flags_pagesize_qpid;
+
+	__be32 rsvd_pbl_addr;
+	__be32 len;
+	__be32 va_hi;
+	__be32 va_low_or_fbo;
+
+	__be32 rsvd_bind_cnt_or_pstag;
+	__be32 rsvd_pbl_size;
+};
+
+#define S_TPT_VALID		31
+#define V_TPT_VALID(x)		((x) << S_TPT_VALID)
+#define F_TPT_VALID		V_TPT_VALID(1U)
+
+#define S_TPT_STAG_KEY		23
+#define M_TPT_STAG_KEY		0xFF
+#define V_TPT_STAG_KEY(x)	((x) << S_TPT_STAG_KEY)
+#define G_TPT_STAG_KEY(x)	(((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY)
+
+#define S_TPT_STAG_STATE	22
+#define V_TPT_STAG_STATE(x)	((x) << S_TPT_STAG_STATE)
+#define F_TPT_STAG_STATE	V_TPT_STAG_STATE(1U)
+
+#define S_TPT_STAG_TYPE		20
+#define M_TPT_STAG_TYPE		0x3
+#define V_TPT_STAG_TYPE(x)	((x) << S_TPT_STAG_TYPE)
+#define G_TPT_STAG_TYPE(x)	(((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE)
+
+#define S_TPT_PDID		0
+#define M_TPT_PDID		0xFFFFF
+#define V_TPT_PDID(x)		((x) << S_TPT_PDID)
+#define G_TPT_PDID(x)		(((x) >> S_TPT_PDID) & M_TPT_PDID)
+
+#define S_TPT_PERM		28
+#define M_TPT_PERM		0xF
+#define V_TPT_PERM(x)		((x) << S_TPT_PERM)
+#define G_TPT_PERM(x)		(((x) >> S_TPT_PERM) & M_TPT_PERM)
+
+#define S_TPT_REM_INV_DIS	27
+#define V_TPT_REM_INV_DIS(x)	((x) << S_TPT_REM_INV_DIS)
+#define F_TPT_REM_INV_DIS	V_TPT_REM_INV_DIS(1U)
+
+#define S_TPT_ADDR_TYPE		26
+#define V_TPT_ADDR_TYPE(x)	((x) << S_TPT_ADDR_TYPE)
+#define F_TPT_ADDR_TYPE		V_TPT_ADDR_TYPE(1U)
+
+#define S_TPT_MW_BIND_ENABLE	25
+#define V_TPT_MW_BIND_ENABLE(x)	((x) << S_TPT_MW_BIND_ENABLE)
+#define F_TPT_MW_BIND_ENABLE    V_TPT_MW_BIND_ENABLE(1U)
+
+#define S_TPT_PAGE_SIZE		20
+#define M_TPT_PAGE_SIZE		0x1F
+#define V_TPT_PAGE_SIZE(x)	((x) << S_TPT_PAGE_SIZE)
+#define G_TPT_PAGE_SIZE(x)	(((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE)
+
+#define S_TPT_PBL_ADDR		0
+#define M_TPT_PBL_ADDR		0x1FFFFFFF
+#define V_TPT_PBL_ADDR(x)	((x) << S_TPT_PBL_ADDR)
+#define G_TPT_PBL_ADDR(x)       (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR)
+
+#define S_TPT_QPID		0
+#define M_TPT_QPID		0xFFFFF
+#define V_TPT_QPID(x)		((x) << S_TPT_QPID)
+#define G_TPT_QPID(x)		(((x) >> S_TPT_QPID) & M_TPT_QPID)
+
+#define S_TPT_PSTAG		0
+#define M_TPT_PSTAG		0xFFFFFF
+#define V_TPT_PSTAG(x)		((x) << S_TPT_PSTAG)
+#define G_TPT_PSTAG(x)		(((x) >> S_TPT_PSTAG) & M_TPT_PSTAG)
+
+#define S_TPT_PBL_SIZE		0
+#define M_TPT_PBL_SIZE		0xFFFFF
+#define V_TPT_PBL_SIZE(x)	((x) << S_TPT_PBL_SIZE)
+#define G_TPT_PBL_SIZE(x)	(((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE)
+
+/*
+ * CQE defs
+ */
+struct t3_cqe {
+	__be32 header;
+	__be32 len;
+	union {
+		struct {
+			__be32 stag;
+			__be32 msn;
+		} rcqe;
+		struct {
+			u32 wrid_hi;	
+			u32 wrid_low;
+		} scqe;
+	} u;
+};
+
+#define S_CQE_OOO	  31
+#define M_CQE_OOO	  0x1
+#define G_CQE_OOO(x)	  ((((x) >> S_CQE_OOO)) & M_CQE_OOO)
+#define V_CQE_OOO(x)	  ((x)<<S_CQE_OOO)
+
+#define S_CQE_QPID        12
+#define M_CQE_QPID        0x7FFFF
+#define G_CQE_QPID(x)     ((((x) >> S_CQE_QPID)) & M_CQE_QPID)
+#define V_CQE_QPID(x) 	  ((x)<<S_CQE_QPID)
+
+#define S_CQE_SWCQE       11
+#define M_CQE_SWCQE       0x1
+#define G_CQE_SWCQE(x)    ((((x) >> S_CQE_SWCQE)) & M_CQE_SWCQE)
+#define V_CQE_SWCQE(x) 	  ((x)<<S_CQE_SWCQE)
+
+#define S_CQE_GENBIT      10
+#define M_CQE_GENBIT      0x1
+#define G_CQE_GENBIT(x)   (((x) >> S_CQE_GENBIT) & M_CQE_GENBIT)
+#define V_CQE_GENBIT(x)	  ((x)<<S_CQE_GENBIT)
+
+#define S_CQE_STATUS      5
+#define M_CQE_STATUS      0x1F
+#define G_CQE_STATUS(x)   ((((x) >> S_CQE_STATUS)) & M_CQE_STATUS)
+#define V_CQE_STATUS(x)   ((x)<<S_CQE_STATUS)
+
+#define S_CQE_TYPE        4
+#define M_CQE_TYPE        0x1
+#define G_CQE_TYPE(x)     ((((x) >> S_CQE_TYPE)) & M_CQE_TYPE)
+#define V_CQE_TYPE(x)     ((x)<<S_CQE_TYPE)
+
+#define S_CQE_OPCODE      0
+#define M_CQE_OPCODE      0xF
+#define G_CQE_OPCODE(x)   ((((x) >> S_CQE_OPCODE)) & M_CQE_OPCODE)
+#define V_CQE_OPCODE(x)   ((x)<<S_CQE_OPCODE)
+
+#define SW_CQE(x)         (G_CQE_SWCQE(be32_to_cpu((x).header)))
+#define CQE_OOO(x)        (G_CQE_OOO(be32_to_cpu((x).header)))
+#define CQE_QPID(x)       (G_CQE_QPID(be32_to_cpu((x).header)))
+#define CQE_GENBIT(x)     (G_CQE_GENBIT(be32_to_cpu((x).header)))
+#define CQE_TYPE(x)       (G_CQE_TYPE(be32_to_cpu((x).header)))
+#define SQ_TYPE(x)	  (CQE_TYPE((x)))
+#define RQ_TYPE(x)	  (!CQE_TYPE((x)))
+#define CQE_STATUS(x)     (G_CQE_STATUS(be32_to_cpu((x).header)))
+#define CQE_OPCODE(x)     (G_CQE_OPCODE(be32_to_cpu((x).header)))
+
+#define CQE_LEN(x)        (be32_to_cpu((x).len))
+
+/* used for RQ completion processing */
+#define CQE_WRID_STAG(x)  (be32_to_cpu((x).u.rcqe.stag))
+#define CQE_WRID_MSN(x)   (be32_to_cpu((x).u.rcqe.msn))
+
+/* used for SQ completion processing */
+#define CQE_WRID_SQ_WPTR(x)	((x).u.scqe.wrid_hi)
+#define CQE_WRID_WPTR(x)   	((x).u.scqe.wrid_low)
+
+/* generic accessor macros */
+#define CQE_WRID_HI(x)		((x).u.scqe.wrid_hi)
+#define CQE_WRID_LOW(x) 	((x).u.scqe.wrid_low)
+
+#define TPT_ERR_SUCCESS                     0x0
+#define TPT_ERR_STAG                        0x1	 /* STAG invalid: either the */
+						 /* STAG is off-limits, being 0, */
+						 /* or STAG_key mismatch */
+#define TPT_ERR_PDID                        0x2	 /* PDID mismatch */
+#define TPT_ERR_QPID                        0x3	 /* QPID mismatch */
+#define TPT_ERR_ACCESS                      0x4	 /* Invalid access right */
+#define TPT_ERR_WRAP                        0x5	 /* Wrap error */
+#define TPT_ERR_BOUND                       0x6	 /* base and bounds violation */
+#define TPT_ERR_INVALIDATE_SHARED_MR        0x7	 /* attempt to invalidate a  */
+						 /* shared memory region */
+#define TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND 0x8	 /* attempt to invalidate a  */
+						 /* MR with bound MWs */
+#define TPT_ERR_ECC                         0x9	 /* ECC error detected */
+#define TPT_ERR_ECC_PSTAG                   0xA	 /* ECC error detected when  */
+						 /* reading PSTAG for a MW  */
+						 /* Invalidate */
+#define TPT_ERR_PBL_ADDR_BOUND              0xB	 /* pbl addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_SWFLUSH			    0xC	 /* SW FLUSHED */
+#define TPT_ERR_CRC                         0x10 /* CRC error */
+#define TPT_ERR_MARKER                      0x11 /* Marker error */
+#define TPT_ERR_PDU_LEN_ERR                 0x12 /* invalid PDU length */
+#define TPT_ERR_OUT_OF_RQE                  0x13 /* out of RQE */
+#define TPT_ERR_DDP_VERSION                 0x14 /* wrong DDP version */
+#define TPT_ERR_RDMA_VERSION                0x15 /* wrong RDMA version */
+#define TPT_ERR_OPCODE                      0x16 /* invalid rdma opcode */
+#define TPT_ERR_DDP_QUEUE_NUM               0x17 /* invalid ddp queue number */
+#define TPT_ERR_MSN                         0x18 /* MSN error */
+#define TPT_ERR_TBIT                        0x19 /* tag bit not set correctly */
+#define TPT_ERR_MO                          0x1A /* MO not 0 for TERMINATE  */
+						 /* or READ_REQ */
+#define TPT_ERR_MSN_GAP                     0x1B
+#define TPT_ERR_MSN_RANGE                   0x1C
+#define TPT_ERR_IRD_OVERFLOW                0x1D
+#define TPT_ERR_RQE_ADDR_BOUND              0x1E /* RQE addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_INTERNAL_ERR                0x1F /* internal error (opcode  */
+						 /* mismatch) */
+
+struct t3_swsq {
+	__u64 			wr_id;
+	struct t3_cqe 		cqe;
+	__u32			sq_wptr;
+	__be32			read_len;
+	int 			opcode;
+	int			complete;
+	int			signaled;	
+};
+
+/*
+ * A T3 WQ implements both the SQ and RQ.
+ */
+struct t3_wq {
+	union t3_wr *queue;		/* DMA accessible memory */
+	dma_addr_t dma_addr;		/* DMA address for HW */
+	DECLARE_PCI_UNMAP_ADDR(mapping)	/* unmap cruft */
+	u32 error;			/* 1 once we go to ERROR */
+	u32 qpid;
+	u32 wptr;			/* idx to next available WR slot */
+	u32 size_log2;			/* total wq size */
+	struct t3_swsq *sq;		/* SW SQ */
+	struct t3_swsq *oldest_read;	/* tracks oldest pending read */
+	u32 sq_wptr;			/* sq_wptr - sq_rptr == count of */
+	u32 sq_rptr;			/* pending wrs */
+	u32 sq_size_log2;		/* sq size */
+	u64 *rq;			/* SW RQ (holds consumer wr_ids) */
+	u32 rq_wptr;			/* rq_wptr - rq_rptr == count of */
+	u32 rq_rptr;			/* pending wrs */
+	u64 *rq_oldest_wr;		/* oldest wr on the SW RQ */
+	u32 rq_size_log2;		/* rq size */
+	u32 rq_addr;			/* rq adapter address */
+	void __iomem *doorbell;		/* kernel db */
+	u64 udb;			/* user db if any */
+};
+
+struct t3_cq {
+	u32 cqid;
+	u32 rptr;
+	u32 wptr;
+	u32 size_log2;
+	dma_addr_t dma_addr;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct t3_cqe *queue;
+	struct t3_cqe *sw_queue;
+	u32 sw_rptr;
+	u32 sw_wptr;
+};
+
+#define CQ_VLD_ENTRY(ptr,size_log2,cqe) (Q_GENBIT(ptr,size_log2) == \
+					 CQE_GENBIT(*cqe))
+
+static inline void cxio_set_wq_in_error(struct t3_wq *wq)
+{
+	wq->queue->flit[13] = 1;
+}
+
+static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+#endif
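
A note on the generation-bit check used by CQ_VLD_ENTRY() and
cxio_next_hw_cqe() above: the read pointer is free-running, its low
size_log2 bits select the slot, and the next bit up selects the
generation expected in that slot. The following is only a minimal
user-space sketch of that idea, not the driver code. All demo_* names
are hypothetical, and the two helpers are assumed stand-ins for
Q_PTR2IDX() and Q_GENBIT(), whose real definitions live elsewhere in
this patch series and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for Q_PTR2IDX(): low size_log2 bits pick the slot. */
static inline uint32_t demo_ptr2idx(uint32_t ptr, uint32_t size_log2)
{
	assert(size_log2 < 32);
	return ptr & ((1U << size_log2) - 1);
}

/*
 * Assumed stand-in for Q_GENBIT(): the bit just above the index flips
 * on every wrap.  It is inverted here so that a zeroed ring holds no
 * valid entries on the first pass.
 */
static inline uint32_t demo_genbit(uint32_t ptr, uint32_t size_log2)
{
	assert(size_log2 < 32);
	return !((ptr >> size_log2) & 1);
}

struct demo_cqe {
	uint32_t genbit;	/* generation written by the producer */
	uint32_t payload;
};

struct demo_cq {
	uint32_t rptr;		/* consumer pointer, free-running */
	uint32_t wptr;		/* producer pointer, free-running */
	uint32_t size_log2;
	struct demo_cqe *queue;
};

/* Producer side (the hardware, in the real driver). */
static inline void demo_produce(struct demo_cq *cq, uint32_t payload)
{
	struct demo_cqe *cqe;

	cqe = cq->queue + demo_ptr2idx(cq->wptr, cq->size_log2);
	cqe->payload = payload;
	cqe->genbit = demo_genbit(cq->wptr, cq->size_log2);
	cq->wptr++;
}

/* Consumer side: same shape as cxio_next_hw_cqe(). */
static inline struct demo_cqe *demo_next_cqe(struct demo_cq *cq)
{
	struct demo_cqe *cqe;

	cqe = cq->queue + demo_ptr2idx(cq->rptr, cq->size_log2);
	if (demo_genbit(cq->rptr, cq->size_log2) == cqe->genbit)
		return cqe;
	return NULL;	/* slot still holds a stale generation */
}
```

The consumer never needs to read the producer's write pointer: an entry
left over from the previous pass carries the opposite generation bit
and is therefore rejected until the producer overwrites it.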


From swise at opengridcomputing.com  Sat Dec  2 14:50:58 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:50:58 -0600
Subject: [openib-general] [PATCH  v2 10/13] Core HAL
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225058.27014.33454.stgit@dell3.ogc.int>


The RDMA Core interfaces with the T3 HW and the ULLD, providing a
low-level RDMA interface.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_hal.h |  201 ++++
 2 files changed, 1503 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
new file mode 100644
index 0000000..367c834
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -0,0 +1,1302 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/semaphore.h>
+#include <asm/delay.h>
+
+#include <linux/netdevice.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+#include "sge_defs.h"
+
+static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC];
+static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL;
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (!strcmp(rdev_tbl[i]->dev_name, dev_name))
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev
+							     *tdev)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (rdev_tbl[i]->t3cdev_p == tdev)
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (!rdev_tbl[i]) {
+			rdev_tbl[i] = rdev_p;
+			break;
+		}
+	return (i == T3_MAX_NUM_RNIC);
+}
+
+static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i] == rdev_p) {
+			rdev_tbl[i] = NULL;
+			break;
+		}
+}
+
+int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, 
+		   enum t3_cq_opcode op, u32 credit)
+{
+	int ret;
+	struct t3_cqe *cqe;
+	u32 rptr;
+
+	struct rdma_cq_op setup;
+	setup.id = cq->cqid;
+	setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0;
+	setup.op = op;
+	ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup);
+
+	if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) 
+		return ret;
+
+	/*
+	 * If the rearm returned an index other than our current index,
+	 * then there might be CQE's in flight (being DMA'd).  We must wait
+	 * here for them to complete or the consumer can miss a notification.
+	 */
+	if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) {
+		int i=0;
+
+		rptr = cq->rptr;
+
+		/*
+		 * Keep the generation correct by bumping rptr until
+		 * rptr + 1 maps to the index returned by the rearm.
+		 */
+		while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret)
+			rptr++;
+
+		/* 
+		 * Now rptr is the index for the (last) cqe that was 
+	 	 * in-flight at the time the HW rearmed the CQ.  We 
+		 * spin until that CQE is valid.
+	 	 */
+		cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2);
+		while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) {
+			udelay(1);
+			if (i++ > 1000000) {
+				printk(KERN_ERR "%s: stalled rnic\n",
+				       rdev_p->dev_name);
+				WARN_ON(1);
+				return -EIO;
+			}
+		}
+	}
+	return 0;
+}
+
+static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cqid;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 0;		/* disable the CQ */
+	setup.credits = 0;
+	setup.credit_thres = 0;
+	setup.ovfl_mode = 0;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
+{
+	u64 sge_cmd;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = qpid << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe);
+
+	cq->cqid = cxio_hal_get_cqid(rdev_p->rscp);
+	if (!cq->cqid)
+		return -ENOMEM;
+	cq->sw_queue = kzalloc(size, GFP_KERNEL);
+	if (!cq->sw_queue)
+		return -ENOMEM;
+	cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     (1UL << (cq->size_log2)) *
+					     sizeof(struct t3_cqe),
+					     &(cq->dma_addr), GFP_KERNEL);
+	if (!cq->queue) {
+		kfree(cq->sw_queue);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(cq, mapping, cq->dma_addr);
+	memset(cq->queue, 0, size);
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = 65535;
+	setup.credit_thres = 1;
+	if (rdev_p->t3cdev_p->type == T3B)
+		setup.ovfl_mode = 0;
+	else
+		setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = setup.size;
+	setup.credit_thres = setup.size;	/* TBD: overflow recovery */
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	u32 qpid;
+	int i;
+
+	mutex_lock(&uctx->lock);
+	if (!list_empty(&uctx->qpids)) {
+		entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, 
+				   entry);
+		list_del(&entry->entry);
+		qpid = entry->qpid;
+		kfree(entry);
+	} else {
+		qpid = cxio_hal_get_qpid(rdev_p->rscp);
+		if (!qpid) 
+			goto out;
+		for (i = qpid+1; i & rdev_p->qpmask; i++) {
+			entry = kmalloc(sizeof *entry, GFP_KERNEL);
+			if (!entry)
+				break;
+			entry->qpid = i;
+			list_add_tail(&entry->entry, &uctx->qpids);
+		}
+	}
+out:
+	mutex_unlock(&uctx->lock);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, 
+		     struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	
+	entry = kmalloc(sizeof *entry, GFP_KERNEL);
+	if (!entry) 
+		return;
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	entry->qpid = qpid;
+	mutex_lock(&uctx->lock);
+	list_add_tail(&entry->entry, &uctx->qpids);
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct list_head *pos, *nxt;
+	struct cxio_qpid_list *entry;
+
+	mutex_lock(&uctx->lock);
+	list_for_each_safe(pos, nxt, &uctx->qpids) {
+		entry = list_entry(pos, struct cxio_qpid_list, entry);
+		list_del_init(&entry->entry);
+		if (!(entry->qpid & rdev_p->qpmask))
+			cxio_hal_put_qpid(rdev_p->rscp, entry->qpid);
+		kfree(entry);
+	}
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	INIT_LIST_HEAD(&uctx->qpids);
+	mutex_init(&uctx->lock);
+}
+
+int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain,
+		   struct t3_wq *wq, struct cxio_ucontext *uctx)
+{
+	int depth = 1UL << wq->size_log2;
+	int rqsize = 1UL << wq->rq_size_log2;
+
+	wq->qpid = get_qpid(rdev_p, uctx);
+	if (!wq->qpid)
+		return -ENOMEM;
+
+	wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL);
+	if (!wq->rq)
+		goto err1;
+
+	wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize);
+	if (!wq->rq_addr)
+		goto err2;
+
+	wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL);
+	if (!wq->sq)
+		goto err3;
+	
+	wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     depth * sizeof(union t3_wr),
+					     &(wq->dma_addr), GFP_KERNEL);
+	if (!wq->queue)
+		goto err4;
+
+	memset(wq->queue, 0, depth * sizeof(union t3_wr));
+	pci_unmap_addr_set(wq, mapping, wq->dma_addr);
+	wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	if (!kernel_domain)
+		wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + 
+					(wq->qpid << rdev_p->qpshift);
+	PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, 
+	     wq->qpid, wq->doorbell, wq->udb);
+	return 0;
+err4:
+	kfree(wq->sq);
+err3:
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize);
+err2:
+	kfree(wq->rq);
+err1:
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return -ENOMEM;
+}
+
+int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	int err;
+	err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid);
+	kfree(cq->sw_queue);
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (cq->size_log2))
+			  * sizeof(struct t3_cqe), cq->queue, 
+			  pci_unmap_addr(cq, mapping));
+	cxio_hal_put_cqid(rdev_p->rscp, cq->cqid);
+	return err;
+}
+
+int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (wq->size_log2))
+			  * sizeof(union t3_wr), wq->queue, 
+			  pci_unmap_addr(wq, mapping));
+	kfree(wq->sq);
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2));
+	kfree(wq->rq);
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return 0;
+}
+
+static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(T3_SEND) | 
+		         	 V_CQE_TYPE(0) |
+		         	 V_CQE_SWCQE(1) |
+		         	 V_CQE_QPID(wq->qpid) | 
+		         	 V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	u32 ptr;
+
+	PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq);
+
+	/* flush RQ */
+	PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, 
+	    wq->rq_rptr, wq->rq_wptr, count);
+	ptr = wq->rq_rptr + count;
+	while (ptr++ != wq->rq_wptr)
+		insert_recv_cqe(wq, cq);
+}
+
+static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, 
+		          struct t3_swsq *sqp)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(sqp->opcode) |
+			         V_CQE_TYPE(1) |
+			         V_CQE_SWCQE(1) |
+			         V_CQE_QPID(wq->qpid) | 
+			         V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	cqe.u.scqe.wrid_hi = sqp->sq_wptr;
+
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	__u32 ptr;
+	struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2);
+
+	ptr = wq->sq_rptr + count;
+	sqp += count;
+	while (ptr != wq->sq_wptr) {
+		insert_sq_cqe(wq, cq, sqp);
+		sqp++;
+		ptr++;
+	}
+}
+
+/* 
+ * Move all CQEs from the HWCQ into the SWCQ.
+ */
+void cxio_flush_hw_cq(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe, *swcqe;
+
+	PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid);
+	cqe = cxio_next_hw_cqe(cq);
+	while (cqe) {
+		PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", 
+		     __FUNCTION__, cq->rptr, cq->sw_wptr);
+		swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2);
+		*swcqe = *cqe;
+		swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1));
+		cq->sw_wptr++;
+		cq->rptr++;
+		cqe = cxio_next_hw_cqe(cq);
+	}
+}
+
+static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq)
+{
+	if (CQE_OPCODE(*cqe) == T3_TERMINATE) 
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) &&
+	    Q_EMPTY(wq->rq_rptr, wq->rq_wptr))
+		return 0;
+
+	return 1;
+}
+
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && 
+		    (CQE_QPID(*cqe) == wq->qpid))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	PDBG("%s count zero %d\n", __FUNCTION__, *count);
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && 
+		    (CQE_QPID(*cqe) == wq->qpid) && cqe_completes_wr(cqe, wq))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p)
+{
+	struct rdma_cq_setup setup;
+	setup.id = 0;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 1;		/* enable the CQ */
+	setup.credits = 0;
+
+	/* force SGE to redirect to RspQ and interrupt */
+	setup.credit_thres = 0;	
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	int err;
+	u64 sge_cmd, ctx0, ctx1;
+	u64 base_addr;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	err = cxio_hal_init_ctrl_cq(rdev_p);
+	if (err) {
+		PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err);
+		kfree_skb(skb);		/* do not leak the skb on failure */
+		return err;
+	}
+	rdev_p->ctrl_qp.workq = dma_alloc_coherent(
+				&(rdev_p->rnic_info.pdev->dev),
+				(1 << T3_CTRL_QP_SIZE_LOG2) *
+				sizeof(union t3_wr),
+				&(rdev_p->ctrl_qp.dma_addr), GFP_KERNEL);
+	if (!rdev_p->ctrl_qp.workq) {
+		PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__);
+		kfree_skb(skb);		/* do not leak the skb on failure */
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, 
+			   rdev_p->ctrl_qp.dma_addr);
+	rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	memset(rdev_p->ctrl_qp.workq, 0,
+	       (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr));
+
+	init_MUTEX(&rdev_p->ctrl_qp.sem);
+	init_waitqueue_head(&rdev_p->ctrl_qp.waitq);
+
+	/* update HW Ctrl QP context */
+	base_addr = rdev_p->ctrl_qp.dma_addr;
+	base_addr >>= 12;
+	ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) |
+		V_EC_BASE_LO((u32) base_addr & 0xffff));
+	ctx0 <<= 32;
+	ctx0 |= V_EC_CREDITS(FW_WR_NUM);
+	base_addr >>= 16;
+	ctx1 = (u32) base_addr;
+	base_addr >>= 32;
+	ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) |
+			V_EC_TYPE(0) | V_EC_GEN(1) |
+			V_EC_UP_TOKEN(T3_CTL_QP_TID) | F_EC_VALID)) << 32;
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1,
+		       T3_CTL_QP_TID, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	wqe->ctx1 = cpu_to_be64(ctx1);
+	wqe->ctx0 = cpu_to_be64(ctx0);
+	PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n",
+	     (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq,
+	     1 << T3_CTRL_QP_SIZE_LOG2);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << T3_CTRL_QP_SIZE_LOG2)
+			  * sizeof(union t3_wr), rdev_p->ctrl_qp.workq,
+			  pci_unmap_addr(&rdev_p->ctrl_qp, mapping));
+	return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID);
+}
+
+/* Write len bytes of data into addr (32B aligned address).
+ * If data is NULL, clear len bytes of memory to zero.
+ * The caller acquires the sem before the call.
+ */
+static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr,
+				      u32 len, void *data, int completion)
+{
+	u32 i, nr_wqe, copy_len;
+	u8 *copy_data;
+	u8 wr_len, utx_len;	/* length in 8 byte flit */
+	enum t3_wr_flags flag;
+	__be64 *wqe;
+	u64 utx_cmd;
+	addr &= 0x7FFFFFF;
+	nr_wqe = len % 96 ? len / 96 + 1 : len / 96;	/* 96B max per WQE */
+	PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n",
+	     __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, 
+	     nr_wqe, data, addr);
+	utx_len = 3;		/* in 32B unit */
+	for (i = 0; i < nr_wqe; i++) {
+		if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr,
+		           T3_CTRL_QP_SIZE_LOG2)) {
+			PDBG("%s ctrl_qp full wptr 0x%0x rptr 0x%0x, "
+			     "wait for more space i %d\n", __FUNCTION__, 
+			     rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i);
+			if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     !Q_FULL(rdev_p->ctrl_qp.rptr,
+						     rdev_p->ctrl_qp.wptr,
+						     T3_CTRL_QP_SIZE_LOG2))) {
+				PDBG("%s ctrl_qp workq interrupted\n",
+				     __FUNCTION__);
+				return -ERESTARTSYS;
+			}
+			PDBG("%s ctrl_qp wakeup, continue posting work request "
+			     "i %d\n", __FUNCTION__, i);
+		}
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+						(1 << T3_CTRL_QP_SIZE_LOG2)));
+		flag = 0;
+		if (i == (nr_wqe - 1)) {
+			/* last WQE */
+			flag = completion ? T3_COMPLETION_FLAG : 0;
+			if (len % 32)
+				utx_len = len / 32 + 1;
+			else
+				utx_len = len / 32;
+		}
+
+		/* 
+		 * Force a CQE to return the credit to the workq in case 
+		 * we posted more than half the max QP size of WRs 
+		 */
+		if ((i != 0) && 
+		    (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) {
+			flag = T3_COMPLETION_FLAG;
+			PDBG("%s force completion at i %d\n", __FUNCTION__, i);
+		}
+
+		/* build the utx mem command */
+		wqe += (sizeof(struct t3_bypass_wr) >> 3);
+		utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3);
+		utx_cmd <<= 32;
+		utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1);
+		*wqe = cpu_to_be64(utx_cmd);
+		wqe++;
+		copy_data = (u8 *) data + i * 96;
+		copy_len = len > 96 ? 96 : len;
+
+		/* clear memory content if data is NULL */
+		if (data)
+			memcpy(wqe, copy_data, copy_len);
+		else
+			memset(wqe, 0, copy_len);
+		if (copy_len % 32)
+			memset(((u8 *) wqe) + copy_len, 0,
+			       32 - (copy_len % 32));
+		wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + 
+			 (utx_len << 2);
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+			      (1 << T3_CTRL_QP_SIZE_LOG2)));
+
+		/* wptr in the WRID[31:0] */
+		((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr;
+
+		/* 
+		 * This must be the last write with a memory barrier 
+		 * for the genbit 
+		 */
+		build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag,
+			       Q_GENBIT(rdev_p->ctrl_qp.wptr,
+					T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID,
+			       wr_len);
+		if (flag == T3_COMPLETION_FLAG)
+			ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID);
+		len -= 96;
+		rdev_p->ctrl_qp.wptr++;
+	}
+	return 0;
+}
+
+/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size
+ * OUT: stag index, actual pbl_size, pbl_addr allocated.
+ * TBD: shared memory region support
+ */
+static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
+			 u32 *stag, u8 stag_state, u32 pdid,
+			 enum tpt_mem_type type, enum tpt_mem_perm perm,
+			 u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl,
+			 u32 *pbl_size, u32 *pbl_addr)
+{
+	int err;
+	struct tpt_entry tpt;
+	u32 stag_idx;
+	u32 wptr;
+	int rereg = (*stag != T3_STAG_UNSET);
+
+	stag_state = stag_state > 0;
+	stag_idx = (*stag) >> 8;
+
+	if (!reset_tpt_entry && !rereg) {
+		stag_idx = cxio_hal_get_stag(rdev_p->rscp);
+		if (!stag_idx)
+			return -ENOMEM;
+		*stag = (stag_idx << 8) | ((*stag) & 0xFF);
+	}
+	PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", 
+	     __FUNCTION__, stag_state, type, pdid, stag_idx);
+	
+	if (reset_tpt_entry) 
+		cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3);
+	else if (!rereg) {
+		*pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3);
+		if (!*pbl_addr) {
+			return -ENOMEM;
+		}
+	}
+
+	down(&rdev_p->ctrl_qp.sem);
+
+	/* write the PBL first, if any; update it only if a PBL list exists */
+	if (pbl) {
+
+		PDBG("%s *pbl_addr 0x%x, pbl_base 0x%x, pbl_size %d\n",
+		     __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, 
+		     *pbl_size);
+		err = cxio_hal_ctrl_qp_write_mem(rdev_p, 
+				(*pbl_addr >> 5),
+				(*pbl_size << 3), pbl, 0);
+		if (err)
+			goto ret;
+	}
+
+	/* write TPT entry */
+	if (reset_tpt_entry)
+		memset(&tpt, 0, sizeof(tpt));
+	else {
+		tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID |
+				V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) |
+				V_TPT_STAG_STATE(stag_state) |
+				V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid));
+		BUG_ON(page_size >= 28);
+		tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | 
+			    	F_TPT_MW_BIND_ENABLE |
+				V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) |
+				V_TPT_PAGE_SIZE(page_size));
+		tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : 
+				    cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3));
+		tpt.len = cpu_to_be32(len);
+		tpt.va_hi = cpu_to_be32((u32) (to >> 32));
+		tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL));
+		tpt.rsvd_bind_cnt_or_pstag = 0;
+		tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : 
+				  cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2));
+	}
+	err = cxio_hal_ctrl_qp_write_mem(rdev_p,
+				       stag_idx +
+				       (rdev_p->rnic_info.tpt_base >> 5),
+				       sizeof(tpt), &tpt, 1);
+
+	/* release the stag index to free pool */
+	if (reset_tpt_entry)
+		cxio_hal_put_stag(rdev_p->rscp, stag_idx);
+ret:	
+	wptr = rdev_p->ctrl_qp.wptr;
+	up(&rdev_p->ctrl_qp.sem);
+	if (!err)
+		if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     SEQ32_GE(rdev_p->ctrl_qp.rptr,
+						      wptr)))
+			return -ERESTARTSYS;
+	return err;
+}
+
+/* IN : stag key, pdid, pbl_size
+ * OUT: stag index, actual pbl_size, and pbl_addr allocated.
+ */
+int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, 
+			      perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr));
+}
+
+int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     &pbl_size, &pbl_addr);
+}
+
+int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid)
+{
+	u32 pbl_size = 0;
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0,
+			     NULL, &pbl_size, NULL);
+}
+
+int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     NULL, NULL);
+}
+
+int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
+{
+	struct t3_rdma_init_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+	PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p);
+	wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe));
+	wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT));
+	wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) |
+					   V_FW_RIWR_LEN(sizeof(*wqe) >> 3));
+	wqe->wrid.id1 = 0;
+	wqe->qpid = cpu_to_be32(attr->qpid);
+	wqe->pdid = cpu_to_be32(attr->pdid);
+	wqe->scqid = cpu_to_be32(attr->scqid);
+	wqe->rcqid = cpu_to_be32(attr->rcqid);
+	wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base);
+	wqe->rq_size = cpu_to_be32(attr->rq_size);
+	wqe->mpaattrs = attr->mpaattrs;
+	wqe->qpcaps = attr->qpcaps;
+	wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss);
+	wqe->flags = cpu_to_be32(attr->flags);
+	wqe->ord = cpu_to_be32(attr->ord);
+	wqe->ird = cpu_to_be32(attr->ird);
+	wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr);
+	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
+	wqe->rsvd = 0;
+	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = ev_cb;
+}
+
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = NULL;
+}
+
+static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb)
+{
+	static int cnt;
+	struct cxio_rdev *rdev_p = NULL;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x"
+	     " se %0x notify %0x cqbranch %0x creditth %0x\n",
+	     cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg),
+	     RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg),
+	     RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg),
+	     RSPQ_CREDIT_THRESH(rsp_msg));
+	PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d "
+	     "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), 
+	     CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+	     CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), 
+	     CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+	rdev_p = (struct cxio_rdev *)t3cdev_p->ulp;
+	if (!rdev_p) {
+		PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__,
+		     t3cdev_p);
+		return 0;
+	}
+	if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) {
+		rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1;
+		wake_up_interruptible(&rdev_p->ctrl_qp.waitq);
+		dev_kfree_skb_irq(skb);
+	} else if (CQE_QPID(rsp_msg->cqe) == 0xfff8)
+		dev_kfree_skb_irq(skb);
+	else if (cxio_ev_cb)
+		(*cxio_ev_cb) (rdev_p, skb);
+	else
+		dev_kfree_skb_irq(skb);
+	cnt++;
+	return 0;
+}
+
+/* Caller takes care of locking if needed */
+int cxio_rdev_open(struct cxio_rdev *rdev_p)
+{
+	struct net_device *netdev_p = NULL;
+	int err = 0;
+	if (strlen(rdev_p->dev_name)) {
+		if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) {
+			return -EBUSY;
+		}
+		netdev_p = dev_get_by_name(rdev_p->dev_name);
+		if (!netdev_p) {
+			return -EINVAL;
+		}
+		dev_put(netdev_p);
+	} else if (rdev_p->t3cdev_p) {
+		if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) {
+			return -EBUSY;
+		}
+		netdev_p = rdev_p->t3cdev_p->lldev;
+		strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name,
+			T3_MAX_DEV_NAME_LEN);
+	} else {
+		PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	if (cxio_hal_add_rdev(rdev_p))
+		return -ENOMEM;
+
+	PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name);
+	memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp));
+	if (!rdev_p->t3cdev_p)
+		rdev_p->t3cdev_p = T3CDEV(netdev_p);
+	rdev_p->t3cdev_p->ulp = (void *) rdev_p;
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS,
+					 &(rdev_p->rnic_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS,
+				    &(rdev_p->port_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+
+	/* 
+	 * qpshift is the number of bits to shift the qpid left in order
+	 * to get the correct address of the doorbell for that qp.
+	 */
+	cxio_init_ucontext(rdev_p, &rdev_p->uctx);
+	rdev_p->qpshift = PAGE_SHIFT - 
+			  long_log2(65536 >> 
+			            long_log2(rdev_p->rnic_info.udbell_len >> 
+					      PAGE_SHIFT));
+	rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT;
+	rdev_p->qpmask = (65536 >> long_log2(rdev_p->qpnr)) - 1;
+	PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d "
+	     "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", 
+	     __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, 
+  	     rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), 
+  	     rdev_p->rnic_info.pbl_base, 
+  	     rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base,
+  	     rdev_p->rnic_info.rqt_top);
+	PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu "
+	     "qpnr %d qpmask 0x%x\n", 
+	     rdev_p->rnic_info.udbell_len, 
+	     rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr,
+	     rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask);
+
+	err = cxio_hal_init_ctrl_qp(rdev_p);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", 
+		       __FUNCTION__, err);
+		goto err1;
+	}
+ 	err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0,
+				     0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ,
+				     T3_MAX_NUM_PD);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing hal resources.\n", 
+		       __FUNCTION__, err);
+		goto err2;
+	}
+ 	err = cxio_hal_pblpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing pbl mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err3;
+ 	}
+ 	err = cxio_hal_rqtpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing rqt mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err4;
+ 	}
+  	return 0;
+err4:
+ 	cxio_hal_pblpool_destroy(rdev_p);
+err3:
+ 	cxio_hal_destroy_resource(rdev_p->rscp);
+err2:
+	cxio_hal_destroy_ctrl_qp(rdev_p);
+err1:
+	cxio_hal_delete_rdev(rdev_p);
+	return err;
+}
+
+void cxio_rdev_close(struct cxio_rdev *rdev_p)
+{
+	if (rdev_p) {
+		cxio_hal_pblpool_destroy(rdev_p);
+		cxio_hal_rqtpool_destroy(rdev_p);
+		cxio_hal_delete_rdev(rdev_p);
+		rdev_p->t3cdev_p->ulp = NULL;
+		cxio_hal_destroy_ctrl_qp(rdev_p);
+		cxio_hal_destroy_resource(rdev_p->rscp);
+	}
+}
+
+int __init cxio_hal_init(void)
+{
+	if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI))
+		return -ENOMEM;
+	memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *));
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler);
+	return 0;
+}
+
+void __exit cxio_hal_exit(void)
+{
+	int i;
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL);
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		cxio_rdev_close(rdev_tbl[i]);
+	cxio_hal_destroy_rhdl_resource();
+}
+
+static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_swsq *sqp;
+	__u32 ptr = wq->sq_rptr;
+	int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr);
+	
+	sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
+	while (count--)
+		if (!sqp->signaled) {
+			ptr++;
+			sqp = wq->sq + Q_PTR2IDX(ptr,  wq->sq_size_log2);
+		} else if (sqp->complete) {
+
+			/* 
+			 * Insert this completed cqe into the swcq.
+			 */
+			PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n",
+			     __FUNCTION__, Q_PTR2IDX(ptr,  wq->sq_size_log2),
+			     Q_PTR2IDX(cq->sw_wptr, cq->size_log2));
+			sqp->cqe.header |= htonl(V_CQE_SWCQE(1));
+			*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) 
+				= sqp->cqe;
+			cq->sw_wptr++;
+			sqp->signaled = 0;
+			break;
+		} else
+			break;
+}
+
+static inline void create_read_req_cqe(struct t3_wq *wq,
+				       struct t3_cqe *hw_cqe,
+				       struct t3_cqe *read_cqe)
+{
+	read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr;
+	read_cqe->len = wq->oldest_read->read_len;
+	read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) |
+				 V_CQE_SWCQE(SW_CQE(*hw_cqe)) |
+				 V_CQE_OPCODE(T3_READ_REQ) |
+				 V_CQE_TYPE(1));
+}
+
+/*
+ * Point wq->oldest_read at the next read wr in the SWSQ, or at NULL.
+ */
+static inline void advance_oldest_read(struct t3_wq *wq)
+{
+
+	u32 rptr = wq->oldest_read - wq->sq + 1;
+	u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2);
+
+	while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) {
+		wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2);
+
+		if (wq->oldest_read->opcode == T3_READ_REQ)
+			return;
+		rptr++;
+	}
+	wq->oldest_read = NULL;
+}
+
+/*
+ * cxio_poll_cq
+ *
+ * Caller must:
+ *     check the validity of the first CQE,
+ *     supply the wq associated with the qpid.
+ *
+ * credit: cq credit to return to sge.
+ * cqe_flushed: 1 iff the CQE is flushed.
+ * cqe: copy of the polled CQE.
+ *
+ * return value:
+ *     0       CQE returned,
+ *    -1       CQE skipped, try again.
+ */
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit)
+{
+	int ret = 0;
+	struct t3_cqe *hw_cqe, read_cqe;
+
+	*cqe_flushed = 0;
+	*credit = 0;
+	hw_cqe = cxio_next_cqe(cq);
+
+	PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x"
+	     " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), 
+	     CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), 
+	     CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), 
+	     CQE_WRID_LOW(*hw_cqe));
+
+	/* 
+	 * skip cqe's not affiliated with a QP.
+	 */
+	if (wq == NULL) {
+		ret = -1;
+		goto skip_cqe;
+	}
+
+	/*
+	 * Gotta tweak READ completions:
+	 * 	1) the cqe doesn't contain the sq_wptr from the wr.
+	 *	2) opcode not reflected from the wr.
+	 *	3) read_len not reflected from the wr.
+	 *	4) cq_type is RQ_TYPE not SQ_TYPE.
+	 */
+	if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) {
+		
+		/* 
+	 	 * Don't write to the HWCQ, so create a new read req CQE 
+		 * in local memory.
+		 */
+		create_read_req_cqe(wq, hw_cqe, &read_cqe);
+		hw_cqe = &read_cqe;
+		advance_oldest_read(wq);
+	}
+
+	/*
+ 	 * T3A: Discard TERMINATE CQEs.
+	 */
+	if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) {
+		ret = -1;
+		wq->error = 1;
+		goto skip_cqe;
+	}
+
+	if (CQE_STATUS(*hw_cqe) || wq->error) {
+		*cqe_flushed = wq->error;
+		wq->error = 1;
+	
+		/* 
+		 * T3A inserts errors into the CQE.  We cannot return 
+	 	 * these as work completions.
+	 	 */
+		/* incoming write failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) 
+		     && RQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		/* incoming read request failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+
+		/* incoming SEND with no receive posted failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) &&
+		    Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/*
+	 * RECV completion.
+	 */
+	if (RQ_TYPE(*hw_cqe)) {
+
+		/* 
+		 * HW only validates 4 bits of MSN.  So we must validate that
+		 * the MSN in the SEND is the next expected MSN.  If it's not,
+		 * then we complete this with TPT_ERR_MSN and mark the wq in 
+		 * error.
+		 */
+		if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) {
+			wq->error = 1;
+			hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN));
+			goto proc_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/* 
+	 * If we get here it's a send completion.
+	 *
+	 * Handle out of order completion. These get stuffed
+	 * in the SW SQ. Then the SW SQ is walked to move any
+	 * now in-order completions into the SW CQ.  This handles
+	 * 2 cases:
+	 * 	1) reaping unsignaled WRs when the first subsequent
+	 *	   signaled WR is completed.
+	 *	2) out of order read completions.
+	 */
+	if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) {
+		struct t3_swsq *sqp;
+
+		PDBG("%s out of order completion going in swsq at idx %ld\n",
+		     __FUNCTION__, 
+		     Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2));
+		sqp = wq->sq + 
+		      Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2);
+		sqp->cqe = *hw_cqe;
+		sqp->complete = 1;
+		ret = -1;
+		goto flush_wq;
+	}
+	
+proc_cqe:
+	*cqe = *hw_cqe;
+
+	/*
+	 * Reap the associated WR(s) that are freed up with this
+	 * completion.
+	 */
+	if (SQ_TYPE(*hw_cqe)) {
+		wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe);
+		PDBG("%s completing sq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2));
+		*cookie = (wq->sq + 
+			   Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id;
+		wq->sq_rptr++;
+	} else {
+		PDBG("%s completing rq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		*cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		wq->rq_rptr++;
+	}
+
+flush_wq:
+	/*
+	 * Flush any completed cqes that are now in-order.
+	 */
+	flush_completed_wrs(wq, cq);
+
+skip_cqe:
+	if (SW_CQE(*hw_cqe)) {
+		PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->sw_rptr);
+		++cq->sw_rptr;
+	} else {
+		PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->rptr);
+		++cq->rptr;
+
+		/*
+		 * T3A: compute credits.
+		 */
+		if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1)))
+		    || ((cq->rptr - cq->wptr) >= 128)) {
+			*credit = cq->rptr - cq->wptr;
+			cq->wptr = cq->rptr;
+		}
+	}
+	return ret;
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
new file mode 100644
index 0000000..bde5cfb
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef  __CXIO_HAL_H__
+#define  __CXIO_HAL_H__
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#include "t3_cpl.h"
+#include "t3cdev.h"
+#include "cxgb3_ctl_defs.h"
+#include "cxio_wr.h"
+
+#define T3_CTRL_QP_ID    FW_RI_SGEEC_START
+#define T3_CTL_QP_TID	 FW_RI_TID_START
+#define T3_CTRL_QP_SIZE_LOG2  8
+#define T3_CTRL_CQ_ID    0
+
+/* TBD */
+#define T3_MAX_NUM_RNIC  8
+#define T3_MAX_NUM_RI (1<<15)
+#define T3_MAX_NUM_QP (1<<15)
+#define T3_MAX_NUM_CQ (1<<15)
+#define T3_MAX_NUM_PD (1<<15)
+#define T3_MAX_PBL_SIZE 256
+#define T3_MAX_RQ_SIZE 1024
+#define T3_MAX_NUM_STAG (1<<15)
+
+#define T3_STAG_UNSET 0xffffffff
+
+#define T3_MAX_DEV_NAME_LEN 32
+
+struct cxio_hal_ctrl_qp {
+	u32 wptr;
+	u32 rptr;
+	struct semaphore sem;	/* for the wptr, can sleep */
+	wait_queue_head_t waitq;	/* wait for RspQ/CQE msg */
+	union t3_wr *workq;	/* the work request queue */
+	dma_addr_t dma_addr;	/* pci bus address of the workq */
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	void __iomem *doorbell;
+};
+
+struct cxio_hal_resource {
+	struct kfifo *tpt_fifo;
+	spinlock_t tpt_fifo_lock;
+	struct kfifo *qpid_fifo;
+	spinlock_t qpid_fifo_lock;
+	struct kfifo *cqid_fifo;
+	spinlock_t cqid_fifo_lock;
+	struct kfifo *pdid_fifo;
+	spinlock_t pdid_fifo_lock;
+};
+
+struct cxio_qpid_list {
+	struct list_head entry;
+	u32 qpid;
+};
+
+struct cxio_ucontext {
+	struct list_head qpids;
+	struct mutex lock;
+};
+
+struct cxio_rdev {
+	char dev_name[T3_MAX_DEV_NAME_LEN];
+	struct t3cdev *t3cdev_p;
+	struct rdma_info rnic_info;
+	struct adap_ports port_info;
+	struct cxio_hal_resource *rscp;
+	struct cxio_hal_ctrl_qp ctrl_qp;
+	void *ulp;
+	unsigned long qpshift;
+	u32 qpnr;
+	u32 qpmask;
+	struct cxio_ucontext uctx;
+	struct gen_pool *pbl_pool;
+	struct gen_pool *rqt_pool;
+};
+
+static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
+{
+	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
+}
+
+typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p,
+					     struct sk_buff * skb);
+
+#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff)
+#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff)
+#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1)
+#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1)
+#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1)
+#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1)
+#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1)
+#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1)
+#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1)
+
+struct respQ_msg_t {
+	__be32 flags;		/* flit 0 */
+	__be32 cq_ptrid;
+	__be64 rsvd;		/* flit 1 */
+	struct t3_cqe cqe;	/* flits 2-3 */
+};
+
+enum t3_cq_opcode {
+	CQ_ARM_AN = 0x2,
+	CQ_ARM_SE = 0x6,
+	CQ_FORCE_AN = 0x3,
+	CQ_CREDIT_UPDATE = 0x7
+};
+
+int cxio_rdev_open(struct cxio_rdev *rdev);
+void cxio_rdev_close(struct cxio_rdev *rdev);
+int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, 
+	 	   enum t3_cq_opcode op, u32 credit);
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid);
+int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq,
+		   struct cxio_ucontext *uctx);
+int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx);
+int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode);
+int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr);
+int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr);
+int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
+int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag);
+int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr);
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+u32 cxio_hal_get_rhdl(void);
+void cxio_hal_put_rhdl(u32 rhdl);
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp);
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid);
+int __init cxio_hal_init(void);
+void __exit cxio_hal_exit(void);
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_flush_hw_cq(struct t3_cq *cq);
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+
+#define MOD "iw_cxgb3: "
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
+
+#ifdef DEBUG
+void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag);
+void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift);
+void cxio_dump_wqe(union t3_wr *wqe);
+void cxio_dump_wce(struct t3_cqe *wce);
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents);
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid);
+#endif
+
+#endif


From swise at opengridcomputing.com  Sat Dec  2 14:51:09 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:51:09 -0600
Subject: [openib-general] [PATCH  v2 11/13] Core Resource Allocation
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225108.27014.11770.stgit@dell3.ogc.int>


Core functions to carve up adapter memory, stag, qp, and cq IDs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_resource.c |  331 ++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_resource.h |   70 +++++
 2 files changed, 401 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
new file mode 100644
index 0000000..444df15
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
@@ -0,0 +1,331 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+/* Crude resource management */
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+
+static struct kfifo *rhdl_fifo;
+static spinlock_t rhdl_fifo_lock;
+
+#define RANDOM_SIZE 16
+
+static int __cxio_init_resource_fifo(struct kfifo **fifo,
+				   spinlock_t *fifo_lock,
+				   u32 nr, u32 skip_low,
+				   u32 skip_high,
+				   int random)
+{
+	u32 i, j, entry = 0, idx;
+	u32 random_bytes;
+	u32 rarray[RANDOM_SIZE];
+	spin_lock_init(fifo_lock);
+
+	*fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock);
+	if (IS_ERR(*fifo))
+		return -ENOMEM;
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		__kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32));
+	if (random) {
+		j = 0;
+		random_bytes = random32();
+		for (i = 0; i < RANDOM_SIZE; i++)
+			rarray[i] = i + skip_low;
+		for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
+			if (j >= RANDOM_SIZE) {
+				j = 0;
+				random_bytes = random32();
+			}
+			idx = (random_bytes >> (j * 2)) & 0xF;
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[idx],
+				sizeof(u32));
+			rarray[idx] = i;
+			j++;	
+		}
+		for (i = 0; i < RANDOM_SIZE; i++)
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[i],
+				sizeof(u32));
+	} else
+		for (i = skip_low; i < nr - skip_high; i++)
+			__kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32));
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32));
+	return 0;
+}
+
+static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 0));
+}
+
+static int cxio_init_resource_fifo_random(struct kfifo **fifo,
+				   spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 1));
+}
+
+static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p)
+{
+	u32 i;
+
+	spin_lock_init(&rdev_p->rscp->qpid_fifo_lock);
+
+	rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), 
+					      GFP_KERNEL, 
+					      &rdev_p->rscp->qpid_fifo_lock);
+	if (IS_ERR(rdev_p->rscp->qpid_fifo))
+		return -ENOMEM;
+
+	for (i = 16; i < T3_MAX_NUM_QP; i++)
+		if (!(i & rdev_p->qpmask))
+			__kfifo_put(rdev_p->rscp->qpid_fifo, 
+				    (unsigned char *) &i, sizeof(u32));
+	return 0;
+}
+
+int cxio_hal_init_rhdl_resource(u32 nr_rhdl)
+{
+	return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1,
+				       0);
+}
+
+void cxio_hal_destroy_rhdl_resource(void)
+{
+	kfifo_free(rhdl_fifo);
+}
+
+/* nr_* must be power of 2 */
+int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+			   u32 nr_tpt, u32 nr_pbl,
+			   u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid)
+{
+	int err = 0;
+	struct cxio_hal_resource *rscp;
+
+	rscp = kmalloc(sizeof(*rscp), GFP_KERNEL);
+	if (!rscp)
+		return -ENOMEM;
+	rdev_p->rscp = rscp;
+	err = cxio_init_resource_fifo_random(&rscp->tpt_fifo,
+				      &rscp->tpt_fifo_lock, 
+				      nr_tpt, 1, 0);
+	if (err)
+		goto tpt_err;
+	err = cxio_init_qpid_fifo(rdev_p);
+	if (err)
+		goto qpid_err;
+	err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, 
+				      nr_cqid, 1, 0);
+	if (err)
+		goto cqid_err;
+	err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, 
+				      nr_pdid, 1, 0);
+	if (err)
+		goto pdid_err;
+	return 0;
+pdid_err:
+	kfifo_free(rscp->cqid_fifo);
+cqid_err:
+	kfifo_free(rscp->qpid_fifo);
+qpid_err:
+	kfifo_free(rscp->tpt_fifo);
+tpt_err:
+	return -ENOMEM;
+}
+
+/*
+ * returns 0 if no resource available
+ */
+static inline u32 cxio_hal_get_resource(struct kfifo *fifo)
+{
+	u32 entry;
+	if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32)))
+		return entry;
+	else
+		return 0;	/* fifo empty */
+}
+
+static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry)
+{
+	BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0);
+}
+
+u32 cxio_hal_get_rhdl(void)
+{
+	return cxio_hal_get_resource(rhdl_fifo);
+}
+
+void cxio_hal_put_rhdl(u32 rhdl)
+{
+	cxio_hal_put_resource(rhdl_fifo, rhdl);
+}
+
+u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->tpt_fifo);
+}
+
+void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag)
+{
+	cxio_hal_put_resource(rscp->tpt_fifo, stag);
+}
+
+u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp)
+{
+	u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid)
+{
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	cxio_hal_put_resource(rscp->qpid_fifo, qpid);
+}
+
+u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->cqid_fifo);
+}
+
+void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid)
+{
+	cxio_hal_put_resource(rscp->cqid_fifo, cqid);
+}
+
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->pdid_fifo);
+}
+
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid)
+{
+	cxio_hal_put_resource(rscp->pdid_fifo, pdid);
+}
+
+void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp)
+{
+	kfifo_free(rscp->tpt_fifo);
+	kfifo_free(rscp->cqid_fifo);
+	kfifo_free(rscp->qpid_fifo);
+	kfifo_free(rscp->pdid_fifo);
+	kfree(rscp);
+}
+
+/*
+ * PBL Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_PBL_SHIFT 8			/* 256B == min PBL size (32 entries) */
+#define PBL_CHUNK (2*1024*1024)
+
+u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size);
+	return (u32)addr;
+}
+
+void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size);
+	gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size);
+}
+
+int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1);
+	if (rdev_p->pbl_pool)
+		for (i = rdev_p->rnic_info.pbl_base; 
+		     i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; 
+		     i += PBL_CHUNK)
+			gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1);
+	return rdev_p->pbl_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->pbl_pool);
+}
+
+/*
+ * RQT Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_RQT_SHIFT 10	/* 1KB == min RQT size (16 entries) */
+#define RQT_CHUNK (2*1024*1024)
+
+u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6);
+	return (u32)addr;
+}
+
+void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6);
+	gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6);
+}
+
+int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1);
+	if (rdev_p->rqt_pool)
+		for (i = rdev_p->rnic_info.rqt_base; 
+		     i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; 
+		     i += RQT_CHUNK)
+			gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1);
+	return rdev_p->rqt_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->rqt_pool);
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
new file mode 100644
index 0000000..a6bbe83
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_RESOURCE_H__
+#define __CXIO_RESOURCE_H__
+
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/genalloc.h>
+#include "cxio_hal.h"
+
+extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl);
+extern void cxio_hal_destroy_rhdl_resource(void);
+extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+				  u32 nr_tpt, u32 nr_pbl,
+				  u32 nr_rqt, u32 nr_qpid, u32 nr_cqid,
+				  u32 nr_pdid);
+extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag);
+extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid);
+extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid);
+extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp);
+
+#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base )
+extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+
+#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base )
+extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+#endif


From swise at opengridcomputing.com  Sat Dec  2 14:51:19 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:51:19 -0600
Subject: [openib-general] [PATCH  v2 12/13] Core Debug functions
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225119.27014.65672.stgit@dell3.ogc.int>


Debug code to dump various data structs, some of which are in 
adapter memory.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_dbg.c |  205 +++++++++++++++++++++++++++
 1 files changed, 205 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
new file mode 100644
index 0000000..22f4f75
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifdef DEBUG
+#include <linux/types.h>
+#include "common.h"
+#include "cxgb3_ioctl.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) 
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size = 32;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base;
+	m->len = size;
+	PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift)
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size, npages;
+
+	shift += 12;
+	npages = (len + (1ULL << shift) - 1) >> shift;
+	size = npages * sizeof(u64);
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = pbl_addr;
+	m->len = size;
+	PDBG("%s PBL addr 0x%x len %d depth %d\n", 
+		__FUNCTION__, m->addr, m->len, npages);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_wqe(union t3_wr *wqe)
+{
+	__be64 *data = (__be64 *)wqe;
+	uint size = (uint)(be64_to_cpu(*data) & 0xff);
+
+	if (size == 0) 
+		size = 8;
+	while (size > 0) {
+		PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data));
+		size--;
+		data++;
+	}
+}
+
+void cxio_dump_wce(struct t3_cqe *wce)
+{
+	__be64 *data = (__be64 *)wce;
+	int size = sizeof(*wce);
+
+	while (size > 0) {
+		PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data));
+		size -= 8;
+		data++;
+	}
+}
+
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents)
+{
+	struct ch_mem_range *m;
+	int size = nents * 64;
+	u64 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base;
+	m->len = size;
+	PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid)
+{
+	struct ch_mem_range *m;
+	int size = TCB_SIZE;
+	u32 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_CM;
+	m->addr = hwtid * size; 
+	m->len = size;
+	PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u32 *)m->buf;
+	while (size > 0) {
+		printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", 
+			m->addr, 
+			*(data+2), *(data+3), *(data),*(data+1),
+			*(data+6), *(data+7), *(data+4), *(data+5));
+		size -= 32;
+		data += 8;
+		m->addr += 32;
+	}
+	kfree(m);
+}
+#endif


From swise at opengridcomputing.com  Sat Dec  2 14:51:29 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sat, 02 Dec 2006 16:51:29 -0600
Subject: [openib-general] [PATCH  v2 13/13] Kconfig/Makefile
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202225129.27014.42302.stgit@dell3.ogc.int>


Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/Kconfig              |    1 +
 drivers/infiniband/Makefile             |    1 +
 drivers/infiniband/hw/cxgb3/Kconfig     |   27 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/Makefile    |   12 ++++++++++++
 drivers/infiniband/hw/cxgb3/locking.txt |   25 +++++++++++++++++++++++++
 5 files changed, 66 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)		+= hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH)		+= hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA)		+= hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100)	+= hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3)		+= hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 0000000..84f0f6e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+	tristate "Chelsio RDMA Driver"
+	depends on CHELSIO_T3 && INFINIBAND
+	select GENERIC_ALLOCATOR
+	---help---
+	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+	  10GbE adapters.
+
+	  For general information about Chelsio and our products, visit
+	  our website at <http://www.chelsio.com>.
+
+	  For customer support, please visit our customer support page at
+	  <http://www.chelsio.com/support.htm>.
+
+	  Please send feedback to <linux-bugs at chelsio.com>.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_CXGB3
+	default n
+	---help---
+	  This option causes the Chelsio RDMA driver to produce copious
+	  amounts of debug messages.  Select this if you are developing
+	  the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..0df2b3d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+		-I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core 
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y :=  iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+	       iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -O1 -g 
+iw_cxgb3-y += core/cxio_dbg.o
+endif
diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
new file mode 100644
index 0000000..e5e9991
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/locking.txt
@@ -0,0 +1,25 @@
+cq lock:
+	- spin lock
+	- used to synchronize the t3_cq
+
+qp lock:
+	- spin lock
+	- used to synchronize updates to the qp state, attrs, and the t3_wq.
+	- touched on interrupt and process context
+	
+rnicp lock:
+	- spin lock
+	- touched on interrupt and process context
+	- used around lookup tables mapping CQID and QPID to a structure.
+	- used also to bump the refcnt atomically with the lookup.
+
+poll:
+	lock+disable on cq lock
+		lock qp lock for each cqe that is polled around the call
+		to cxio_poll_cq().
+	
+post: 
+	lock+disable qp lock
+
+global mutex iwch_mutex:
+	used to maintain global device list.


From romieu at fr.zoreil.com  Sat Dec  2 15:13:30 2006
From: romieu at fr.zoreil.com (Francois Romieu)
Date: Sun, 3 Dec 2006 00:13:30 +0100
Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <20061202224917.27014.15424.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
Message-ID: <20061202231329.GA10719@electric-eye.fr.zoreil.com>

Steve Wise <swise at opengridcomputing.com> :
[...]
> Version 2 changes:
> 
> - Make code sparse endian clean
> - Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
> - Clean up confusing bitfields
> - Use random32() instead of local random function
> - Use krefs to track endpoint reference counts
> - Misc nits
> 
> -----
> 
> The following series implements the Chelsio T3 iWARP/RDMA Driver to
> be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
> Ethernet Driver which is also under review now for 2.6.20. See:

I understood that Stephen expressed some doubts regarding the inclusion
of TOE enabled features.

Was his point addressed ?

-- 
Ueimor


From shemminger at osdl.org  Sat Dec  2 16:24:47 2006
From: shemminger at osdl.org (Stephen Hemminger)
Date: Sat, 02 Dec 2006 16:24:47 -0800
Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <20061202231329.GA10719@electric-eye.fr.zoreil.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202231329.GA10719@electric-eye.fr.zoreil.com>
Message-ID: <4572194F.8060309@osdl.org>

Francois Romieu wrote:
> Steve Wise <swise at opengridcomputing.com> :
> [...]
>   
>> Version 2 changes:
>>
>> - Make code sparse endian clean
>> - Use IDRs for mapping QP and CQ IDs to structure pointers instead of arrays
>> - Clean up confusing bitfields
>> - Use random32() instead of local random function
>> - Use krefs to track endpoint reference counts
>> - Misc nits
>>
>> -----
>>
>> The following series implements the Chelsio T3 iWARP/RDMA Driver to
>> be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
>> Ethernet Driver which is also under review now for 2.6.20. See:
>>     
>
> I understood that Stephen expressed some doubts regarding the inclusion
> of TOE enabled features.
>
> Was his point addressed ?
>
>   

My comments were about different hardware.


From dotanb at dev.mellanox.co.il  Sat Dec  2 22:34:32 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 03 Dec 2006 08:34:32 +0200
Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in
 "send_lat"
In-Reply-To: <20061202213454.GB31661@cse.ohio-state.edu>
References: <20061202213454.GB31661@cse.ohio-state.edu>
Message-ID: <45726FF8.3000807@dev.mellanox.co.il>

Hi Sayantan.
Sayantan Sur wrote:

>Hi,
>
>I have a question about the "status" field for a completion which is due
>to RNR retry exceeded error. I trivially modified the `send_lat' program
>(from the Gen2 perftest directory) to use SRQ and not post receives
>after some specified time. Given the "rnr_retry" attribute of the QP not
>to be 7 (infinite retry), I'm expecting the sender to get an erroneous
>completion with IBV_WC_RNR_RETRY_EXC_ERR.
>
>So far so good ... however, the completion I pull out of the send_cq,
>lists the opcode of the completion to be IBV_WC_RECV! Is this expected?
>
>I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs
>(two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant
>Update 3), with kernel version 2.6.17.7.
>
>If someone could explain this behavior, or suggest a workaround, it'd be
>great.
>
>TIA,
>Sayantan.
>  
>
I took the following text from the man pages that I wrote for libibverbs:
"Not all wc attributes are always valid. If the completion status is
other than IBV_WC_SUCCESS, only the following attributes are valid:
wr_id, status, qp_num, and vendor_err."

In other words, the opcode is not valid if you have a completion with an error.

Thanks
Dotan
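
A minimal sketch of the rule quoted above, written against stand-in types
rather than the real libibverbs headers so that it compiles on its own.
Every sketch_/SKETCH_ name is hypothetical; only the field names (wr_id,
status, opcode, vendor_err, qp_num) and the idea of testing the status
before reading anything else come from the man page text.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the libibverbs types; hypothetical, for illustration only. */
enum sketch_wc_status { SKETCH_WC_SUCCESS = 0, SKETCH_WC_RNR_RETRY_EXC_ERR };
enum sketch_wc_opcode { SKETCH_WC_SEND = 0, SKETCH_WC_RECV };

struct sketch_wc {
	uint64_t wr_id;
	enum sketch_wc_status status;
	enum sketch_wc_opcode opcode;	/* meaningful only on success */
	uint32_t vendor_err;
	uint32_t qp_num;
};

/*
 * Returns 1 and stores the opcode in *op when the completion succeeded.
 * Returns 0 on an error completion, reading only the fields the man page
 * lists as valid in that case and leaving *op untouched.
 */
static int sketch_handle_wc(const struct sketch_wc *wc,
			    enum sketch_wc_opcode *op)
{
	if (wc->status != SKETCH_WC_SUCCESS) {
		fprintf(stderr,
			"wr_id %llu failed: status %d vendor_err 0x%x qp 0x%x\n",
			(unsigned long long)wc->wr_id, (int)wc->status,
			(unsigned)wc->vendor_err, (unsigned)wc->qp_num);
		return 0;
	}
	*op = wc->opcode;
	return 1;
}
```

With this shape, a caller that needs to know which request failed would
look it up by wr_id instead of trusting the opcode of the failed
completion.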


From mst at mellanox.co.il  Sat Dec  2 23:12:55 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 3 Dec 2006 09:12:55 +0200
Subject: [openib-general] userspace git conversion status/cut over
In-Reply-To: <OFF7366536.951267CD-ONC1257236.006D73DC-C1257236.006EA120@de.ibm.com>
References: <1164904025.11808.133123.camel@hal.voltaire.com>
	<OFF7366536.951267CD-ONC1257236.006D73DC-C1257236.006EA120@de.ibm.com>
Message-ID: <20061203071255.GA4377@mellanox.co.il>

> Michael, I was reluctant to answer you, because I could remember you wrote
> once in a thread asking each maintainer to create/define branch names
> in a format so that ofed build script can pick the code properly.
> Unfortunately I could not find that anymore. Can you pls restate that?
> Currently libehca has only 1.0 and 1.1 (which is ofed-1.1.1).
> Thanks
> Nam

you only need 2 tags:

OFED 1.1 should be
refs/tags/vofed-1.1

Assuming you have code for 1.0 (which I don't recall ehca having), tag it with
refs/tags/vofed-1.0

otherwise remove the tag

-- 
MST


From ogerlitz at voltaire.com  Sat Dec  2 23:45:54 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 03 Dec 2006 09:45:54 +0200
Subject: [openib-general] NFS/RDMA for Linux: client and server update
 release 7
In-Reply-To: <EXNANE01tSgcOrBLSAQ00000180@exnane01.hq.netapp.com>
References: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com>
	<aday7prky6u.fsf@cisco.com>
	<EXNANE01tSgcOrBLSAQ00000180@exnane01.hq.netapp.com>
Message-ID: <457280B2.3030709@voltaire.com>

Talpey, Thomas wrote:
> At 06:12 PM 12/1/2006, Roland Dreier wrote:
>> What is the status of moving this code towards merging to the upstream kernel?
> 
> For the client there are two main prerequisites, both in the RPC layer
> and both in progress. One is the completion of the RPC transport switch
> merge, mainly the ability to load as modules. The second is a new mount
> syscall api, to allow transport-specific arguments to be passed in. We
> have a temporary solution for that at the moment. When these two are
> in place, the client is ready to consider merging.

When these two are ready for merging, note that you don't have to wait 
for them to be merged in rX and then push the client for rX+1; you can 
push them all together. Moreover, if the rnfs client is the only user of 
these features, you might not be able to push them without it being 
pushed as well.

> Bottom line, we can put it on the table soon.

As was stated on this list a few times in the past, since your code is an 
RDMA driver that was never sent out to this list for RFC (sending a 
pointer to some tgz does not count, as it's not the common practice in the 
Linux kernel open source dev cycle), you had better put it on the table 
sooner rather than later. Since 2.6.20 has been opened, it seems the correct 
time if you are considering pushing it for 2.6.21.

Or.


From ogerlitz at voltaire.com  Sun Dec  3 00:31:43 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 03 Dec 2006 10:31:43 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
Message-ID: <45728B6F.6040905@voltaire.com>

Ralph Campbell wrote:
>> On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
>>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:

>>>> So what did you change since v1?  How do you deal with fitting 64-bit
>>>> addresses into an sg list entry that has a 32-bit dma_addr_t?

> Although the driver compiles on 32-bit kernels, it is unsupported
> and never been tested. All known 64-bit systems don't define
> CONFIG_HIGHMEM.  In spite of previous emails suggesting that
> page_address() can return NULL without CONFIG_HIGHMEM defined,
> the code in include/linux/mm.h doesn't allow it (assuming the
> page pointer is valid and not some random address).
> I verified this with Andrew Morton.

Can you provide the quote from include/linux/mm.h of the code that 
disallows it? Looking there, I don't see the enforcement.

Hmm, your consultation with Andrew Morton was not on this thread... well,
Christoph Hellwig's comment on the V1 thread tells a different story:

Only for GFP_KERNEL allocations you can assume page_address is valid, 
and the scatterlist passed to a SCSI LLDD can contain any type of pages. 
  Currently on all 64bit architectures page_address works on all pages, 
but that's an implementation detail that could change any time and that 
you should not rely on.

see http://www.mail-archive.com/openib-general at openib.org/msg27132.html

As I have mentioned in the past, this (no kvaddr for a page) comes into 
play when a SCSI LLD (e.g. iSER, SRP) gets DIRECT I/O or AIO (SDP) pages 
from user space.

Or.


From ogerlitz at voltaire.com  Sun Dec  3 00:36:35 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 03 Dec 2006 10:36:35 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
Message-ID: <45728C93.6020504@voltaire.com>

Ralph Campbell wrote:
> Basically, use a hash table to store the kmap result.
> See attached for 90% of the code.
> static u64 ipath_dma_map_page(struct ib_device *dev,
> 			      struct page *page,
> 			      unsigned long offset,
> 			      size_t size,
> 			      enum dma_data_direction direction)
> {
> 	u64 addr;
> 
> 	BUG_ON(!valid_dma_direction(direction));
> 
> 	if (offset + size > PAGE_SIZE) {
> 		addr = BAD_DMA_ADDRESS;
> 		goto done;
> 	}
> 
> #ifdef CONFIG_HIGHMEM
> 	/* handle highmem pages */
> 	if (PageHighMem(page)) {
> 		void *v = kmap(page);

Another comment we got on iSER is that this code can be called from a 
context that requires kmap_atomic (and xxx_dma_unmap_page from a context 
that requires kunmap_atomic). This poses another problem: the 
kmap_atomic slots are limited, and with this patch the ipath 
driver would hold those mappings for a relatively long time (i.e. it does 
not kmap/copy/kunmap).

> 
> 		if (!v)
> 			addr = BAD_DMA_ADDRESS;
> 		else {
> 			addr = (u64) v + offset;
> 			hash_insert(dev, v + offset, page);
> 		}
> 		goto done;
> 	}
> #endif
> 	addr = (u64) page_address(page);
> 	if (addr)
> 		addr += offset;
> 
> done:
> 	return addr;
> }


From ogerlitz at voltaire.com  Sun Dec  3 00:42:55 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 03 Dec 2006 10:42:55 +0200
Subject: [openib-general] [PATCH v2 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <1164911024.14800.74.camel@brick.pathscale.com>
References: <1164911024.14800.74.camel@brick.pathscale.com>
Message-ID: <45728E0F.9020106@voltaire.com>

Ralph Campbell wrote:
> This patch implements the interposing DMA mapping functions to allow
> support for IOMMUs and remove the dependence on phys_to_virt().

> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Wed Nov 29 13:55:07 2006 -0800
> +/**
> + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> + * @device: The device for which the dma_addr is to be created
> + * @cpu_addr: The kernel virtual address
> + * @size: The size of the region in bytes
> + * @direction: The direction of the DMA
> + */
> +static u64 ipath_dma_map_single(struct ib_device *dev,
> +			        void *cpu_addr, size_t size,
> +			        enum dma_data_direction direction)
> +{
> +	BUG_ON(!valid_dma_direction(direction));
> +	return (u64) cpu_addr;
> +}

If ipath_dma_map_single is a no-op,

> +/**
> + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU
> + * @device: The device for which the DMA address was created
> + * @addr: The DMA address
> + * @size: The size of the region in bytes
> + * @dir: The direction of the DMA
> + */
> +static void ipath_sync_single_for_cpu(struct ib_device *dev,
> +				      u64 addr,
> +				      size_t size,
> +				      enum dma_data_direction dir)
> +{
> +	dma_sync_single_for_cpu(dev->dma_device, addr, size, dir);
> +}

then why does ipath_sync_single_for_cpu do something? Am I just pointing 
out a cleanup, or is there something deeper here?

Or.


From tziporet at dev.mellanox.co.il  Sun Dec  3 02:19:04 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 03 Dec 2006 12:19:04 +0200
Subject: [openib-general] reminder: OFED 1.2 meeting next Monday
In-Reply-To: <1164895558.11808.128480.camel@hal.voltaire.com>
References: <456EE52E.2060208@dev.mellanox.co.il>
	<1164895558.11808.128480.camel@hal.voltaire.com>
Message-ID: <4572A498.4090201@dev.mellanox.co.il>

Hal Rosenstock wrote:
> On Thu, 2006-11-30 at 09:05, Tziporet Koren wrote:
>   
>> Hi All,
>> I wish to remind all that we have the EWG meeting on Monday 4-Dec at 
>> 9am-10am.
>>     
>
> Which tz ?
>
>   
9-10am PST

Meeting details (if you don't have them):
______________________________________________________________________________ 

Jeffrey Squyres has invited you to a Cisco MeetingPlace Conference

Date/Time:               DEC 4, 2006 at 12:00PM America/New_York
Length:                  60
Frequency:               10
Meeting ID:              2106670
Meeting Password:       

Global Access Numbers:
http://cisco.com/en/US/about/doing_business/conferencing/index.html

    US/Canada:  +1.866.432.9903    United Kingdom:   +44.20.8824.0117
    India:      +91.80.4103.3979   Germany:          +49.619.6773.9002
    Japan:      +81.3.5763.9394    China:            +86.10.8515.5666


From arjan at infradead.org  Sun Dec  3 04:07:18 2006
From: arjan at infradead.org (Arjan van de Ven)
Date: Sun, 03 Dec 2006 13:07:18 +0100
Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data
	Structures
In-Reply-To: <20061202224947.27014.59189.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224947.27014.59189.stgit@dell3.ogc.int>
Message-ID: <1165147639.3233.211.camel@laptopd505.fenrus.org>

On Sat, 2006-12-02 at 16:49 -0600, Steve Wise wrote:

> +
> +static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
> +				    struct ib_ah_attr *ah_attr)
> +{
> +	return ERR_PTR(-ENOSYS);
> +}


-ENOSYS is just about ALWAYS a bug in that it's guaranteed to be the
wrong error code ;)


From mst at mellanox.co.il  Sun Dec  3 04:47:06 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 3 Dec 2006 14:47:06 +0200
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <20061203124623.GA15614@mellanox.co.il>
References: <ada7ixdq0x8.fsf@cisco.com> <20061203124623.GA15614@mellanox.co.il>
Message-ID: <20061203124706.GB15614@mellanox.co.il>

> > Quoting r. Roland Dreier <rdreier at cisco.com>:
> > Subject: [GIT PULL] please pull infiniband.git
> > 
> > Linus, please pull from
> > 
> >     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> > 
> > This tree is also available from kernel.org mirrors at:
> > 
> >     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> > 
> 
> ...
> 
> > 
> >       IB/ucm: Fix deadlock in cleanup
> 
> Can this go into -stable for 2.6.18.x?

Sorry, that should have been 2.6.19.y.

-- 
MST


From mst at mellanox.co.il  Sun Dec  3 04:42:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 3 Dec 2006 14:42:43 +0200
Subject: [openib-general] [CM] what happen if the path in the REQ packet
 (primary or alternate) is not reversible?
In-Reply-To: <456C74C5.5070007@ichips.intel.com>
References: <456C74C5.5070007@ichips.intel.com>
Message-ID: <20061203124243.GE4296@mellanox.co.il>

> The reversible bit needs to be set as well.

Like this, then?

---

SRP must set IB_SA_PATH_REC_REVERSIBLE since that's the only kind of path
CM currently supports.

Untested.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>


diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 4b09147..df98754 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -266,6 +266,7 @@ static void srp_path_rec_completion(int
 static int srp_lookup_path(struct srp_target_port *target)
 {
 	target->path.numb_path = 1;
+	target->path.reversible = 1;
 
 	init_completion(&target->done);
 
@@ -276,6 +277,7 @@ static int srp_lookup_path(struct srp_ta
 						   IB_SA_PATH_REC_DGID		|
 						   IB_SA_PATH_REC_SGID		|
 						   IB_SA_PATH_REC_NUMB_PATH	|
+						   IB_SA_PATH_REC_REVERSIBLE    |
 						   IB_SA_PATH_REC_PKEY,
 						   SRP_PATH_REC_TIMEOUT_MS,
 						   GFP_KERNEL,

-- 
MST


From mst at mellanox.co.il  Sun Dec  3 04:46:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 3 Dec 2006 14:46:23 +0200
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <ada7ixdq0x8.fsf@cisco.com>
References: <ada7ixdq0x8.fsf@cisco.com>
Message-ID: <20061203124623.GA15614@mellanox.co.il>

> Quoting r. Roland Dreier <rdreier at cisco.com>:
> Subject: [GIT PULL] please pull infiniband.git
> 
> Linus, please pull from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> 

...

> 
>       IB/ucm: Fix deadlock in cleanup

Can this go into -stable for 2.6.18.x?

-- 
MST


From tziporet at dev.mellanox.co.il  Sun Dec  3 05:49:55 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 03 Dec 2006 15:49:55 +0200
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
References: <1E3DCD1C63492545881FACB6063A57C16E40C9@mtiexch01.mti.com>
Message-ID: <4572D603.6080101@dev.mellanox.co.il>

Boris Shpolyansky wrote:
> Hi David,
>  
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the 
> OFED-1.1 package as your MPI layer,
> the attached patch should solve your problem.
>  
> Please, let me know if that helped.
>  
> Regards,
>  
Boris,
Please add this to OFED 1.1 support page

Thanks,
Tziporet


From jengelh at linux01.gwdg.de  Sun Dec  3 08:03:35 2006
From: jengelh at linux01.gwdg.de (Jan Engelhardt)
Date: Sun, 3 Dec 2006 17:03:35 +0100 (MET)
Subject: [openib-general] [PATCH v2 02/13] Device Discovery and ULLD
	Linkage
In-Reply-To: <20061202224937.27014.951.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224937.27014.951.stgit@dell3.ogc.int>
Message-ID: <Pine.LNX.4.61.0612031658160.25425@yvahk01.tjqt.qr>

Hi,


Some questions and suggestions:

>+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];

Can it be static'ified? (I suppose not.)

>+struct cxgb3_client t3c_client = {
>+	.name = "iw_cxgb3",
>+	.add = open_rnic_dev,
>+	.remove = close_rnic_dev,
>+	.handlers = t3c_handlers,
>+	.redirect = iwch_ep_redirect
>+};

Can it be const'ified?

>+static void rnic_init(struct iwch_dev *rnicp)
>+{
>+	PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
>+	idr_init(&rnicp->cqidr);
>+	idr_init(&rnicp->qpidr);
>+	idr_init(&rnicp->mmidr);
>+	spin_lock_init(&rnicp->lock);
>+
>+	rnicp->attr.vendor_id = 0x168;
>+	rnicp->attr.vendor_part_id = 7;

Suggestion:

   typeof(rnicp->attr) *a = &rnicp->attr; // replace typeof with proper thing
   a->vendor_id = 0x168;
   a->vendor_part_id = 7;

shortens the lines a bit.
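
A self-contained sketch of the aliasing idea above, assuming a cut-down
attribute struct. The sketch_ types are hypothetical stand-ins for the
driver's own structures, and the max_qps value is a placeholder (the
patch uses T3_MAX_NUM_QP - 32):

```c
#include <stdint.h>

/* Hypothetical stand-in for the attribute block in the quoted patch. */
struct sketch_attr {
	uint32_t vendor_id;
	uint32_t vendor_part_id;
	uint32_t max_qps;
	uint32_t max_rdma_reads_per_qp;
	uint32_t max_rdma_read_resources;
};

struct sketch_dev {
	struct sketch_attr attr;
};

static void sketch_rnic_init(struct sketch_dev *rnicp)
{
	/* One short alias replaces the repeated "rnicp->attr." prefix. */
	struct sketch_attr *a = &rnicp->attr;

	a->vendor_id = 0x168;
	a->vendor_part_id = 7;
	a->max_qps = 1024;	/* placeholder value, see note above */
	a->max_rdma_reads_per_qp = 8;
	a->max_rdma_read_resources = a->max_rdma_reads_per_qp * a->max_qps;
}
```

The two-line assignments in the quoted function then fit on one line each.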

>+	rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
>+	rnicp->attr.max_wrs = (1UL << 24) - 1;
>+	rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
>+	rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
>+	rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
>+	rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
>+	rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev);
>+	rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
>+	rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
>+	rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;	/* 4KB-128MB */
>+	rnicp->attr.can_resize_wq = 0;
>+	rnicp->attr.max_rdma_reads_per_qp = 8;
>+	rnicp->attr.max_rdma_read_resources =
>+	    rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
>+	rnicp->attr.max_rdma_read_qp_depth = 8;	/* IRD */
>+	rnicp->attr.max_rdma_read_depth =
>+	    rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
>+	rnicp->attr.rq_overflow_handled = 0;
>+	rnicp->attr.can_modify_ird = 0;
>+	rnicp->attr.can_modify_ord = 0;
>+	rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
>+	rnicp->attr.stag0_value = 1;
>+	rnicp->attr.zbva_support = 1;
>+	rnicp->attr.local_invalidate_fence = 1;
>+	rnicp->attr.cq_overflow_detection = 1;
>+	return;
>+}
>+
>--- /dev/null
>+++ b/drivers/infiniband/hw/cxgb3/iwch.h
>+static inline int t3b_device(struct iwch_dev *rhp)
>+{
>+	return (rhp->rdev.t3cdev_p->type == T3B);
>+}
>+
>+static inline int t3a_device(struct iwch_dev *rhp)
>+{
>+	return (rhp->rdev.t3cdev_p->type == T3A);
>+}

These two can be constified for sure: static inline int t3a_device(const 
struct iwch_dev *rhp)
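
As a compilable sketch of that change, assuming simplified stand-in types
(all sketch_/SKETCH_ names are hypothetical and only mirror the shape of
the quoted code):

```c
/* Hypothetical stand-ins for the driver structures in the quoted patch. */
enum sketch_t3_type { SKETCH_T3A, SKETCH_T3B };

struct sketch_t3cdev {
	enum sketch_t3_type type;
};

struct sketch_rdev {
	struct sketch_t3cdev *t3cdev_p;
};

struct sketch_iwch_dev {
	struct sketch_rdev rdev;
};

/* Read-only predicates: a pointer to const documents that nothing is modified. */
static inline int sketch_t3b_device(const struct sketch_iwch_dev *rhp)
{
	return rhp->rdev.t3cdev_p->type == SKETCH_T3B;
}

static inline int sketch_t3a_device(const struct sketch_iwch_dev *rhp)
{
	return rhp->rdev.t3cdev_p->type == SKETCH_T3A;
}
```

Callers holding a non-const pointer can still pass it, so the change
should not require touching the call sites.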

>+
>+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid)
>+{
>+	return idr_find(&rhp->cqidr, cqid);
>+}
>+
>+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid)
>+{
>+	return idr_find(&rhp->qpidr, qpid);
>+}
>+
>+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid)
>+{
>+	return idr_find(&rhp->mmidr, mmid);
>+}

Here I am not sure.


	-`J'
-- 


From surs at cse.ohio-state.edu  Sun Dec  3 11:57:41 2006
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sun, 03 Dec 2006 14:57:41 -0500
Subject: [openib-general] RNR_RETRY_EXC_ERR and completion opcode in
 "send_lat"
In-Reply-To: <45726FF8.3000807@dev.mellanox.co.il>
References: <20061202213454.GB31661@cse.ohio-state.edu>
	<45726FF8.3000807@dev.mellanox.co.il>
Message-ID: <45732C35.5060107@cse.ohio-state.edu>

Hi Dotan,

Thanks a lot for this information.

Sayantan.

Dotan Barak wrote:
> Hi Sayantan.
> Sayantan Sur wrote:
>
>> Hi,
>>
>> I have a question about the "status" field for a completion which is due
>> to RNR retry exceeded error. I trivially modified the `send_lat' program
>> (from the Gen2 perftest directory) to use SRQ and not post receives
>> after some specified time. Given the "rnr_retry" attribute of the QP not
>> to be 7 (infinite retry), I'm expecting the sender to get an erroneous
>> completion with IBV_WC_RNR_RETRY_EXC_ERR.
>>
>> So far so good ... however, the completion I pull out of the send_cq,
>> lists the opcode of the completion to be IBV_WC_RECV! Is this expected?
>>
>> I am using OFED 1.1 on dual Intel Xeon machines with Mellanox DDR HCAs
>> (two ports) and in MemFree mode. The distribution used is RH AS4 (Nahant
>> Update 3), with kernel version 2.6.17.7.
>>
>> If someone could explain this behavior, or suggest a workaround, it'd be
>> great.
>>
>> TIA,
>> Sayantan.
>>  
>>
> I took the following text from the man pages that I wrote for
> libibverbs:
> "Not all wc attributes are always valid. If the completion status is
> other than IBV_WC_SUCCESS, only the following attributes are valid:
> wr_id, status, qp_num, and vendor_err."
>
> In other words, the opcode is not valid if you have a completion with
> an error.
>
> Thanks
> Dotan
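
The rule quoted above can be sketched in C as follows. The `*_sketch` types and constants are stand-ins defined here so the example is self-contained; they only mirror the shape of the libibverbs work completion and are not its actual definitions. The point is simply to test the status first and to read the opcode only on success.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in definitions (assumptions), loosely modelled on a work completion. */
enum wc_status_sketch { WC_SUCCESS_SKETCH = 0, WC_RNR_RETRY_EXC_ERR_SKETCH };
enum wc_opcode_sketch { WC_SEND_SKETCH, WC_RECV_SKETCH };

struct wc_sketch {
	uint64_t wr_id;
	enum wc_status_sketch status;
	enum wc_opcode_sketch opcode;	/* meaningful only when status is success */
	uint32_t vendor_err;
	uint32_t qp_num;
};

/*
 * Returns 1 and stores the opcode for a successful completion.
 * Returns 0 for an error completion and leaves *opcode untouched,
 * because the opcode field is not valid in that case.
 */
static int wc_opcode_if_valid(const struct wc_sketch *wc,
			      enum wc_opcode_sketch *opcode)
{
	assert(wc && opcode);
	if (wc->status != WC_SUCCESS_SKETCH)
		return 0;
	*opcode = wc->opcode;
	return 1;
}
```

On an error completion the caller would identify the failed request through wr_id and qp_num rather than through the opcode.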

-- 
http://www.cse.ohio-state.edu/~surs


From krkumar2 at in.ibm.com  Sun Dec  3 19:44:57 2006
From: krkumar2 at in.ibm.com (Krishna Kumar)
Date: Mon, 04 Dec 2006 09:14:57 +0530
Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in
	c2_qp_modify.
Message-ID: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>

vq_req is leaked in error cases.

Signed-off-by: Krishna Kumar <krkumar2 at in.ibm.com>
---
diff -ruNp org/drivers/infiniband/hw/amso1100/c2_qp.c new/drivers/infiniband/hw/amso1100/c2_qp.c
--- org/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-15 12:40:04.000000000 +0530
+++ new/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-16 18:10:03.000000000 +0530
@@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
 
 	if (attr_mask & IB_QP_STATE) {
 		/* Ensure the state is valid */
-		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR)
-			return -EINVAL;
+		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) {
+			err = -EINVAL;
+			goto bail0;
+		}
 
 		wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state));
 
@@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
 		if (attr->cur_qp_state != IB_QPS_RTR &&
 		    attr->cur_qp_state != IB_QPS_RTS &&
 		    attr->cur_qp_state != IB_QPS_SQD &&
-		    attr->cur_qp_state != IB_QPS_SQE)
-			return -EINVAL;
-		else
+		    attr->cur_qp_state != IB_QPS_SQE) {
+			err = -EINVAL;
+			goto bail0;
+		} else
 			wr.next_qp_state =
 			    cpu_to_be32(to_c2_state(attr->cur_qp_state));
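
For readers unfamiliar with the pattern, here is a minimal user-space sketch of the single-exit cleanup this patch moves to. All names are hypothetical, and it assumes (as the patch implies) that the bail0 label releases the request allocated at the top of the function.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_reqs_sketch;	/* outstanding allocations, for leak checking */

static void *req_alloc_sketch(void)
{
	void *p = malloc(16);

	if (p)
		live_reqs_sketch++;
	return p;
}

static void req_free_sketch(void *p)
{
	if (p) {
		free(p);
		live_reqs_sketch--;
	}
}

/* Validates 'state' after allocating a request; every exit frees the request. */
static int modify_sketch(int state, int max_state)
{
	void *req = req_alloc_sketch();
	int err = 0;

	if (!req)
		return -ENOMEM;

	if (state < 0 || state > max_state) {
		err = -EINVAL;
		goto bail0;	/* a bare "return -EINVAL" here would leak req */
	}

	/* ... the real work would go here ... */

bail0:
	req_free_sketch(req);
	return err;
}
```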
 

From mst at mellanox.co.il  Mon Dec  4 00:59:45 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 10:59:45 +0200
Subject: [openib-general] CMA issue: SDP login compliancy
Message-ID: <20061204085945.GC20943@mellanox.co.il>

Hi!
SDP compliance statement *requires* that a consumer checks the
Responder Resources field in the connection Request/Response,
verifying that it is > 0. This is part of CA 4-41 in the spec.

However, the Responder Resources field does not seem to be exposed by the CMA API.  I
think knowing this value (at least in REQ, but preferably in REP as well) is
also important for any ULP that does RDMA reads.

Should/can CMA/UCMA be extended to pass this to the user? This might be
something we need to address before UCMA merge to avoid ABI breakage later.

-- 
MST


From monis at voltaire.com  Mon Dec  4 01:11:33 2006
From: monis at voltaire.com (Moni Shoua)
Date: Mon, 04 Dec 2006 11:11:33 +0200
Subject: [openib-general] [PATCH v2] IB_mthca HCA profile module
	parameters
In-Reply-To: <ada7ixgxjuw.fsf@cisco.com>
References: <Pine.LNX.4.64.0611162037360.31011@ain.voltaire.com>
	<adaejs3qhex.fsf@cisco.com> <456336AC.2070803@voltaire.com>
	<ada7ixgxjuw.fsf@cisco.com>
Message-ID: <4573E645.8050806@voltaire.com>

Roland Dreier wrote:
> OK, getting better, but still not there:
> 
>  > +	if (mthca_is_memfree(mdev)) {
>  > +		mthca_check_profile_and_warn(num_udav,default_profile.num_udav,
>  > +							 MTHCA_DEFAULT_NUM_UDAV);
>  > +		mthca_check_profile_and_warn(fmr_reserved_mtts,default_profile.fmr_reserved_mtts,
>  > +							 MTHCA_DEFAULT_NUM_RESERVED_MTTS);
> 
> fmr_reserved_mtts is actually only used for NON-memfree HCAs.
> 
> Also, coding style is spaces after commas, so you should do
> 
> 	foo(x, y, z);
> 
> instead of
> 
> 	foo(x,y,z);
> 
> I think it will be easier if I just try to fix this patch up myself
> instead of iterating like this...
> 
Hi,
It is fine with me if you make the necessary changes to this patch to make it acceptable.
I would appreciate it if you published it after that so I can learn about the changes you made.
Thanks, and sorry for the delay in replying.

MoniS


From mst at mellanox.co.il  Mon Dec  4 02:01:32 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 12:01:32 +0200
Subject: [openib-general] QUESTION: git-tag howto
In-Reply-To: <OF72F8CE20.BA25CE55-ONC125723A.00360B82-C125723A.00364D98@de.ibm.com>
References: <OF72F8CE20.BA25CE55-ONC125723A.00360B82-C125723A.00364D98@de.ibm.com>
Message-ID: <20061204100132.GE20943@mellanox.co.il>

Good question.  There are different kinds of tags.
For OFED we have used lightweight tags so far.
Linus uses signed tags for Linux releases.

What *should* we use for OFED long term?
Comments welcome.

Summary below.

----


A lightweight tag is just a reference, not a real object.
As such, lightweight tags do not have a name recorded in an object and are
not immutable.

You create them with
git tag <name>

Tag objects have a name and so are immutable.

These are created with
git tag -a <name>
for unsigned tags

and

git tag -s <name>
(or git tag -u <key-id> <name>) for signed tags.


Quoting r. Hoang-Nam Nguyen <HNGUYEN at de.ibm.com>:
Subject: QUESTION: git-tag howto

Hi Michael!
Can you please give me some advice on how to use git-tag? Do I really need
to generate a gpg key for that? It would be great if you could briefly
describe your git-tag procedure.
Thanks!
Nam

-- 
MST


From ogerlitz at voltaire.com  Mon Dec  4 02:55:23 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 04 Dec 2006 12:55:23 +0200
Subject: [openib-general] Local SA caching - why we need it
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C013C66D4@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C013C66D4@orsmsx418.amr.corp.intel.com>
Message-ID: <4573FE9B.20400@voltaire.com>

Woodruff, Robert J wrote:
> This really is not an issue with the Intel MPI connection establishment
> design, rather, any application (or set of applications) that needs to
> establish lots of connections will have the same issue. 

This is an oversimplification: you mentioned that in your testing
the SA scaled to 15K queries/second; let's set that limit to 10K.

**if** your (anyone's) MPI makes sure it does not impose on the SA a 
load of more than 10K queries per second, it would take 100 seconds for 
the SA to provide 1M paths to a 1K process job with 1K paths for each rank.

This somewhat simple design change reduces your job start time from 
today's infinite to 100 seconds plus the time it takes to do all the 
other work (IP2GID resolution && QP create/modify-init-rtr-rts &&
CM exchanges && your-mpi-etcs).

The local-sa shrinks the paths-fetching-time from 100 seconds to zero 
and your startup code time would reduce to the "other" time.
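
The 100-second figure can be checked with a few lines of C. This is only a sketch of the arithmetic, assuming one SA query per path record and a job that keeps the SA exactly at the capped rate; the function name is made up for the illustration.

```c
#include <assert.h>

/*
 * Seconds needed to fetch every path record when the SA is held to a fixed
 * query rate, rounded up. Assumes one query per path record.
 */
static unsigned long path_fetch_seconds(unsigned long ranks,
					unsigned long paths_per_rank,
					unsigned long queries_per_sec)
{
	assert(queries_per_sec > 0);
	unsigned long total_paths = ranks * paths_per_rank;

	return (total_paths + queries_per_sec - 1) / queries_per_sec;
}
```

With 1000 ranks, 1000 paths per rank and a 10000 queries/second cap this gives 100 seconds; at the measured 15000 queries/second it would be 67 seconds.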

So now you are either very happy, or just hit the next roadblock since 
the "other" time is not negligible.

At the devcon and elsewhere on this thread I was trying to say this and 
mention some ideas regarding the next roadblock; I am not sure why you did
not want to hear it.

==================================================

When a closed source product sets requirements on open source software,
its vendor should be willing to discuss some of the actual design and 
implementation of their SW. Specifically, when competing SW products 
(particularly open source ones) are claimed to need the exact or a similar 
set of functionalities, you should be willing to have a discussion. You 
might even get some good advice for free...

The local SA was not developed for Intel MPI needs, but rather in the 
framework of the path-forward project, for future open MPI usage and/or 
other requirements of the labs (routing algorithms/visualization etc).

With this in hand, and given your immediate request to include it in OFED 1.2,
the group here thinks that a local/distributed SA can be quite a good 
solution for the roadblock you are hitting and does not object to 
including it in OFED 1.2 in the non-disturbing form you mentioned.

But, given what seems to be a trend of more MPIs which are now in, or 
soon to be in, a transition towards using the RDMA CM for their 
job start, I would ask to hold off on a kernel push, to first have the 
local SA solution tested and, moreover, to see what problems are seen by 
other MPIs (and yours) when attempting to scale over InfiniBand.

Or.


From johnpol at 2ka.mipt.ru  Mon Dec  4 03:08:26 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Mon, 4 Dec 2006 14:08:26 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061202224958.27014.65970.stgit@dell3.ogc.int>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
Message-ID: <20061204110825.GA26251@2ka.mipt.ru>

On Sat, Dec 02, 2006 at 04:49:58PM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> +static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
> +{
> +	struct cpl_close_con_req *req;
> +	struct sk_buff *skb;
> +
> +	PDBG("%s ep %p\n", __FUNCTION__, ep);
> +	skb = get_skb(NULL, sizeof(*req), gfp);
> +	if (!skb) {
> +		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
> +		return -ENOMEM;
> +	}
> +	skb->priority = CPL_PRIORITY_DATA;
> +	set_arp_failure_handler(skb, arp_failure_discard);
> +	req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req));
> +	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
> +	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
> +	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
> +	l2t_send(ep->com.tdev, skb, ep->l2t);
> +	return 0;
> +}
> +
> +static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
> +{
> +	struct cpl_abort_req *req;
> +
> +	PDBG("%s ep %p\n", __FUNCTION__, ep);
> +	skb = get_skb(skb, sizeof(*req), gfp);
> +	if (!skb) {
> +		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
> +		       __FUNCTION__);
> +		return -ENOMEM;
> +	}
> +	skb->priority = CPL_PRIORITY_DATA;
> +	set_arp_failure_handler(skb, abort_arp_failure);
> +	req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req));
> +	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
> +	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
> +	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
> +	req->cmd = CPL_ABORT_SEND_RST;
> +	l2t_send(ep->com.tdev, skb, ep->l2t);
> +	return 0;
> +}
> +
> +static int send_connect(struct iwch_ep *ep)
> +{
> +	struct cpl_act_open_req *req;
> +	struct sk_buff *skb;
> +	u32 opt0h, opt0l, opt2;
> +	unsigned int mtu_idx;
> +	int wscale;
> +
> +	PDBG("%s ep %p\n", __FUNCTION__, ep);
> +
> +	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
> +	if (!skb) {
> +		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
> +		       __FUNCTION__);
> +		return -ENOMEM;
> +	}
> +	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
> +	wscale = compute_wscale(rcv_win);
> +	opt0h = V_NAGLE(0) |
> +	    V_NO_CONG(nocong) |
> +	    V_KEEP_ALIVE(1) |
> +	    F_TCAM_BYPASS |
> +	    V_WND_SCALE(wscale) |
> +	    V_MSS_IDX(mtu_idx) |
> +	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
> +	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
> +	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
> +	skb->priority = CPL_PRIORITY_SETUP;
> +	set_arp_failure_handler(skb, act_open_req_arp_failure);
> +
> +	req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req));
> +	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
> +	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid));
> +	req->local_port = ep->com.local_addr.sin_port;
> +	req->peer_port = ep->com.remote_addr.sin_port;
> +	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
> +	req->peer_ip = ep->com.remote_addr.sin_addr.s_addr;
> +	req->opt0h = htonl(opt0h);
> +	req->opt0l = htonl(opt0l);
> +	req->params = 0;
> +	req->opt2 = htonl(opt2);
> +	l2t_send(ep->com.tdev, skb, ep->l2t);
> +	return 0;
> +}

...

> +static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
> +{
> +	struct iwch_ep *ep = ctx;
> +	struct cpl_act_establish *req = cplhdr(skb);
> +	unsigned int tid = GET_TID(req);
> +
> +	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid);
> +
> +	dst_confirm(ep->dst);
> +
> +	/* setup the hwtid for this connection */
> +	ep->hwtid = tid;
> +	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
> +
> +	ep->snd_seq = ntohl(req->snd_isn);
> +
> +	set_emss(ep, ntohs(req->tcp_opt));
> +
> +	/* dealloc the atid */
> +	cxgb3_free_atid(ep->com.tdev, ep->atid);
> +
> +	/* start MPA negotiation */
> +	send_mpa_req(ep, skb);
> +
> +	return 0;
> +}
> +
> +static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb)
> +{
> +	PDBG("%s ep %p\n", __FILE__, ep);
> +	state_set(&ep->com, ABORTING);
> +	send_abort(ep, skb, GFP_KERNEL);
> +}

Could you convince network core developers that it is not your own TCP
implementation, which will mess with the existing one?

This and a lot of other changes in this driver definitely say you
implement your own stack of protocols on top of infiniband hardware.

-- 
	Evgeniy Polyakov


From mst at mellanox.co.il  Mon Dec  4 04:43:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 14:43:04 +0200
Subject: [openib-general] Local SA caching - why we need it
In-Reply-To: <4FB1BCCAE6CAED44A1DC005B1DE0611913E075@EPEXCH2.qlogic.org>
References: <4FB1BCCAE6CAED44A1DC005B1DE0611913E075@EPEXCH2.qlogic.org>
Message-ID: <20061204124304.GB31314@mellanox.co.il>

> A well designed SA cache/replica can use the assorted InformInfo notices
> from the SM to detect when GIDs come and go and hence properly update
> the relevant subset of its replica.

OK, but it's not clear that keeping a local SA cache per node is the answer.
Specifically it seems to be good for some workloads/topologies, but
worst case numbers do not look good to me:

It seems that (especially at startup time), the number of notifications
sent would be linear with the network size N.
Since all N nodes need to be notified, we get O(N^2) notifications instead of
O(N^2) queries, which does not seem to be a win - especially if you consider
that queries can be on demand while notifications aren't.

For example, a design using SA redirection with O(\sqrt N) SA replicas would give you
O(\sqrt N) notifications per replica and this way we would get O(N \sqrt N)
notifications and O(N \sqrt N) queries.
This would also move the code out from kernel to userspace SA.
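
To put rough numbers on the comparison, here is a small sketch under a simplified model that is an assumption of this example, not something stated above: each of N nodes generates one event, and every cache or replica must be told about every event.

```c
#include <assert.h>

/* Smallest r such that r * r >= n (integer ceiling of the square root). */
static unsigned long isqrt_ceil(unsigned long n)
{
	unsigned long r = 0;

	while (r * r < n)
		r++;
	return r;
}

/* Total notifications when each event is delivered to each cache. */
static unsigned long total_notifications(unsigned long events,
					 unsigned long caches)
{
	return events * caches;
}
```

For N = 1024 nodes, a cache on every node gives 1024 * 1024 = 1048576 notifications, while 32 replicas (the ceiling of sqrt(1024)) give 1024 * 32 = 32768.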

I also note that the local_sa design from here:
https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/core/local_sa.c
does not seem to implement any InformInfo notices, which seems to guarantee
query storms with low cache timeout values, and connection timeouts on
topology changes with high cache timeout values.

-- 
MST


From mst at mellanox.co.il  Mon Dec  4 06:22:14 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 16:22:14 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
Message-ID: <20061204142214.GA5426@mellanox.co.il>

OK, I got back to this finally. First, I reproduced the crash again,
with spinlock debugger enabled. It seems we are looking at some use-after-free.
Next, I'll try adding the debugging patch Sean posted, and see what this gives.

> When running the test ib_mcast_full, both of the hosts (10.4.10.136-137) got a kernel oops (see below).
> This test first restarts the driver, and after that it attaches to the maximum available number of multicast groups.

BUG: spinlock bad magic on CPU#1, ib_mad2/15709
Unable to handle kernel paging request at 00000001003e0107 RIP:
<ffffffff802e0280>{spin_bug+116}
PGD 75f1a067 PUD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: mst_pciconf mst_pci ib_mthca ib_umad ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery ac ohci_hcd i2c_amd8111 i2c_amd756 i2c_core tg3 floppy ext3 jbd
Pid: 15709, comm: ib_mad2 Not tainted 2.6.17.9-smp #1
RIP: 0010:[<ffffffff802e0280>] <ffffffff802e0280>{spin_bug+116}
RSP: 0018:ffff810043b43ca8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 00000001003e0003 RCX: ffffffff80438e07
RDX: ffffffff80480e18 RSI: 0000000000000046 RDI: ffffffff80480e00
RBP: ffff81007a1cd840 R08: 00000000ffffffff R09: 0000000000000004
R10: 0000000100000000 R11: 0000000000000046 R12: ffff81007a1cd838
R13: 0000000000000293 R14: 0000000000000000 R15: ffffffff880732ce
FS:  00002b76b6c7e4e0(0000) GS:ffff81007df2eac0(0000) knlGS:00000000f7fd78e0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000001003e0107 CR3: 0000000075d1e000 CR4: 00000000000006e0
Process ib_mad2 (pid: 15709, threadinfo ffff810043b42000, task ffff81007d047880)
Stack: 0000000000000003 ffff81007a1cd840 ffff81007a1cd840 ffffffff802e02cd
       ffff81007cfe9600 ffff81007a1cd840 ffff81007a1cd838 ffffffff8040ee4b
       0000000000000246 ffffffff8807beff
Call Trace: <ffffffff802e02cd>{_raw_spin_lock+28} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
       <ffffffff8807beff>{:ib_sa:release_group+27} <ffffffff8807c903>{:ib_sa:mcast_work_handler+1280}
       <ffffffff802232bc>{find_busiest_group+304} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
       <ffffffff8807b8c3>{:ib_sa:ib_sa_mcmember_rec_callback+64}
       <ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff8040d976>{thread_return+100}
       <ffffffff8807bac4>{:ib_sa:send_handler+74} <ffffffff8807345b>{:ib_mad:timeout_sends+397}
       <ffffffff80238e94>{run_workqueue+161} <ffffffff80238ede>{worker_thread+0}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff80238fe3>{worker_thread+261}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff8023be5f>{kthread+200} <ffffffff8020a6aa>{child_rip+8}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff8023bd97>{kthread+0}
       <ffffffff8020a6a2>{child_rip+0}

Code: 44 8b 83 04 01 00 00 48 8d 8b a0 02 00 00 8b 55 04 41 89 c1
RIP <ffffffff802e0280>{spin_bug+116} RSP <ffff810043b43ca8>
CR2: 00000001003e0107
 <3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace: <ffffffff80221da2>{__might_sleep+190} <ffffffff80236033>{blocking_notifier_call_chain+31}
       <ffffffff8022c29a>{do_exit+34} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
       <ffffffff802ee041>{vgacon_set_cursor_size+51} <ffffffff80410fdf>{do_page_fault+1852}
       <ffffffff880614b7>{:ib_core:ib_ud_header_pack+135} <ffffffff880b1420>{:ib_mthca:build_mlx_header+464}
       <ffffffff880732ce>{:ib_mad:timeout_sends+0} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
       <ffffffff8020a4f1>{error_exit+0} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
       <ffffffff802e0280>{spin_bug+116} <ffffffff802e026d>{spin_bug+97}
       <ffffffff802e02cd>{_raw_spin_lock+28} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
       <ffffffff8807beff>{:ib_sa:release_group+27} <ffffffff8807c903>{:ib_sa:mcast_work_handler+1280}
       <ffffffff802232bc>{find_busiest_group+304} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
       <ffffffff8807b8c3>{:ib_sa:ib_sa_mcmember_rec_callback+64}
       <ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff8040d976>{thread_return+100}
       <ffffffff8807bac4>{:ib_sa:send_handler+74} <ffffffff8807345b>{:ib_mad:timeout_sends+397}
       <ffffffff80238e94>{run_workqueue+161} <ffffffff80238ede>{worker_thread+0}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff80238fe3>{worker_thread+261}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff8023be5f>{kthread+200} <ffffffff8020a6aa>{child_rip+8}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff8023bd97>{kthread+0}
       <ffffffff8020a6a2>{child_rip+0}


Triggering sysrq shows that this was during module removal:

modprobe      D ffff810055b1be48     0 22388      1                4048 (NOTLB)
ffff810055b1be48 0000000155b1bdc0 ffff81007a1c9880 000000000000050e
       000002b14226ffda ffff81007c1888c0 0000000000000001 0000000155b1be88
       ffff81007a1c9880 0000000000001707
Call Trace: <ffffffff8040dae3>{wait_for_completion+229}
       <ffffffff8040daa3>{wait_for_completion+165} <ffffffff80223e8f>{default_wake_function+0}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8807cefe>{:ib_sa:mcast_cleanup+25}
       <ffffffff8807cf12>{:ib_sa:ib_sa_cleanup+6} <ffffffff80241141>{sys_delete_module+411}
       <ffffffff802dce63>{__up_write+20} <ffffffff8025c501>{sys_munmap+91}
       <ffffffff8020961a>{system_call+126}


-- 
Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd.


From jlentini at netapp.com  Mon Dec  4 06:50:09 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 4 Dec 2006 09:50:09 -0500 (EST)
Subject: [openib-general] NFS/RDMA for Linux: client and server update
 release 7
In-Reply-To: <EXNANE01tSgcOrBLSAQ00000180@exnane01.hq.netapp.com>
References: <EXNANE012LTrpwACkWH0000017e@exnane01.hq.netapp.com>
	<aday7prky6u.fsf@cisco.com>
	<EXNANE01tSgcOrBLSAQ00000180@exnane01.hq.netapp.com>
Message-ID: <Pine.LNX.4.64.0612040947530.20062@jlentini-linux.nane.netapp.com>


At 06:12 PM 12/1/2006, Roland Dreier wrote:
> What is the status of moving this code towards merging to the 
> upstream kernel?

I covered this last month as part of my OFA Summit presentation. The 
slides are available here:

http://openfabrics.org/conference/nov2006sc/ofa_summit_nfs_rdma.pdf


From halr at voltaire.com  Mon Dec  4 06:50:04 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Dec 2006 09:50:04 -0500
Subject: [openib-general] IPoIB and MC Group leaving
Message-ID: <1165243803.25587.5906.camel@hal.voltaire.com>

Roland,

Currently, the IPoIB code issues what I would term a "preemptive" leave
to the SA in a number of cases:

ulp/ipoib/ipoib_multicast.c:ipoib_mcast_leave
...
        /*
         * Just make one shot at leaving and don't wait for a reply;
         * if we fail, too bad.
         */
        ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec,
                                        IB_SA_MCMEMBER_REC_MGID         |
                                        IB_SA_MCMEMBER_REC_PORT_GID     |
                                        IB_SA_MCMEMBER_REC_PKEY         |
                                        IB_SA_MCMEMBER_REC_JOIN_STATE,
                                        0, GFP_ATOMIC, NULL,
                                        mcast, &mcast->query);

This is to make sure node is not registered in any groups. This leave
may not be successful. Failure is "normal" when the subnet is starting
up "fresh". There are other cases where the failure is indeed a failure.

However, it is "unsafe" to issue a subsequent join until the leave has
been responded to as that is the only "reliability" guarantee that the
SA has received the request and processed it. I know the comment says
that the result of the leave is irrelevant. However, the fact that it
has been processed or not is needed for the subsequent (related) join to
be issued. Pipelining of joins/leaves can only occur if they are
unrelated. I'm not sure the IBA spec is clear on this. Am I wrong about
this ? 
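
If the ordering concern above holds, one way to serialize a related leave and join is to defer the join until the leave has been answered. The following is only a sketch with hypothetical names; it is not the IPoIB code and it does not use the ib_sa API.

```c
#include <assert.h>

enum mc_state_sketch { MC_IDLE, MC_LEAVE_PENDING, MC_JOIN_PENDING };

/* Hypothetical per-group bookkeeping. */
struct mc_group_sketch {
	enum mc_state_sketch state;
	int join_deferred;	/* a join was requested while a leave was in flight */
	int joins_sent;
	int leaves_sent;
};

static void mc_leave(struct mc_group_sketch *g)
{
	assert(g);
	g->leaves_sent++;
	g->state = MC_LEAVE_PENDING;
}

static void mc_join(struct mc_group_sketch *g)
{
	assert(g);
	if (g->state == MC_LEAVE_PENDING) {
		/* Related leave not answered yet: remember the join, send nothing. */
		g->join_deferred = 1;
		return;
	}
	g->joins_sent++;
	g->state = MC_JOIN_PENDING;
}

/* Called when the SA has answered the leave, whether it succeeded or failed. */
static void mc_leave_done(struct mc_group_sketch *g)
{
	assert(g && g->state == MC_LEAVE_PENDING);
	g->state = MC_IDLE;
	if (g->join_deferred) {
		g->join_deferred = 0;
		mc_join(g);
	}
}
```

The result of the leave is still ignored, in keeping with the comment in the code; only its completion gates the join.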

-- Hal


From eitan at mellanox.co.il  Mon Dec  4 07:02:38 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 04 Dec 2006 17:02:38 +0200
Subject: [openib-general] IPoIB and MC Group leaving
In-Reply-To: <1165243803.25587.5906.camel@hal.voltaire.com>
References: <1165243803.25587.5906.camel@hal.voltaire.com>
Message-ID: <4574388E.8020207@mellanox.co.il>

Actually I do not see the point in leaving all groups and immediately 
joining them again.

Hal Rosenstock wrote:
> Roland,
>
> Currently, the IPoIB code issues what I would term a "preemptive" leave
> to the SA in a number of cases:
>
> ulp/ipoib/ipoib_multicast.c:ipoib_mcast_leave
> ...
>         /*
>          * Just make one shot at leaving and don't wait for a reply;
>          * if we fail, too bad.
>          */
>         ret = ib_sa_mcmember_rec_delete(priv->ca, priv->port, &rec,
>                                         IB_SA_MCMEMBER_REC_MGID         |
>                                         IB_SA_MCMEMBER_REC_PORT_GID     |
>                                         IB_SA_MCMEMBER_REC_PKEY         |
>                                         IB_SA_MCMEMBER_REC_JOIN_STATE,
>                                         0, GFP_ATOMIC, NULL,
>                                         mcast, &mcast->query);
>
> This is to make sure node is not registered in any groups. This leave
> may not be successful. Failure is "normal" when the subnet is starting
> up "fresh". There are other cases where the failure is indeed a failure.
>
> However, it is "unsafe" to issue a subsequent join until the leave has
> been responded to as that is the only "reliability" guarantee that the
> SA has received the request and processed it. I know the comment says
> that the result of the leave is irrelevant. However, the fact that it
> has been processed or not is needed for the subsequent (related) join to
> be issued. Pipelining of joins/leaves can only occur if they are
> unrelated. I'm not sure the IBA spec is clear on this. Am I wrong about
> this ? 
>
> -- Hal


From mst at mellanox.co.il  Mon Dec  4 07:26:24 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 17:26:24 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <20061204142214.GA5426@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
Message-ID: <20061204152624.GA8269@mellanox.co.il>

> OK, I got back to this finally. First, I reproduced the crash again,
> with spinlock debugger enabled. It seems we are looking at some use-after-free.
> Next, I'll try adding the debugging patch Sean posted, and see what this gives.

Sean, Yohad here tried adding your debugging patch and reproduced the crash.
Unfortunately, none of the BUG_ON errors got triggered.
Here's the trace from the last crash:

BUG: spinlock bad magic on CPU#1, ib_mad2/17805
 lock: ffff810079fc4140, .magic: 00000000, .owner: /-32512, .owner_cpu: 2039181760

Call Trace: <ffffffff802e02cd>{_raw_spin_lock+28} <ffffffff8040ee4b>{_spin_lock_irqsave+11}
       <ffffffff8807beff>{:ib_sa:release_group+27} <ffffffff8807c95a>{:ib_sa:mcast_work_handler+1345}
       <ffffffff880724de>{:ib_mad:ib_mad_post_receive_mads+268}
       <ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff880732ce>{:ib_mad:timeout_sends+0}
       <ffffffff8807b8c3>{:ib_sa:ib_sa_mcmember_rec_callback+64}
       <ffffffff8040eef7>{_spin_unlock_irq+7} <ffffffff8040d976>{thread_return+100}
       <ffffffff8807bac4>{:ib_sa:send_handler+74} <ffffffff8807345b>{:ib_mad:timeout_sends+397}
       <ffffffff80238e94>{run_workqueue+161} <ffffffff80238ede>{worker_thread+0}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff80238fe3>{worker_thread+261}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff80223e8f>{default_wake_function+0} <ffffffff8023be88>{keventd_create_kthread+0}
       <ffffffff8023be5f>{kthread+200} <ffffffff8020a6aa>{child_rip+8}
       <ffffffff8023be88>{keventd_create_kthread+0} <ffffffff8023bd97>{kthread+0}
       <ffffffff8020a6a2>{child_rip+0}

-- 
MST


From rdreier at cisco.com  Mon Dec  4 07:45:52 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 07:45:52 -0800
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061204110825.GA26251@2ka.mipt.ru> (Evgeniy Polyakov's
	message of "Mon, 4 Dec 2006 14:08:26 +0300")
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru>
Message-ID: <ada8xhnk6kv.fsf@cisco.com>

 > Could you convince network core developers that it is not your own TCP
 > implementation, which will mess with the existing one?

I'm not qualified to comment on this...

 > This and a lot of other changes in this driver definitely say you
 > implement your own stack of protocols on top of infiniband hardware.

...but I do know this driver is for 10-gig ethernet HW.

 - R.


From rdreier at cisco.com  Mon Dec  4 07:49:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 07:49:20 -0800
Subject: [openib-general] IPoIB and MC Group leaving
In-Reply-To: <1165243803.25587.5906.camel@hal.voltaire.com> (Hal
	Rosenstock's message of "04 Dec 2006 09:50:04 -0500")
References: <1165243803.25587.5906.camel@hal.voltaire.com>
Message-ID: <ada4psbk6f3.fsf@cisco.com>

 > This is to make sure node is not registered in any groups. This leave
 > may not be successful. Failure is "normal" when the subnet is starting
 > up "fresh". There are other cases where the failure is indeed a failure.

As far as I know, IPoIB will not leave a group unless it thinks it has
joined the group.  What is the code path for a "preemptive" leave?

 - R.


From rdreier at cisco.com  Mon Dec  4 07:51:26 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 07:51:26 -0800
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <20061203124623.GA15614@mellanox.co.il> (Michael S.
	Tsirkin's message of "Sun, 3 Dec 2006 14:46:23 +0200")
References: <ada7ixdq0x8.fsf@cisco.com> <20061203124623.GA15614@mellanox.co.il>
Message-ID: <adavekrirr5.fsf@cisco.com>

 > >       IB/ucm: Fix deadlock in cleanup

 > Can this go into -stable for 2.6.18.x?

Yes.  If you can send to stable@ that would be great.


From halr at voltaire.com  Mon Dec  4 08:01:28 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 04 Dec 2006 11:01:28 -0500
Subject: [openib-general] IPoIB and MC Group leaving
In-Reply-To: <ada4psbk6f3.fsf@cisco.com>
References: <1165243803.25587.5906.camel@hal.voltaire.com>
	<ada4psbk6f3.fsf@cisco.com>
Message-ID: <1165248082.25587.8839.camel@hal.voltaire.com>

On Mon, 2006-12-04 at 10:49, Roland Dreier wrote:
>  > This is to make sure node is not registered in any groups. This leave
>  > may not be successful. Failure is "normal" when the subnet is starting
>  > up "fresh". There are other cases where the failure is indeed a failure.
> 
> As far as I know, IPoIB will not leave a group unless it thinks it has
> joined the group.  What is the code path for a "preemptive" leave?

OK, maybe I have that part wrong, but what about the other part:

The fact that a leave doesn't wait for the response before a join is
issued. I think there is a race condition here, perhaps triggered by
client reregistration.

-- Hal

>  - R.


From swise at opengridcomputing.com  Mon Dec  4 08:20:51 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 04 Dec 2006 10:20:51 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <ada8xhnk6kv.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
Message-ID: <1165249251.32724.26.camel@stevo-desktop>

On Mon, 2006-12-04 at 07:45 -0800, Roland Dreier wrote:
>  > Could you convince network core developers that it is not own TCP
>  > implementation which will mess with existing one?
> 
> I'm not qualified to comment on this...
> 

I don't understand your question?

>  > This and a lot of other changes in this driver definitely says you
>  > implement your own stack of protocols on top of infiniband hardware.
> 
> ...but I do know this driver is for 10-gig ethernet HW.
> 

There is no SW TCP stack in this driver.  The HW supports RDMA over
TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
(aka iWARP).  The device is a 10 GbE device, not Infiniband.  The
Ethernet driver, upon which the rdma driver depends, acts both like a
traditional Ethernet NIC for the Linux stack as well as a TCP offload
device for the RDMA driver allowing establishment of RDMA connections.
The Connection Manager (patch 04/13) sends/receives messages from the
Ethernet driver that sets up HW TCP connections for doing RDMA.  While
this is indeed implementing TCP offload, it is _not_ integrating it with
the sockets layer or the Linux stack, and it is not offloading sockets
connections.  It only supports offload connections for the RDMA
driver to do iWARP.  The Ammasso device is another example of this
(drivers/infiniband/hw/amso1100).  Deep iSCSI adapters are another
example of this.


Steve.


From swise at opengridcomputing.com  Mon Dec  4 08:24:34 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 04 Dec 2006 10:24:34 -0600
Subject: [openib-general] [PATCH v2 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <4572194F.8060309@osdl.org>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202231329.GA10719@electric-eye.fr.zoreil.com>
	<4572194F.8060309@osdl.org>
Message-ID: <1165249474.32724.30.camel@stevo-desktop>

> >>     
> >
> > I understood that Stephen expressed some doubts regarding the inclusion
> > of TOE enabled features.
> >
> > Was his point addressed ?
> >
> >   
> 
> My comments were about different hardware.


Just to clarify:  

Stephen is working on the Chelsio T2 HW driver.  

The drivers Divy and I are submitting are for the new Chelsio T3
hardware.  Two drivers are being submitted:  The Ethernet driver
(submitted by Divy) and the RDMA driver (submitted by me) which requires
the Ethernet driver.  The RDMA driver will live in
drivers/infiniband/hw/cxgb3 and the Ethernet driver will live in
drivers/net/cxgb3.


Steve.


From swise at opengridcomputing.com  Mon Dec  4 08:28:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 04 Dec 2006 10:28:25 -0600
Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data
	Structures
In-Reply-To: <1165147639.3233.211.camel@laptopd505.fenrus.org>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224947.27014.59189.stgit@dell3.ogc.int>
	<1165147639.3233.211.camel@laptopd505.fenrus.org>
Message-ID: <1165249706.32724.35.camel@stevo-desktop>

On Sun, 2006-12-03 at 13:07 +0100, Arjan van de Ven wrote:
> On Sat, 2006-12-02 at 16:49 -0600, Steve Wise wrote:
> 
> > +
> > +static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
> > +				    struct ib_ah_attr *ah_attr)
> > +{
> > +	return ERR_PTR(-ENOSYS);
> > +}
> 
> 
> -ENOSYS is just about ALWAYS a bug in that it's guaranteed to be the
> wrong error code ;)

This is a method that is not supported by the iWARP transport.  ENOSYS
indicates this.  I _think_ this is SOP for the InfiniBand subsystem.

Roland, I think at one time we were talking about changing the Core to
better handle this?  Either with attributes/capabilities that the low
level driver can set, or by setting these method ptrs to NULL and having
the core handle it in the wrapper function...
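
A minimal user-space sketch of the second option (NULL method pointers
checked in a core wrapper) might look like the following.  All names
here (sketch_device, sketch_create_ah, ...) are hypothetical stand-ins
and not the real ib_device layout or verbs API, and the wrapper returns
an int rather than an ERR_PTR() to keep the example self-contained.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's structures. */
struct sketch_ah { int id; };

struct sketch_device {
	/* NULL means "this provider does not implement the method". */
	struct sketch_ah *(*create_ah)(struct sketch_device *dev);
};

/*
 * Core-side wrapper: the provider no longer needs a stub that returns
 * -ENOSYS, because the wrapper checks for a missing method itself.
 */
int sketch_create_ah(struct sketch_device *dev, struct sketch_ah **out)
{
	if (dev->create_ah == NULL)
		return -ENOSYS;		/* unsupported by this transport */

	*out = dev->create_ah(dev);
	return *out ? 0 : -ENOMEM;
}

/* An IB-like provider that does implement the method. */
static struct sketch_ah sketch_the_ah = { 1 };

struct sketch_ah *sketch_ib_create_ah(struct sketch_device *dev)
{
	(void)dev;
	return &sketch_the_ah;
}
```

An iWARP-like provider would simply leave create_ah unset, so the choice
of error code lives in one place in the core instead of in every driver.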


Steve.


From rdreier at cisco.com  Mon Dec  4 08:45:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 08:45:30 -0800
Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data
	Structures
In-Reply-To: <1165249706.32724.35.camel@stevo-desktop> (Steve Wise's
	message of "Mon, 04 Dec 2006 10:28:25 -0600")
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224947.27014.59189.stgit@dell3.ogc.int>
	<1165147639.3233.211.camel@laptopd505.fenrus.org>
	<1165249706.32724.35.camel@stevo-desktop>
Message-ID: <adaodqjip91.fsf@cisco.com>

 > Roland, I think at one time we were talking about changing the Core to
 > better handle this?  Either with attributes/capabilities that the low
 > level driver can set, or by set these method ptrs to NULL and the core
 > should handle it in the wrapper function...

Yes, it would make sense to change the midlayer so we have different
sets of mandatory functions for IB and iWARP drivers.  For example,
the iwcm functions probably should be mandatory for iWARP devices, right?
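
As a rough illustration of per-transport mandatory sets, a registration
check could look something like the sketch below.  The structure, the
transport enum and the three method slots are assumptions made up for
the example; the real midlayer has many more methods and a different
layout.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the kernel's ib_device. */
enum sketch_transport { SKETCH_IB, SKETCH_IWARP };

typedef void (*sketch_fn)(void);

struct sketch_device {
	enum sketch_transport transport;
	sketch_fn post_send;	/* assumed mandatory for every device */
	sketch_fn create_ah;	/* assumed mandatory for IB only */
	sketch_fn iwcm_connect;	/* assumed mandatory for iWARP only */
};

/* Placeholder method so a provider has something to point at. */
void sketch_stub(void)
{
}

/* Returns 0 if every method required for dev's transport is set. */
int sketch_check_mandatory(const struct sketch_device *dev)
{
	if (dev->post_send == NULL)
		return -EINVAL;

	switch (dev->transport) {
	case SKETCH_IB:
		return dev->create_ah ? 0 : -EINVAL;
	case SKETCH_IWARP:
		return dev->iwcm_connect ? 0 : -EINVAL;
	}
	return -EINVAL;
}
```

With a check like this, an iWARP driver could leave the address-handle
method unset and still register, while a missing iwcm method would be
rejected up front.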

 - R.


From mst at mellanox.co.il  Mon Dec  4 08:44:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 18:44:48 +0200
Subject: [openib-general] [PATCH -stable] IB/ucm: Fix deadlock in cleanup
In-Reply-To: <20060403154741.GB14808@mellanox.co.il>
References: <20060403154741.GB14808@mellanox.co.il>
Message-ID: <20061204164448.GA15375@mellanox.co.il>

ib_ucm_cleanup_events() holds file_mutex while calling ib_destroy_cm_id().
This can deadlock since ib_destroy_cm_id() flushes event handlers, and
ib_ucm_event_handler() needs file_mutex, too.  Therefore, drop the
file_mutex during the call to ib_destroy_cm_id().

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
Signed-off-by: Roland Dreier <rolandd at cisco.com>
Acked-by: Sean Hefty <sean.hefty at intel.com>

---

Hello, -stable team!
This patch backports commit f469b2626f48829c06e40ac799c1edf62b12048e to 2.6.19.
Please consider it for 2.6.19.y - this fixes a deadlock reproduced here at Mellanox.

diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 1f4f2d2..f15220a 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -161,12 +161,14 @@ static void ib_ucm_cleanup_events(struct ib_ucm_context *ctx)
 				    struct ib_ucm_event, ctx_list);
 		list_del(&uevent->file_list);
 		list_del(&uevent->ctx_list);
+		mutex_unlock(&ctx->file->file_mutex);
 
 		/* clear incoming connections. */
 		if (ib_ucm_new_cm_id(uevent->resp.event))
 			ib_destroy_cm_id(uevent->cm_id);
 
 		kfree(uevent);
+		mutex_lock(&ctx->file->file_mutex);
 	}
 	mutex_unlock(&ctx->file->file_mutex);
 }

-- 
MST
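
The locking pattern in the patch above can be sketched in user space
with pthreads.  This is only an illustration under assumed names
(sketch_file, sketch_destroy, ...), not the ucm code: sketch_destroy()
stands in for ib_destroy_cm_id() and takes the same mutex, the way the
flushed event handler would.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the ucm file/event structures. */
struct sketch_event {
	struct sketch_event *next;
};

struct sketch_file {
	pthread_mutex_t mutex;
	struct sketch_event *events;
	int handled;
};

/* Queue one event; returns 0 on success, -1 on allocation failure. */
int sketch_push(struct sketch_file *f)
{
	struct sketch_event *ev = malloc(sizeof(*ev));

	if (ev == NULL)
		return -1;
	pthread_mutex_lock(&f->mutex);
	ev->next = f->events;
	f->events = ev;
	pthread_mutex_unlock(&f->mutex);
	return 0;
}

/* Needs the mutex itself, so it must not be called with it held. */
void sketch_destroy(struct sketch_file *f)
{
	pthread_mutex_lock(&f->mutex);
	f->handled++;
	pthread_mutex_unlock(&f->mutex);
}

void sketch_cleanup_events(struct sketch_file *f)
{
	pthread_mutex_lock(&f->mutex);
	while (f->events != NULL) {
		struct sketch_event *ev = f->events;

		f->events = ev->next;		/* unlink under the lock */
		pthread_mutex_unlock(&f->mutex);

		sketch_destroy(f);		/* may take the mutex */
		free(ev);

		pthread_mutex_lock(&f->mutex);	/* retake before next pass */
	}
	pthread_mutex_unlock(&f->mutex);
}
```

Because each event is unlinked before the mutex is dropped, no other
path can find it while the destroy call runs.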


From swise at opengridcomputing.com  Mon Dec  4 08:50:48 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 04 Dec 2006 10:50:48 -0600
Subject: [openib-general] [PATCH v2 03/13] Provider Methods and Data
	Structures
In-Reply-To: <adaodqjip91.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224947.27014.59189.stgit@dell3.ogc.int>
	<1165147639.3233.211.camel@laptopd505.fenrus.org>
	<1165249706.32724.35.camel@stevo-desktop> <adaodqjip91.fsf@cisco.com>
Message-ID: <1165251048.32724.37.camel@stevo-desktop>

On Mon, 2006-12-04 at 08:45 -0800, Roland Dreier wrote:
>  > Roland, I think at one time we were talking about changing the Core to
>  > better handle this?  Either with attributes/capabilities that the low
>  > level driver can set, or by set these method ptrs to NULL and the core
>  > should handle it in the wrapper function...
> 
> Yes, it would make sense to change the midlayer so we have different
> sets of mandatory functions for IB and iWARP drivers.  For example,
> the iwcm functions probably should be mandatory for iWARP devices, right?
> 

Yes. The iWARP devices must all support the iwcm methods for sure.


From mst at mellanox.co.il  Mon Dec  4 08:57:12 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 18:57:12 +0200
Subject: [openib-general] CMA issue: SDP login compliancy
In-Reply-To: <20061204085945.GC20943@mellanox.co.il>
References: <20061204085945.GC20943@mellanox.co.il>
Message-ID: <20061204165712.GC15375@mellanox.co.il>

> Subject: CMA issue: SDP login compliancy
> 
> Hi!
> SDP compliance statement *requires* that a consumer checks the
> Responder Resources field in the connection Request/Response,
> verifying that it is > 0. This is part of CA 4-41 in the spec.
> 
> However, the Responder Resources field does not seem to be exposed by the
> CMA API.  I think knowing this value (at least in REQ, but preferably in
> REP as well) is also important for any ULP that does RDMA reads.
> 
> Should/can CMA/UCMA be extended to pass this to the user? This might be
> something we need to address before UCMA merge to avoid ABI breakage later.

Steve, could you please comment on the iWarp side of things?
Does iwarp connection get the number of RDMA read requests remote side
can support during connection setup?
Or is this IB-specific?

-- 
MST


From swise at opengridcomputing.com  Mon Dec  4 09:14:43 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 04 Dec 2006 11:14:43 -0600
Subject: [openib-general] CMA issue: SDP login compliancy
In-Reply-To: <20061204165712.GC15375@mellanox.co.il>
References: <20061204085945.GC20943@mellanox.co.il>
	<20061204165712.GC15375@mellanox.co.il>
Message-ID: <1165252483.32724.44.camel@stevo-desktop>

On Mon, 2006-12-04 at 18:57 +0200, Michael S. Tsirkin wrote:
> > Subject: CMA issue: SDP login compliancy
> > 
> > Hi!
> > SDP compliance statement *requires* that a consumer checks the
> > Responder Resources field in the connection Request/Response,
> > verifying that it is > 0. This is part of CA 4-41 in the spec.
> > 
> > However, the Responder Resources field does not seem to be exposed by the
> > CMA API.  I think knowing this value (at least in REQ, but preferably in
> > REP as well) is also important for any ULP that does RDMA reads.
> > 
> > Should/can CMA/UCMA be extended to pass this to the user? This might be
> > something we need to address before UCMA merge to avoid ABI breakage later.
> 
> Steve, could you please comment on the iWarp side of things?
> Does iwarp connection get the number of RDMA read requests remote side
> can support during connection setup?
> Or is this IB-specific?
> 

I believe Sean's latest CMA patches under consideration for 2.6.20
support this from a CMA perspective. 

See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580

iWARP (MPA protocol) currently doesn't exchange this information across
the wire at connection setup, but there are proposals in the works to
support this (It requires a wire protocol change).  So eventually, iWARP
will provide the remote peer's responder resources in the connection
events.

Steve.


From mst at mellanox.co.il  Mon Dec  4 09:37:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 19:37:35 +0200
Subject: [openib-general] CMA issue: SDP login compliancy
In-Reply-To: <1165252483.32724.44.camel@stevo-desktop>
References: <1165252483.32724.44.camel@stevo-desktop>
Message-ID: <20061204173735.GD15375@mellanox.co.il>

> I believe Sean's latest CMA patches under consideration for 2.6.20
> support this from a CMA perspective. 
> 
> See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580

Right, looks like it's covered there. Good, thanks.


-- 
MST


From mst at mellanox.co.il  Mon Dec  4 09:41:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 19:41:47 +0200
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <adavekrirr5.fsf@cisco.com>
References: <adavekrirr5.fsf@cisco.com>
Message-ID: <20061204174147.GE15375@mellanox.co.il>

>  > >       IB/ucm: Fix deadlock in cleanup
> 
>  > Can this go into -stable for 2.6.18.x?
> 
> Yes.  If you can send to stable@ that would be great.

I sent it for inclusion in 2.6.19.y.
I don't remember exactly what the timeframe for 2.6.18.x is.
Is it still maintained now that 2.6.19 is out?

-- 
MST


From parks at lanl.gov  Mon Dec  4 09:58:47 2006
From: parks at lanl.gov (Parks Fields)
Date: Mon, 04 Dec 2006 10:58:47 -0700
Subject: [openib-general] Nvivia vs Serverworks chip set.
In-Reply-To: <20061204174147.GE15375@mellanox.co.il>
References: <adavekrirr5.fsf@cisco.com> <20061204174147.GE15375@mellanox.co.il>
Message-ID: <7.0.1.0.2.20061204105133.02877c88@lanl.gov>


Hello all,

Has anyone compared the Mellanox MHEA28-XTC card on a motherboard with
the ServerWorks chipset versus one with the Nvidia chipset?
I am most concerned with latency and IPoIB bandwidth. I would also like
to know how a standard RH EL/ES 4.3 2.6.9 kernel compares with a 2.6.17
or 2.6.18 kernel on the above chipsets.

thanks for any insight.

parks


From mshefty at ichips.intel.com  Mon Dec  4 10:05:41 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 04 Dec 2006 10:05:41 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <20061204152624.GA8269@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
Message-ID: <45746375.5010107@ichips.intel.com>

Michael S. Tsirkin wrote:
> Sean, Yohad here tried adding your debugging patch and reproduced the crash.
> Unfortunately, none of the BUG_ON errors got triggered.
> Here's the trace from the last crash:

Okay... this will be difficult.  The state of the groups all appears to be
correct then.  Can you provide more information on your setup?  I cannot
reproduce this with my systems at all, even with kernel debugging enabled.  Do
you apply any patches to ipoib beyond the patches that I posted?

- Sean


From mshefty at ichips.intel.com  Mon Dec  4 10:08:18 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 04 Dec 2006 10:08:18 -0800
Subject: [openib-general] IPoIB and MC Group leaving
In-Reply-To: <1165248082.25587.8839.camel@hal.voltaire.com>
References: <1165243803.25587.5906.camel@hal.voltaire.com>
	<ada4psbk6f3.fsf@cisco.com>
	<1165248082.25587.8839.camel@hal.voltaire.com>
Message-ID: <45746412.6080302@ichips.intel.com>

Hal Rosenstock wrote:
> The fact that a leave doesn't wait for the response and then a join is
> issued. I think there is a race condition here perhaps triggered by
> client reregistration.

Note that the multicast interface changes correct for this.  We only need to
understand why Mellanox sees crashes in that code...

- Sean


From mst at mellanox.co.il  Mon Dec  4 10:16:41 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 4 Dec 2006 20:16:41 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <45746375.5010107@ichips.intel.com>
References: <45746375.5010107@ichips.intel.com>
Message-ID: <20061204181641.GF15375@mellanox.co.il>

> > Sean, Yohad here tried adding your debugging patch and reproduced the crash.
> > Unfortunately, none of the BUG_ON errors got triggered.
> > Here's the trace from the last crash:
> 
> Okay... this will be difficult.  The state of the groups all appear to be 
> correct then.  Can you provide more information on your setup?  I cannot 
> reproduce this with my systems at all, even with kernel debugging enabled.  Do 
> you apply any patches to ipoib beyond the patches that I posted?

This seems to be reproducible with or without additional patches.
Not sure what's different with our setup here.

I think what Yohad was running in this instance was 2.6.17 with ipoib
from 2.6.19 + your patches, but there seems to be no trouble triggering this
on other kernels as well.

-- 
MST


From ralph.campbell at qlogic.com  Mon Dec  4 10:30:44 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Mon, 04 Dec 2006 10:30:44 -0800
Subject: [openib-general] [PATCH v2 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <45728E0F.9020106@voltaire.com>
References: <1164911024.14800.74.camel@brick.pathscale.com>
	<45728E0F.9020106@voltaire.com>
Message-ID: <1165257044.14800.112.camel@brick.pathscale.com>

On Sun, 2006-12-03 at 10:42 +0200, Or Gerlitz wrote:
> Ralph Campbell wrote:
> > This patch implements the interposing DMA mapping functions to allow
> > support for IOMMUs and remove the dependence on phys_to_virt().
> 
> > --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Wed Nov 29 13:55:07 2006 -0800
> > +/**
> > + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> > + * @device: The device for which the dma_addr is to be created
> > + * @cpu_addr: The kernel virtual address
> > + * @size: The size of the region in bytes
> > + * @direction: The direction of the DMA
> > + */
> > +static u64 ipath_dma_map_single(struct ib_device *dev,
> > +			        void *cpu_addr, size_t size,
> > +			        enum dma_data_direction direction)
> > +{
> > +	BUG_ON(!valid_dma_direction(direction));
> > +	return (u64) cpu_addr;
> > +}
> 
> if ipath_dma_map_single is a NO OP
> 
> > +/**
> > + * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU
> > + * @device: The device for which the DMA address was created
> > + * @addr: The DMA address
> > + * @size: The size of the region in bytes
> > + * @dir: The direction of the DMA
> > + */
> > +static void ipath_sync_single_for_cpu(struct ib_device *dev,
> > +				      u64 addr,
> > +				      size_t size,
> > +				      enum dma_data_direction dir)
> > +{
> > +	dma_sync_single_for_cpu(dev->dma_device, addr, size, dir);
> > +}
> 
> then why does ipath_sync_single_for_cpu do something? Am I just pointing
> out a cleanup, or is there something deeper here?
> 
> Or.

Good catch. There is nothing going on here.
The dma_sync_single_* should be NOPs.


From jsquyres at cisco.com  Mon Dec  4 11:00:56 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Mon, 4 Dec 2006 14:00:56 -0500
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
Message-ID: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>

Who controls the DNS for openfabrics.org?  Could we get these names  
created?  Or -- are there any objections to creating / using such names?

Thanks!


On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote:

> The name "staging.openfabrics.org" was really intended to be  
> temporary until the old openfabrics.org was taken offline and  
> replaced with the new one.
>
> My $0.02 is that we should stop using staging.openfabrics.org as  
> soon as possible and create / start using some new names for the  
> server to allow for potential transparent service relocation someday.
>
> Here are some new name suggestions that could be done immediately  
> (with appropriate changes to DNS, apache config, ...and potentially  
> others):
>
>  * git.openfabrics.org: for all git activity
>  * wiki.openfabrics.org: a top-level name for the wiki rather than  
> burying it under several layers of links on the web site
>  * trac.openfabrics.org: if someone creates this name, I volunteer  
> to finally get off my butt and install trac to see if people like it
>
> These are the old names and would need to be changed in DNS only  
> when the old server is taken offline / we're ready to move to the  
> new server:
>
>  * openfabrics.org: redirect to www.openfabrics.org, and for mail  
> traffic
>  * www.openfabrics.org: main web site
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From ralph.campbell at qlogic.com  Mon Dec  4 11:17:24 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Mon, 04 Dec 2006 11:17:24 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <45728B6F.6040905@voltaire.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<45728B6F.6040905@voltaire.com>
Message-ID: <1165259844.14800.134.camel@brick.pathscale.com>

On Sun, 2006-12-03 at 10:31 +0200, Or Gerlitz wrote:
> Ralph Campbell wrote:
> >> On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
> >>> On Thu, 2006-11-30 at 12:10 -0800, Roland Dreier wrote:
> 
> >>>> So what did you change since v1?  How do you deal with fitting 64-bit
> >>>> addresses into an sg list entry that has a 32-bit dma_addr_t?
> 
> > Although the driver compiles on 32-bit kernels, it is unsupported
> > and has never been tested. All known 64-bit systems don't define
> > CONFIG_HIGHMEM.  In spite of previous emails suggesting that
> > page_address() can return NULL without CONFIG_HIGHMEM defined,
> > the code in include/linux/mm.h doesn't allow it (assuming the
> > page pointer is valid and not some random address).
> > I verified this with Andrew Morton.
> 
> Can you provide the quote from include/linux/mm.h of the code that
> disallows it? Looking there, I don't see the enforcement.
> 
> Hmm, your consultation with Andrew Morton was not on this thread... well,
> Christoph Hellwig's comment on the V1 thread tells a different story:
> 
> Only for GFP_KERNEL allocations you can assume page_address is valid, 
> and the scatterlist passed to a SCSI LLDD can contain any type of pages. 
>   Currently on all 64bit architectures page_address works on all pages, 
> but that's an implementation detail that could change any time and that 
> you should not rely on.
> 
> see http://www.mail-archive.com/openib-general at openib.org/msg27132.html
> 
> As I have mentioned in the past, this (no kvaddr for a page) comes into
> play when a SCSI LLD (e.g. iSER, SRP) gets DIRECT I/O or AIO (SDP) pages
> from user space.
> 
> Or.

I appreciate your pointing out the potential problems.  I agree that
future kernel changes could certainly break existing drivers.  That
happens frequently even when following the guarantees.

I still don't understand how a valid struct page * (regardless of
whether it is mapped into user space or not) can not have a valid
kernel address when CONFIG_HIGHMEM is not defined for the current
source base.  In include/linux/mm.h, page_address() is defined as
lowmem_page_address() which is defined as
	__va(page_to_pfn(page) << PAGE_SHIFT)
which can only fail if there isn't a valid PFN for the page.
I don't see how that can happen.
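
To make the arithmetic concrete, here is a small user-space model of
that expression.  It assumes a flat mem_map array and an arbitrary
direct-map base; the names and the SKETCH_PAGE_OFFSET value are made up
for the illustration and are not taken from include/linux/mm.h.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
/* Assumed base of the kernel direct mapping; the value is arbitrary. */
#define SKETCH_PAGE_OFFSET	UINT64_C(0xffff810000000000)

struct sketch_page {
	unsigned long flags;
};

/* Flat model: one struct page per page frame, indexed by PFN. */
struct sketch_page sketch_mem_map[16];

uint64_t sketch_page_to_pfn(const struct sketch_page *page)
{
	return (uint64_t)(page - sketch_mem_map);
}

/* Models __va(): physical address plus a constant offset. */
uint64_t sketch_va(uint64_t phys)
{
	return phys + SKETCH_PAGE_OFFSET;
}

/* Models __va(page_to_pfn(page) << PAGE_SHIFT). */
uint64_t sketch_lowmem_page_address(const struct sketch_page *page)
{
	return sketch_va(sketch_page_to_pfn(page) << SKETCH_PAGE_SHIFT);
}
```

In this simplified model the result is pure arithmetic on the struct
page pointer, so it has no failure path of its own; whether that
property may be relied on in the future is the open question above.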

If I am wrong, I would like to understand why.

If you have suggestions for fixing these issues,
please let me know.


From krause at cup.hp.com  Mon Dec  4 11:32:22 2006
From: krause at cup.hp.com (Michael Krause)
Date: Mon, 04 Dec 2006 11:32:22 -0800
Subject: [openib-general] CMA issue: SDP login compliancy
In-Reply-To: <1165252483.32724.44.camel@stevo-desktop>
References: <20061204085945.GC20943@mellanox.co.il>
	<20061204165712.GC15375@mellanox.co.il>
	<1165252483.32724.44.camel@stevo-desktop>
Message-ID: <6.2.0.14.2.20061204112936.083de870@esmail.cup.hp.com>

At 09:14 AM 12/4/2006, Steve Wise wrote:
>On Mon, 2006-12-04 at 18:57 +0200, Michael S. Tsirkin wrote:
> > > Subject: CMA issue: SDP login compliancy
> > >
> > > Hi!
> > > SDP compliance statement *requires* that a consumer checks the
> > > Responder Resources field in the connection Request/Response,
> > > verifying that it is > 0. This is part of CA 4-41 in the spec.
> > >
> > > However, the Responder Resources field does not seem to be exposed by
> > > the CMA API.  I think knowing this value (at least in REQ, but
> > > preferably in REP as well) is also important for any ULP that does
> > > RDMA reads.
> > >
> > > Should/can CMA/UCMA be extended to pass this to the user? This might be
> > > something we need to address before UCMA merge to avoid ABI breakage
> > > later.
> >
> > Steve, could you please comment on the iWarp side of things?
> > Does iwarp connection get the number of RDMA read requests remote side
> > can support during connection setup?
> > Or is this IB-specific?
> >
>
>I believe Sean's latest CMA patches under consideration for 2.6.20
>support this from a CMA perspective.
>
>See http://thread.gmane.org/gmane.linux.drivers.openib/33576/focus=33580
>
>iWARP (MPA protocol) currently doesn't exchange this information across
>the wire at connection setup, but there are proposals in the works to
>support this (It requires a wire protocol change).  So eventually, iWARP
>will provide the remote peer's responder resources in the connection
>events.

SDP Hello exchanges the number of SrcAvail for each side of the 
communication in addition to other resource information - this provides the 
RDMA Read Request depth information.  I am not aware of any request to 
modify MPA which just completed last call in November.  The same type of 
information is exchanged during iSCSI login.  The consensus was that, since
each ULP exchanges this information during its initial ULP-level
communication, there was no reason to replicate this within MPA.

Mike 


From boris at mellanox.com  Mon Dec  4 14:30:26 2006
From: boris at mellanox.com (Boris Shpolyansky)
Date: Mon, 4 Dec 2006 14:30:26 -0800
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
Message-ID: <1E3DCD1C63492545881FACB6063A57C16E40E8@mtiexch01.mti.com>

I guess we need all our recent MPI fixes added to the support page.
Pasha should keep track of those, including the one I sent to Sun.

By the way, where exactly is this support page? On our web site?

Boris.

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] 
Sent: Sunday, December 03, 2006 5:50 AM
To: Boris Shpolyansky
Cc: David Costa; openib-general at openib.org; Robert Houk; Anthony
Vinciguerra; Thomas Babbit
Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

Boris Shpolyansky wrote:
> Hi David,
>  
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the
> OFED-1.1 package as your MPI layer,
> the attached patch should solve your problem.
>  
> Please, let me know if that helped.
>  
> Regards,
>  
Boris,
Please add this to OFED 1.1 support page

Thanks,
Tziporet


From rdreier at cisco.com  Mon Dec  4 20:12:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 20:12:28 -0800
Subject: [openib-general] [PATCH/RFC] busted request IRQ for PCIe ipath HCAs
Message-ID: <adafybvgevn.fsf@cisco.com>

I think commit 51f65ebc (fix HT IRQ setting on HT HCAs) busted ipath
on PCIe HCAs, since ipath_irq is set before pci_enable_msi(), which
means it gets some value unrelated to the actual IRQ that is assigned.
I needed the patch below to make 2.6.19 work with my PCIe HCAs.

Bryan/anyone at Qlogic, does this look right?  It worked for me, so if
this is what was intended, I will queue the patch for 2.6.20 and
submit to stable at kernel.org for 2.6.19.x.

 - R.

diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index 6af8968..498b596 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -851,8 +851,8 @@ static int ipath_setup_pe_config(struct
 	int pos, ret;
 
 	dd->ipath_msi_lo = 0;	/* used as a flag during reset processing */
-	dd->ipath_irq = pdev->irq;
 	ret = pci_enable_msi(dd->pcidev);
+	dd->ipath_irq = pdev->irq;
 	if (ret)
 		ipath_dev_err(dd, "pci_enable_msi failed: %d, "
 			      "interrupts may not work\n", ret);


From johnpol at 2ka.mipt.ru  Mon Dec  4 21:07:25 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 08:07:25 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <ada8xhnk6kv.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
Message-ID: <20061205050725.GA26033@2ka.mipt.ru>

On Mon, Dec 04, 2006 at 07:45:52AM -0800, Roland Dreier (rdreier at cisco.com) wrote:
>  > This and a lot of other changes in this driver definitely says you
>  > implement your own stack of protocols on top of infiniband hardware.
> 
> ...but I do know this driver is for 10-gig ethernet HW.

It is for iWARP/RDMA, according to the description.
If it is 10GbE, then why does it parse incoming packet headers and
implement an initial TCP state machine?

>  - R.

-- 
	Evgeniy Polyakov


From johnpol at 2ka.mipt.ru  Mon Dec  4 21:13:57 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 08:13:57 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165249251.32724.26.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
Message-ID: <20061205051356.GA26845@2ka.mipt.ru>

On Mon, Dec 04, 2006 at 10:20:51AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> >  > This and a lot of other changes in this driver definitely says you
> >  > implement your own stack of protocols on top of infiniband hardware.
> > 
> > ...but I do know this driver is for 10-gig ethernet HW.
> > 
> 
> There is no SW TCP stack in this driver.  The HW supports RDMA over
> TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
> (aka iWARP).  The device is a 10 GbE device, not Infiniband.  The
> Ethernet driver, upon which the rdma driver depends, acts both like a
> traditional Ethernet NIC for the Linux stack as well as a TCP offload
> device for the RDMA driver allowing establishment of RDMA connections.
> The Connection Manager (patch 04/13) sends/receives messages from the
> Ethernet driver that sets up HW TCP connections for doing RDMA.  While
> this is indeed implementing TCP offload, it is _not_ integrating it with
> the sockets layer nor the linux stack and offloading sockets
> connections.  Its only supporting offload connections for the RDMA
> driver to do iWARP.   The Ammasso device is another example of this
> (drivers/infiniband/hw/amso1100).  Deep iSCSI adapters are another
> example of this.

So what will happen when an application creates a socket, binds it to
that NIC, and then tries to establish a TCP connection? How will the NIC
decide whether received packets are for the socket or for the internal TCP
state machine handled by that device?

As a side note, are all iWARP devices _required_ to have a very
limited TCP engine implemented in hardware, or is it possible
to work with an external SW stack?
 
> Steve.

-- 
	Evgeniy Polyakov


From rdreier at cisco.com  Mon Dec  4 21:13:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 21:13:59 -0800
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205050725.GA26033@2ka.mipt.ru> (Evgeniy Polyakov's
	message of "Tue, 5 Dec 2006 08:07:25 +0300")
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
Message-ID: <ada3b7uhqlk.fsf@cisco.com>

 > It is for iwarp/rdma from description.

Yes, iWARP on top of 10G ethernet.

 > If it is 10ge, then why does it parse incoming packet headers and
 > implement an initial tcp state machine?

To establish connections to run RDMA over, I guess.  iWARP is RDMA
over TCP.

 - R.


From johnpol at 2ka.mipt.ru  Mon Dec  4 21:16:58 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 08:16:58 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <ada3b7uhqlk.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
Message-ID: <20061205051657.GB26845@2ka.mipt.ru>

On Mon, Dec 04, 2006 at 09:13:59PM -0800, Roland Dreier (rdreier at cisco.com) wrote:
>  > It is for iwarp/rdma from description.
> 
> Yes, iWARP on top of 10G ethernet.
> 
>  > If it is 10ge, then why does it parse incoming packet headers and
>  > implement an initial tcp state machine?
> 
> To establish connections to run RDMA over, I guess.  iWARP is RDMA
> over TCP.

So will each new NIC implement some parts of the TCP stack in its driver?

>  - R.

-- 
	Evgeniy Polyakov


From rdreier at cisco.com  Mon Dec  4 21:27:09 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 04 Dec 2006 21:27:09 -0800
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205051657.GB26845@2ka.mipt.ru> (Evgeniy Polyakov's
	message of "Tue, 5 Dec 2006 08:16:58 +0300")
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
	<20061205051657.GB26845@2ka.mipt.ru>
Message-ID: <aday7pmgbf6.fsf@cisco.com>

 > So will each new NIC implement some parts of the TCP stack in its driver?

I hope not.  The driver we merged (amso1100) did it completely in FW,
with a separate MAC and IP interface for the RDMA connections.  I
think we'd better understand the Chelsio driver pretty well and think it
over carefully before we merge it.

Thanks for pointing this stuff out.

 - R.


From ogerlitz at voltaire.com  Tue Dec  5 02:31:46 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 05 Dec 2006 12:31:46 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA
 mapping functions to allow device drivers to interpose
In-Reply-To: <1165259844.14800.134.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<45728B6F.6040905@voltaire.com>
	<1165259844.14800.134.camel@brick.pathscale.com>
Message-ID: <45754A92.50102@voltaire.com>

Ralph Campbell wrote:

> I appreciate your pointing out the potential problems.  I agree that
> future kernel changes could certainly break existing drivers.  That
> happens frequently even when following the guarantees.

Relying on "an implementation detail that could change any time and that
you should not rely on" is too much for my taste, so it's left to the IB
maintainer to decide whether to push it and to the kernel maintainer
whether to accept it.

While discussing it with the group here, a suggestion was made that a
possible solution for this problem would be, on top of the suggested
change, to call kmap_atomic/kunmap_atomic in the ipath low-level code
before/after you memcpy to/from a page provided to you by the IB
consumer. But I am not sure whether it solves the problem of
ib_dma_map_sg for an sg provided later to the FMR code.

> I still don't understand how a valid struct page * (regardless of
> whether it is mapped into user space or not) can not have a valid
> kernel address when CONFIG_HIGHMEM is not defined for the current
> source base.  In include/linux/mm.h, page_address() is defined as
> lowmem_page_address() which is defined as
>         __va(page_to_pfn(page) << PAGE_SHIFT)
> which can only fail if there isn't a valid PFN for the page.
> I don't see how that can happen.

Looking at the matter again, I agree it cannot fail for low memory with
today's kernel code.
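
The point about lowmem_page_address() can be illustrated with a small
user-space model. This is only a sketch: the demo_* names and both constants
are assumptions made for illustration (the real PAGE_SHIFT and direct-map
base depend on the architecture and kernel configuration). It shows that the
conversion is pure arithmetic on the PFN, with no lookup step that could fail.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants only; real values depend on architecture and config. */
#define DEMO_PAGE_SHIFT  12                            /* assumes 4 KiB pages */
#define DEMO_PAGE_OFFSET UINT64_C(0xffff810000000000)  /* assumed direct-map base */

/* Model of __va(): physical address -> kernel virtual address. */
static uint64_t demo_va(uint64_t phys)
{
    return phys + DEMO_PAGE_OFFSET;
}

/* Model of lowmem_page_address(): plain arithmetic on the PFN. */
uint64_t demo_lowmem_page_address(uint64_t pfn)
{
    /* Keep the demo within a range where the shift and add cannot overflow. */
    assert(pfn < (UINT64_C(1) << 32));
    return demo_va(pfn << DEMO_PAGE_SHIFT);
}
```

In this model, consecutive PFNs map to addresses exactly one page apart, so
any valid PFN yields a valid direct-map address.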

Or.


From Brice.Goglin at ens-lyon.org  Tue Dec  5 02:45:55 2006
From: Brice.Goglin at ens-lyon.org (Brice Goglin)
Date: Tue, 05 Dec 2006 11:45:55 +0100
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165249251.32724.26.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
Message-ID: <45754DE3.1020505@ens-lyon.org>

Steve Wise wrote:
> There is no SW TCP stack in this driver.  The HW supports RDMA over
> TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
> (aka iWARP).  The device is a 10 GbE device, not Infiniband.

Then, I wonder why the driver goes in drivers/infiniband/ :)

Is there really no way to only keep the actual hw infiniband there, move
iwarp/rdma drivers in drivers/net/something/ and the core stuff in
net/something/ ?

Brice


From tziporet at dev.mellanox.co.il  Tue Dec  5 03:54:25 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 05 Dec 2006 13:54:25 +0200
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
Message-ID: <45755DF1.5080208@dev.mellanox.co.il>

Jeff Squyres wrote:
> Who controls the DNS for openfabrics.org?  Could we get these names  
> created?  Or -- are there any objections to creating / using such names?
>
> Thanks!
>
>
>
>   

If I understand correctly Johann from Qlogic is responsible for the 
stage server setting.

Johan - can you drive this?

Thanks,
Tziporet


From tziporet at dev.mellanox.co.il  Tue Dec  5 04:03:14 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 05 Dec 2006 14:03:14 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <45746375.5010107@ichips.intel.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
Message-ID: <45756002.3030806@dev.mellanox.co.il>

Sean Hefty wrote:
>
> Okay... this will be difficult.  The state of the groups all appear to be 
> correct then.  Can you provide more information on your setup?  I cannot 
> reproduce this with my systems at all, even with kernel debugging enabled.  Do 
> you apply any patches to ipoib beyond the patches that I posted?
>
> - Sean
>
>   
Dotan will try to isolate the test that causes this failure and send it
to you, so you can debug it yourself.

Tziporet


From eeb at bartonsoftware.com  Tue Dec  5 04:22:13 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Tue, 5 Dec 2006 12:22:13 GMT
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
Message-ID: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>


Hi,

We'd dearly like some help to understand why we seem to be having
performance issues with OFED.  When we run a lustre network bandwidth
benchmark, we find significant performance degradation on OFED versus
Voltaire...

             Premap (256 RDMA frags)     Map on demand (1 RDMA frag)
             Voltaire  OFED  Ratio       Voltaire  OFED  Ratio 
Writes MB/s  682       567   83 %        577       436   75 %
Reads MB/s   658       554   84 %        555       432   77 %

These tests measure the bandwidth of 1MByte transfers pipelined 8 deep.
All hardware/software was the same, apart from the IB stack and the lustre
network driver.

The architecture of the lustre network drivers for OFED and Voltaire is
almost identical.  Both use RC QPs with the same control message protocol
to set up bulk data transfers using RDMA WRITE.  Control messages use a
credit flow protocol to ensure that they are only sent when buffers are
posted to receive them.  Concurrent transfers over the same QP are
supported so that lustre can pipeline bulk I/O.
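
The credit flow protocol described above can be sketched roughly as follows.
This is a minimal illustration under assumptions, not the lustre network
driver code: the credit_conn structure and all function names are
hypothetical. A sender may transmit a control message only while it holds a
credit, and credits for reposted receive buffers travel back piggybacked on
outgoing messages.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-connection credit state. */
struct credit_conn {
    int send_credits;  /* receive buffers the peer is known to have posted */
    int to_return;     /* buffers we reposted but have not yet reported */
};

void credit_init(struct credit_conn *c, int peer_posted)
{
    assert(peer_posted >= 0);
    c->send_credits = peer_posted;
    c->to_return = 0;
}

/*
 * Try to send one control message.  Returns false when no credit is
 * available, in which case the caller must queue the message.  On success,
 * *piggyback holds the number of credits returned to the peer.
 */
bool credit_try_send(struct credit_conn *c, int *piggyback)
{
    if (c->send_credits == 0)
        return false;
    c->send_credits--;
    *piggyback = c->to_return;
    c->to_return = 0;
    return true;
}

/* Called after a consumed receive buffer has been posted again. */
void credit_buffer_reposted(struct credit_conn *c)
{
    c->to_return++;
}

/* Called when a message arrives carrying credits returned by the peer. */
void credit_on_receive(struct credit_conn *c, int returned)
{
    assert(returned >= 0);
    c->send_credits += returned;
}
```

With two initial credits, two sends succeed and a third is refused until the
peer returns a credit. A real driver would also have to make sure credits can
still be returned when no other traffic is flowing.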

The only difference between the lustre network drivers is that the Voltaire
driver has a single global CQ and the OFED driver has 1 CQ per QP.  However
the measurements above are for a single pair of nodes - in this case both
implementations use a single CQ.

By default, the drivers pre-map all of physical memory so each RDMA
consists of page fragments.  However, we can also compile both drivers to
map on demand using FMR so that RDMA is not fragmented.  The results above
compare both methods and although both drivers perform worse when mapping,
the OFED driver takes the bigger hit.

We'd be delighted if anyone can shed any light or can suggest any steps we
should take to discover the reason.  We're also very willing to provide
assistance if any of the OpenFabrics developers wants to duplicate the
setup.

-- 

                Cheers,
                        Eric


From bugzilla-daemon at openib.org  Tue Dec  5 05:45:21 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue,  5 Dec 2006 05:45:21 -0800 (PST)
Subject: [openib-general] [Bug 306] New: Run IPOIB high availability when
	primary I/F == secondary I/F does not return an error
Message-ID: <20061205134521.559522283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=306

           Summary: Run IPOIB high availability when primary I/F ==
                    secondary I/F does not return an error
           Product: OpenFabrics Linux
           Version: gen2
          Platform: All
        OS/Version: Other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: IPoIB
        AssignedTo: bugzilla at openib.org
        ReportedBy: yohadd at mellanox.co.il


When configuring the IPOIB high availability with primary I/F == secondary I/F,
the high availability script (ipoib_ha.pl) doesn't return an error.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at openib.org  Tue Dec  5 05:56:11 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue,  5 Dec 2006 05:56:11 -0800 (PST)
Subject: [openib-general] [Bug 307] New: Configuring IPOIB HA with invalid
	I/F does not return an error
Message-ID: <20061205135611.30F022283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=307

           Summary: Configuring IPOIB HA with invalid I/F does not return an
                    error
           Product: OpenFabrics Linux
           Version: gen2
          Platform: All
        OS/Version: Other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: IPoIB
        AssignedTo: bugzilla at openib.org
        ReportedBy: yohadd at mellanox.co.il


When configuring IPoIB HA with an invalid I/F, the HA script (ipoib_ha.pl)
reports the wrong I/F, but it continues to run with the wrong configuration
(it does not exit with an error).




From dotanb at dev.mellanox.co.il  Tue Dec  5 06:32:08 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Tue, 05 Dec 2006 16:32:08 +0200
Subject: [openib-general] oops with multicast patches
In-Reply-To: <45756002.3030806@dev.mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
Message-ID: <457582E8.8030705@dev.mellanox.co.il>

Hi Sean.

Tziporet Koren wrote:

>Dotan will try to isolate the test that causes this failure and send it
>to you, so you can debug it yourself.
>
>Tziporet
>  
>
We got a crash on a machine with the following attributes:

*************************************************************
Host Architecture : x86_64
Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 3)
Kernel Version    : 2.6.9-34.ELsmp
GCC Version       : gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
Memory size       : 2055996 kB
HCA ID(s)         : mthca0
HCA model(s)      : 23108
Board(s)          : MT_0030000001
*************************************************************

I attached the test to this email.


This test runs the following scenario:
    restart the driver
    start a user-level application that allocates N multicast groups
    (it runs in the background)
    sleep for a while (to let the application get the mcgs)
    start the SM (in the background)
    sleep for a while
    kill the SM
    wait until the user-level application ends

We do it in a loop for the following values of N: max_mcast - 1,
max_mcast - 2, max_mcast - 3

I executed the following command (only one side is needed):
# ./ib_mcast_full.bs --server

The test needs to be executed when the driver is loaded and opensm
isn't running in the background.
The user level application uses the VL library which can be found in:
    https://openib.org/svn/trunk/contrib/mellanox/ibtp/common/tools/vl

I hope that this will help you ....
Dotan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ib_mcast_full.tar.gz
Type: application/x-gzip
Size: 4407 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061205/ed2c2ab6/attachment.bin>

From swise at opengridcomputing.com  Tue Dec  5 07:02:05 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:02:05 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205050725.GA26033@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
Message-ID: <1165330925.16087.13.camel@stevo-desktop>

On Tue, 2006-12-05 at 08:07 +0300, Evgeniy Polyakov wrote:
> On Mon, Dec 04, 2006 at 07:45:52AM -0800, Roland Dreier (rdreier at cisco.com) wrote:
> >  > This and a lot of other changes in this driver definitely says you
> >  > implement your own stack of protocols on top of infiniband hardware.
> > 
> > ...but I do know this driver is for 10-gig ethernet HW.
> 
> It is for iwarp/rdma from description.
> If it is 10ge, then why does it parse incoming packet headers and
> implement an initial tcp state machine?
> 

It's not implementing the TCP state machine at all. It's implementing the
MPA state machine (see the iWARP internet drafts).  These packets are
TCP payload.  MPA is used to negotiate RDMA mode on a TCP connection.
This entails an exchange of 2 messages on the TCP connection.  Once these
are exchanged and both sides agree, the connection is bound to an RDMA QP
and the connection is moved into RDMA mode.  From that point on, all IO is
done via post_send() and post_recv().
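
The two-message negotiation described above can be sketched as a small state
machine, seen from the side that initiates the exchange. This is an
illustration only: the state names, event names and functions below are
hypothetical and are not taken from the driver or from the MPA draft.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states for the initiating side of an MPA negotiation. */
enum mpa_state {
    MPA_TCP_ESTABLISHED,  /* plain TCP stream, nothing exchanged yet */
    MPA_REQ_SENT,         /* first message (the request) has been sent */
    MPA_RDMA_MODE,        /* both sides agreed; connection bound to a QP */
    MPA_FAILED            /* peer rejected, or an unexpected event arrived */
};

enum mpa_event {
    MPA_EV_SEND_REQ,      /* we send the request */
    MPA_EV_RECV_ACCEPT,   /* the peer's reply accepts RDMA mode */
    MPA_EV_RECV_REJECT    /* the peer's reply rejects RDMA mode */
};

/* Advance the negotiation by one event and return the new state. */
enum mpa_state mpa_step(enum mpa_state s, enum mpa_event ev)
{
    assert(s <= MPA_FAILED);
    switch (s) {
    case MPA_TCP_ESTABLISHED:
        return ev == MPA_EV_SEND_REQ ? MPA_REQ_SENT : MPA_FAILED;
    case MPA_REQ_SENT:
        return ev == MPA_EV_RECV_ACCEPT ? MPA_RDMA_MODE : MPA_FAILED;
    default:
        /* MPA_RDMA_MODE and MPA_FAILED are terminal in this sketch. */
        return s;
    }
}

/* Work requests may only be posted once the connection is in RDMA mode. */
bool mpa_can_post(enum mpa_state s)
{
    return s == MPA_RDMA_MODE;
}
```

Starting from MPA_TCP_ESTABLISHED, the sequence MPA_EV_SEND_REQ followed by
MPA_EV_RECV_ACCEPT reaches MPA_RDMA_MODE; any other sequence ends in
MPA_FAILED.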


Steve. 


From swise at opengridcomputing.com  Tue Dec  5 07:03:35 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:03:35 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <ada3b7uhqlk.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
Message-ID: <1165331015.16087.16.camel@stevo-desktop>

On Mon, 2006-12-04 at 21:13 -0800, Roland Dreier wrote:
>  > It is for iwarp/rdma from description.
> 
> Yes, iWARP on top of 10G ethernet.
> 
>  > If it is 10ge, then why does it parse incoming packet headers and
>  > implement an initial tcp state machine?
> 
> To establish connections to run RDMA over, I guess.  iWARP is RDMA
> over TCP.
> 

The driver uses messages exchanged to and from the HW via the Ethernet
driver to set up TCP connections.  No TCP processing is done in the host.
The hardware does all the TCP processing.


Steve.


From halr at voltaire.com  Tue Dec  5 07:01:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 10:01:36 -0500
Subject: [openib-general] [PATCH 0/2] OpenSM and osmtest: Add support for SA
 InformInfoRecord
Message-ID: <1165330881.25587.66892.camel@hal.voltaire.com>

OpenSM and osmtest: Add support for SA InformInfoRecord

The following patch series adds initial SA InformInfoRecord support into
OpenSM and also adds some tests for this and InformInfo into osmtest.

There will also be subsequent patches for enhancements to the SA
InformInfo and InformInfoRecord support.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>


From swise at opengridcomputing.com  Tue Dec  5 07:07:33 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:07:33 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205051356.GA26845@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
	<20061205051356.GA26845@2ka.mipt.ru>
Message-ID: <1165331253.16087.21.camel@stevo-desktop>

On Tue, 2006-12-05 at 08:13 +0300, Evgeniy Polyakov wrote:
> On Mon, Dec 04, 2006 at 10:20:51AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > >  > This and a lot of other changes in this driver definitely says you
> > >  > implement your own stack of protocols on top of infiniband hardware.
> > > 
> > > ...but I do know this driver is for 10-gig ethernet HW.
> > > 
> > 
> > There is no SW TCP stack in this driver.  The HW supports RDMA over
> > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
> > (aka iWARP).  The device is a 10 GbE device, not Infiniband.  The
> > Ethernet driver, upon which the rdma driver depends, acts both like a
> > traditional Ethernet NIC for the Linux stack as well as a TCP offload
> > device for the RDMA driver allowing establishment of RDMA connections.
> > The Connection Manager (patch 04/13) sends/receives messages from the
> > Ethernet driver that sets up HW TCP connections for doing RDMA.  While
> > this is indeed implementing TCP offload, it is _not_ integrating it with
> > the sockets layer nor the linux stack and offloading sockets
> > connections.  It's only supporting offload connections for the RDMA
> > driver to do iWARP.   The Ammasso device is another example of this
> > (drivers/infiniband/hw/amso1100).  Deep iSCSI adapters are another
> > example of this.
> 
> So what will happen when an application creates a socket, binds it to
> that NIC, and then tries to establish a TCP connection? How will the NIC
> decide that received packets are for the socket and not for the internal
> TCP state machine handled by that device?

The HW knows which TCP connections are offloaded by virtue of the fact
that they were set up via the RDMA subsystem.  Any other TCP traffic (and
all non-TCP traffic) gets passed to the host stack.
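
The receive-side decision described above amounts to a lookup on the
connection 4-tuple. A rough sketch, assuming a simple linear table: the
structure names, the table size and classify() are hypothetical and say
nothing about how the actual hardware stores its connection state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_TCP     6   /* IP protocol number for TCP */
#define MAX_OFFLOADED 8   /* arbitrary table size for this sketch */

struct tcp_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Connections that were set up through the RDMA subsystem. */
struct offload_table {
    struct tcp_tuple conn[MAX_OFFLOADED];
    size_t count;
};

enum rx_path { RX_HOST_STACK, RX_RDMA_ENGINE };

static bool tuple_equal(const struct tcp_tuple *a, const struct tcp_tuple *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Record an offloaded connection; returns false when the table is full. */
bool offload_add(struct offload_table *t, struct tcp_tuple tuple)
{
    assert(t != NULL);
    if (t->count == MAX_OFFLOADED)
        return false;
    t->conn[t->count++] = tuple;
    return true;
}

/* Decide where a received packet goes. */
enum rx_path classify(const struct offload_table *t, uint8_t ip_proto,
                      const struct tcp_tuple *tuple)
{
    if (ip_proto != PROTO_TCP)
        return RX_HOST_STACK;           /* all non-TCP traffic */
    for (size_t i = 0; i < t->count; i++)
        if (tuple_equal(&t->conn[i], tuple))
            return RX_RDMA_ENGINE;      /* offloaded RDMA connection */
    return RX_HOST_STACK;               /* ordinary socket traffic */
}
```

A TCP connection opened through the sockets API never appears in the table,
so its packets always take the RX_HOST_STACK path.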

> 
> As a side note, are all iWARP devices _required_ to have a very
> limited TCP engine implemented in hardware, or is it possible
> to work with an external SW stack?

It is possible, but not very interesting.

One could implement an all-software iWARP stack.  The iWARP protocols
are just TCP payload and _could_ be implemented in user mode on top of a
socket.  However, this isn't very interesting:  the goal of iWARP (and
RDMA for that matter) is to allow direct placement of data into user
memory with 0 copies done by the host CPU, which keeps latency low.

Steve.


From halr at voltaire.com  Tue Dec  5 07:02:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 10:02:43 -0500
Subject: [openib-general] [PATCH 2/2]: osmtest/osmtest.c: Add tests for SA
 InformInfoRecord and InformInfo
Message-ID: <1165330909.25587.66946.camel@hal.voltaire.com>

osmtest/osmtest.c: Add tests for SA InformInfoRecord and InformInfo

The following patch adds some tests for SA InformInfoRecord and
InformInfo into osmtest.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index a21e8ca..b3f2bb4 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -69,6 +69,18 @@
 #define POOL_MIN_ITEMS  64
 #define GUID_ARRAY_SIZE 64
 
+typedef struct _osmtest_inform_info
+{
+  boolean_t subscribe;
+  ib_net32_t qpn;
+} osmtest_inform_info_t;
+
+typedef struct _osmtest_inform_info_rec
+{
+  ib_gid_t subscriber_gid;
+  ib_net16_t subscriber_enum;
+} osmtest_inform_info_rec_t;
+
 typedef enum _osmtest_token_val
 {
     OSMTEST_TOKEN_COMMENT = 0,
@@ -4814,6 +4826,119 @@ osmtest_sminfo_record_request(
   OSM_LOG_EXIT( &p_osmt->log );
   return ( status );
 }
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t
+osmtest_informinfo_request(
+	IN osmtest_t * const p_osmt,
+	IN ib_net16_t attr_id,
+	IN uint8_t method,
+	IN void *p_options,
+	IN OUT osmtest_req_context_t * const p_context )
+{
+  ib_api_status_t status = IB_SUCCESS;
+  osmv_user_query_t user;
+  osmv_query_req_t req;
+  ib_inform_info_t rec;
+  ib_inform_info_record_t record;
+  ib_mad_t *p_mad;
+  osmtest_inform_info_t *p_inform_info_opt;
+  osmtest_inform_info_rec_t *p_inform_info_rec_opt;
+
+  OSM_LOG_ENTER( &p_osmt->log, osmtest_informinfo_request );
+
+  /*
+   * Do a blocking query for these records in the subnet.
+   * The result is returned in the result field of the caller's
+   * context structure.
+   *
+   * The query structures are locals.
+   */
+  memset( &req, 0, sizeof( req ) );
+  memset( &user, 0, sizeof( user ) );
+  memset( &rec, 0, sizeof( rec ) );
+  memset( &record, 0, sizeof( record ) );
+
+  p_context->p_osmt = p_osmt;
+  user.attr_id = attr_id;
+  if (attr_id == IB_MAD_ATTR_INFORM_INFO_RECORD)
+  {
+    user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) );
+    p_inform_info_rec_opt = p_options;
+    if (p_inform_info_rec_opt->subscriber_gid.unicast.prefix != 0 &&
+        p_inform_info_rec_opt->subscriber_gid.unicast.interface_id != 0)
+    {
+       record.subscriber_gid = p_inform_info_rec_opt->subscriber_gid;
+       user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBERGID;
+    }
+    record.subscriber_enum = cl_hton16(p_inform_info_rec_opt->subscriber_enum);
+    user.comp_mask |= IB_IIR_COMPMASK_ENUM;
+    user.p_attr = &record;
+  }
+  else
+  {
+    user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( rec ) >> 3 ) );
+    /* comp mask bits below are for InformInfoRecord rather than InformInfo */
+    /* as currently no comp mask bits defined for InformInfo!!! */
+    user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBE;
+    p_inform_info_opt = p_options;
+    rec.subscribe = p_inform_info_opt->subscribe;
+    if (p_inform_info_opt->qpn)
+    {
+      rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8;
+      user.comp_mask |= IB_IIR_COMPMASK_QPN;
+    }
+    user.p_attr = &rec;
+  }
+  user.method = method;
+
+  req.query_type = OSMV_QUERY_USER_DEFINED;
+  req.timeout_ms = p_osmt->opt.transaction_timeout;
+  req.retry_cnt = p_osmt->opt.retry_count;
+
+  req.flags = OSM_SA_FLAGS_SYNC;
+  req.query_context = p_context;
+  req.pfn_query_cb = osmtest_query_res_cb;
+  req.p_query_input = &user;
+  req.sm_key = 0;
+
+  status = osmv_query_sa( p_osmt->h_bind, &req );
+  if( status != IB_SUCCESS )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_informinfo_request: ERR 008E: "
+             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    goto Exit;
+  }
+
+  status = p_context->result.status;
+
+  if( status != IB_SUCCESS )
+  {
+    if (status != IB_INVALID_PARAMETER)
+    {
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_informinfo_request: ERR 008F: "
+               "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    }
+    if( status == IB_REMOTE_ERROR )
+    {
+      p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw );
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_informinfo_request: "
+               "Remote error = %s\n",
+               ib_get_mad_status_str( p_mad ));
+
+      status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK );
+    }
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( &p_osmt->log );
+  return ( status );
+}
 #endif
 
 /**********************************************************************
@@ -5421,6 +5546,8 @@ osmtest_validate_against_db( IN osmtest_
 {
   ib_api_status_t status = IB_SUCCESS;
   ib_gid_t portgid, mgid;
+  osmtest_inform_info_t inform_info_opt;
+  osmtest_inform_info_rec_t inform_info_rec_opt;
 #ifdef VENDOR_RMPP_SUPPORT
   ib_net64_t sm_key;
   ib_net16_t test_lid;
@@ -5684,6 +5811,121 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  /* InformInfoRecord tests */
+  memset( &inform_info_opt, 0, sizeof( inform_info_opt ) );
+  memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+				       IB_MAD_METHOD_SET, &inform_info_rec_opt, &context );
+  if ( status == IB_SUCCESS )
+    goto Exit;
+  else
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_informinfo_request: InformInfoRecord "
+             "IS EXPECTED ERROR ^^^^\n");
+  }
+
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+				       IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* InformInfo tests */
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_GET, &inform_info_opt, &context );
+  if ( status == IB_SUCCESS )
+    goto Exit;
+  else
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_informinfo_request: InformInfo "
+             "IS EXPECTED ERROR ^^^^\n");
+  }
+
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status == IB_SUCCESS )
+    goto Exit;
+  else
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_informinfo_request: InformInfo UnSubscribe "
+             "IS EXPECTED ERROR ^^^^\n");
+  }
+
+  /* Now subscribe */
+  inform_info_opt.subscribe = TRUE;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Now unsubscribe (QPN needs to be 1 to work) */
+  inform_info_opt.subscribe = FALSE;
+  inform_info_opt.qpn = 1;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Now subscribe again */
+  inform_info_opt.subscribe = TRUE;
+  inform_info_opt.qpn = 1;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Subscribe over existing subscription */
+  inform_info_opt.qpn = 0;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* More InformInfoRecord tests */
+  /* RID lookup */
+  ib_gid_set_default( &inform_info_rec_opt.subscriber_gid,
+                      p_osmt->local_port.port_guid );
+  inform_info_rec_opt.subscriber_enum = 1;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  inform_info_rec_opt.subscriber_enum = 0;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Get all InformInfoRecords */
+  memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE, &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Cleanup subscriptions before further testing */
+  inform_info_opt.subscribe = FALSE;
+  inform_info_opt.qpn = 1;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
   if (lmc != 0)
   {
     test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 );


From halr at voltaire.com  Tue Dec  5 07:02:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 10:02:03 -0500
Subject: [openib-general] [PATCH 1/2] OpenSM: Add support for SA
	InformInfoRecord
Message-ID: <1165330893.25587.66894.camel@hal.voltaire.com>

OpenSM: Add support for SA InformInfoRecord

The following patch adds initial SA InformInfoRecord support into
OpenSM.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h
index 40fec93..0bc8810 100644
--- a/osm/include/opensm/osm_inform.h
+++ b/osm/include/opensm/osm_inform.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -254,6 +254,72 @@ osm_infr_get_by_rid(
 *	Inform Record, osm_infr_construct, osm_infr_destroy
 *********/
 
+/****f* OpenSM: Inform Record/osm_infr_get_by_gid
+* NAME
+*	osm_infr_get_by_gid
+*
+* DESCRIPTION
+*	Find a matching osm_infr_t in the subnet DB by inform_info_record
+*	subscriber GID
+*
+* SYNOPSIS
+*/
+osm_infr_t*
+osm_infr_get_by_gid(
+	IN osm_subn_t	const	*p_subn,
+	IN osm_log_t	*p_log,
+	IN ib_inform_info_record_t* const p_inf_rec );
+/*
+* PARAMETERS
+*	p_subn 
+*		[in] Pointer to the subnet object
+*
+*	p_log
+*		[in] Pointer to the log object
+*
+*	p_inf_rec
+*		[in] Pointer to an inform_info record with the search
+*		     subscriber GID
+*
+* RETURN
+*	The matching osm_infr_t
+* SEE ALSO
+*	Inform Record, osm_infr_construct, osm_infr_destroy
+*********/
+
+/****f* OpenSM: Inform Record/osm_infr_get_by_enum
+* NAME
+*       osm_infr_get_by_enum
+*
+* DESCRIPTION
+*       Find a matching osm_infr_t in the subnet DB by inform_info_record
+*       subscriber enum 
+*
+* SYNOPSIS
+*/
+osm_infr_t*
+osm_infr_get_by_enum(
+	IN osm_subn_t	const	*p_subn,
+	IN osm_log_t	*p_log,
+	IN ib_inform_info_record_t* const p_inf_rec );
+/*
+* PARAMETERS
+*	p_subn 
+*		[in] Pointer to the subnet object
+*
+*	p_log
+*		[in] Pointer to the log object
+*
+*	p_inf_rec
+*		[in] Pointer to an inform_info record with the search
+*		     subscriber enum 
+*
+* RETURN
+*	The matching osm_infr_t
+* SEE ALSO
+*	Inform Record, osm_infr_construct, osm_infr_destroy
+*********/
+
 /****f* OpenSM: Inform Record/osm_infr_get_by_rec
 * NAME
 *	osm_infr_get_by_rec
diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h
index 4439339..73af838 100644
--- a/osm/include/opensm/osm_msgdef.h
+++ b/osm/include/opensm/osm_msgdef.h
@@ -191,6 +191,7 @@ enum
 	OSM_MSG_MAD_VL_ARB,
 	OSM_MSG_MAD_SLVL,
 	OSM_MSG_MAD_GUIDINFO_RECORD,
+	OSM_MSG_MAD_INFORM_INFO_RECORD,
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
 	OSM_MSG_MAD_MULTIPATH_RECORD,
 #endif
diff --git a/osm/include/opensm/osm_sa_informinfo.h b/osm/include/opensm/osm_sa_informinfo.h
index 2e57f43..c22c1eb 100644
--- a/osm/include/opensm/osm_sa_informinfo.h
+++ b/osm/include/opensm/osm_sa_informinfo.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -33,7 +33,6 @@
  *
  */
 
-
 /*
  * Abstract:
  * 	Declaration of osm_infr_rcv_t.
@@ -108,6 +107,7 @@ typedef struct _osm_infr_rcv
 	osm_mad_pool_t		*p_mad_pool;
 	osm_log_t		*p_log;
 	cl_plock_t		*p_lock;
+	cl_qlock_pool_t		pool;
 } osm_infr_rcv_t;
 /*
 * FIELDS
@@ -123,6 +123,10 @@ typedef struct _osm_infr_rcv
 *	p_lock
 *		Pointer to the serializing lock.
 *
+*	pool
+*		Pool of linkable InformInfo Record objects used to
+*		generate the query response.
+*
 * SEE ALSO
 *	InformInfo Receiver object
 *********/
@@ -262,6 +266,34 @@ osm_infr_rcv_process(
 *	InformInfo Receiver
 *********/
 
+/****f* OpenSM: InformInfo Record Receiver/osm_infir_rcv_process
+* NAME
+*	osm_infir_rcv_process
+*
+* DESCRIPTION
+*	Process the InformInfo Record request.
+*
+* SYNOPSIS
+*/
+void
+osm_infir_rcv_process(
+	IN osm_infr_rcv_t*		const p_rcv,
+	IN const osm_madw_t*		const p_madw );
+/*
+* PARAMETERS
+*	p_rcv
+*		[in] Pointer to an osm_infr_rcv_t object.
+*
+*	p_madw
+*		[in] Pointer to the MAD Wrapper containing the MAD
+*		that contains the node's InformInfo Record attribute.
+* NOTES
+*	This function processes a InformInfo Record attribute.
+*
+* SEE ALSO
+*	InformInfo Receiver
+*********/
+
 END_C_DECLS
 
 #endif	/* _OSM_SA_INFR_H_ */
diff --git a/osm/include/opensm/osm_sa_informinfo_ctrl.h b/osm/include/opensm/osm_sa_informinfo_ctrl.h
index 21dd0a7..a14c5b4 100644
--- a/osm/include/opensm/osm_sa_informinfo_ctrl.h
+++ b/osm/include/opensm/osm_sa_informinfo_ctrl.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -103,6 +103,7 @@ typedef struct _osm_infr_rcv_ctrl
   osm_log_t			*p_log;
   cl_dispatcher_t		*p_disp;
   cl_disp_reg_handle_t		h_disp;
+  cl_disp_reg_handle_t		h_disp2;
 } osm_infr_rcv_ctrl_t;
 /*
 * FIELDS
diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 178dba2..92647ef 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -94,7 +94,7 @@ osm_infr_init(
   /* what else do we need in the inform_record ??? */
 
   /* copy the contents of the provided informinfo */
-  memcpy(p_infr,p_infr_rec, sizeof(osm_infr_t));
+  memcpy(p_infr, p_infr_rec, sizeof(osm_infr_t));
 }
 
 /**********************************************************************
@@ -143,6 +143,54 @@ __match_rid_of_inf_rec(
 }
 
 /**********************************************************************
+ * Match an infr by the subscriber GID of the stored inform_info_record
+ **********************************************************************/
+static
+cl_status_t
+__match_gid_of_inf_rec(
+  IN  const cl_list_item_t* const p_list_item,
+  IN  void*                       context )
+{
+  ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t  *)context;
+  osm_infr_t* p_infr = (osm_infr_t*)p_list_item;
+  int32_t count;
+
+  count = memcmp(
+    &p_infr->inform_record,
+    p_infr_rec,
+    sizeof(p_infr_rec->subscriber_gid) );
+
+  if(count == 0)
+    return CL_SUCCESS;
+  else
+    return CL_NOT_FOUND;
+}
+
+/**********************************************************************
+ * Match an infr by the subscriber enum of the stored inform_info_record
+ **********************************************************************/
+static
+cl_status_t
+__match_enum_of_inf_rec(
+  IN  const cl_list_item_t* const p_list_item,
+  IN  void*                       context )
+{
+  ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t  *)context;
+  osm_infr_t* p_infr = (osm_infr_t*)p_list_item;
+  int32_t count;
+
+  count = memcmp(
+    &p_infr->inform_record.subscriber_enum,
+    &p_infr_rec->subscriber_enum,
+    sizeof(p_infr_rec->subscriber_enum) );
+
+  if(count == 0)
+    return CL_SUCCESS;
+  else
+    return CL_NOT_FOUND;
+}
+
+/**********************************************************************
  **********************************************************************/
 osm_infr_t*
 osm_infr_get_by_rid(
@@ -168,6 +216,54 @@ osm_infr_get_by_rid(
 
 /**********************************************************************
  **********************************************************************/
+osm_infr_t*
+osm_infr_get_by_gid(
+  IN osm_subn_t const *p_subn,
+  IN osm_log_t *p_log,
+  IN ib_inform_info_record_t* const p_infr_rec )
+{
+  cl_list_item_t* p_list_item;
+
+  OSM_LOG_ENTER( p_log, osm_infr_get_by_gid );
+
+  p_list_item = cl_qlist_find_from_head(
+    &p_subn->sa_infr_list,
+    __match_gid_of_inf_rec,
+    p_infr_rec );
+
+  if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) )
+    p_list_item = NULL;
+
+  OSM_LOG_EXIT( p_log );
+  return (osm_infr_t*)p_list_item;
+}
+
+/**********************************************************************
+ **********************************************************************/
+osm_infr_t*
+osm_infr_get_by_enum(
+  IN osm_subn_t const *p_subn,
+  IN osm_log_t *p_log,
+  IN ib_inform_info_record_t* const p_infr_rec )
+{
+  cl_list_item_t* p_list_item;
+
+  OSM_LOG_ENTER( p_log, osm_infr_get_by_enum );
+
+  p_list_item = cl_qlist_find_from_head(
+    &p_subn->sa_infr_list,
+    __match_enum_of_inf_rec,
+    p_infr_rec );
+
+  if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) )
+    p_list_item = NULL;
+
+  OSM_LOG_EXIT( p_log );
+  return (osm_infr_t*)p_list_item;
+}
+
+/**********************************************************************
+ **********************************************************************/
 void
 __dump_all_informs(
     IN osm_subn_t  const *p_subn,
diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c
index c979365..2667e49 100644
--- a/osm/opensm/osm_sa_informinfo.c
+++ b/osm/opensm/osm_sa_informinfo.c
@@ -33,7 +33,6 @@
  *
  */
 
-
 /*
  * Abstract:
  *    Implementation of osm_infr_rcv_t.
@@ -67,6 +66,26 @@
 #include <opensm/osm_inform.h>
 #include <opensm/osm_pkey.h>
 
+#define OSM_IIR_RCV_POOL_MIN_SIZE      32
+#define OSM_IIR_RCV_POOL_GROW_SIZE     32
+
+typedef struct _osm_iir_item
+{
+  cl_pool_item_t          pool_item;
+  ib_inform_info_record_t rec;
+} osm_iir_item_t;
+
+typedef struct _osm_iir_search_ctxt
+{
+  const ib_inform_info_record_t*  p_rcvd_rec;
+  ib_net64_t                      comp_mask;
+  cl_qlist_t*                     p_list;
+  ib_gid_t                        subscriber_gid;
+  ib_net16_t                      subscriber_enum;
+  osm_infr_rcv_t*                 p_rcv;
+  osm_physp_t*                    p_req_physp;
+} osm_iir_search_ctxt_t;
+
 /**********************************************************************
  **********************************************************************/
 void
@@ -74,6 +93,7 @@ osm_infr_rcv_construct(
   IN osm_infr_rcv_t* const p_rcv )
 {
   memset( p_rcv, 0, sizeof(*p_rcv) );
+  cl_qlock_pool_construct( &p_rcv->pool );
 }
 
 /**********************************************************************
@@ -85,7 +105,7 @@ osm_infr_rcv_destroy(
   CL_ASSERT( p_rcv );
 
   OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_destroy );
-
+  cl_qlock_pool_destroy( &p_rcv->pool );
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
@@ -112,7 +132,12 @@ osm_infr_rcv_init(
   p_rcv->p_resp = p_resp;
   p_rcv->p_mad_pool = p_mad_pool;
 
-  status = IB_SUCCESS;
+  status = cl_qlock_pool_init( &p_rcv->pool,
+                               OSM_IIR_RCV_POOL_MIN_SIZE,
+                               0,
+                               OSM_IIR_RCV_POOL_GROW_SIZE,
+                               sizeof(osm_iir_item_t),
+                               NULL, NULL, NULL );
 
   OSM_LOG_EXIT( p_rcv->p_log );
   return( status );
@@ -333,6 +358,339 @@ __osm_infr_rcv_respond(
 }
 
 /**********************************************************************
+ **********************************************************************/
+static void
+__osm_sa_inform_info_rec_by_comp_mask(
+  IN osm_infr_rcv_t*       const p_rcv,
+  IN const osm_infr_t*     const p_infr,
+  osm_iir_search_ctxt_t*   const p_ctxt )
+{
+  const ib_inform_info_record_t* p_rcvd_rec = NULL; 
+  ib_net64_t               comp_mask;
+  ib_net64_t               portguid;
+  osm_port_t *             p_subscriber_port;
+  osm_physp_t *            p_subscriber_physp;
+  const osm_physp_t*       p_req_physp;
+  osm_infr_t*              p_infr_rec = NULL;
+  ib_inform_info_record_t  inform_info_rec;
+  osm_iir_item_t*          p_rec_item;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_inform_info_rec_by_comp_mask );
+
+  p_rcvd_rec = p_ctxt->p_rcvd_rec;
+  comp_mask = p_ctxt->comp_mask;
+  p_req_physp = p_ctxt->p_req_physp;
+
+  /* Both subscriber GID and enum specified */
+  if ((comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) &&
+      (comp_mask & IB_IIR_COMPMASK_ENUM))
+  {
+    inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid;
+    inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum;
+    p_infr_rec = osm_infr_get_by_rid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
+    goto Done;
+  }
+
+  if (comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID)
+  {
+    inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid;
+    p_infr_rec = osm_infr_get_by_gid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
+    goto Done;
+  }
+
+  if (comp_mask & IB_IIR_COMPMASK_ENUM)
+  {
+    inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum;
+    p_infr_rec = osm_infr_get_by_enum(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
+    goto Done;
+  }
+
+  /* Implement any other needed search cases */
+
+Done:
+  if (p_infr_rec)
+  {
+    /* Ensure pkey is shared before returning any records */
+    portguid = p_infr_rec->inform_record.subscriber_gid.unicast.interface_id;
+    p_subscriber_port = osm_get_port_by_guid( p_rcv->p_subn, portguid);
+    if ( p_subscriber_port == NULL )
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: "
+               "Invalid subscriber port guid: 0x%016" PRIx64 "\n",
+               cl_ntoh64(portguid) );
+      goto Exit;
+    }
+
+    /* get the subscriber InformInfo physical port */
+    p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port);
+    /* make sure that the requester and subscriber port can access each other 
+       according to the current partitioning. */
+    if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp))
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+               "__osm_sa_inform_info_rec_by_comp_mask: "
+               "requester and subscriber ports don't share pkey\n" );
+      goto Exit;
+    }
+ 
+    p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool );
+    if( p_rec_item == NULL )
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: "
+               "cl_qlock_pool_get failed\n" );
+      goto Exit;
+    }
+
+    memcpy((void *)&p_rec_item->rec, (void *)&p_infr_rec->inform_record, sizeof(ib_inform_info_record_t));
+    cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+  }
+
+Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_sa_inform_info_rec_by_comp_mask_cb(
+  IN cl_list_item_t*       const p_list_item,
+  IN void*                 context )
+{
+  const osm_infr_t* const p_infr = (osm_infr_t *)p_list_item;
+  osm_iir_search_ctxt_t*   const p_ctxt = (osm_iir_search_ctxt_t *)context;
+
+  __osm_sa_inform_info_rec_by_comp_mask( p_ctxt->p_rcv, p_infr, p_ctxt );
+}
+
+/**********************************************************************
+Received a Get(InformInfoRecord) or GetTable(InformInfoRecord) MAD
+**********************************************************************/
+static void
+osm_infr_rcv_process_get_method(
+  IN osm_infr_rcv_t*      const p_rcv,
+  IN const osm_madw_t*    const p_madw )
+{
+  ib_sa_mad_t*            p_rcvd_mad;
+  const ib_inform_info_record_t* p_rcvd_rec;
+  ib_inform_info_record_t* p_resp_rec;
+  cl_qlist_t              rec_list;
+  osm_madw_t*             p_resp_madw;
+  ib_sa_mad_t*            p_resp_sa_mad;
+  uint32_t                num_rec, pre_trim_num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  uint32_t                trim_num_rec;
+#endif
+  uint32_t                i, j;
+  osm_iir_search_ctxt_t   context;
+  osm_iir_item_t*         p_rec_item;
+  ib_api_status_t         status = IB_SUCCESS;
+  osm_physp_t*            p_req_physp;
+
+  OSM_LOG_ENTER( p_rcv->p_log, osm_infr_rcv_process_get_method );
+
+  CL_ASSERT( p_madw );
+  p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw );
+  p_rcvd_rec =
+    (ib_inform_info_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad );
+
+  /* update the requester physical port. */
+  p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log,
+                                          p_rcv->p_subn,
+                                          osm_madw_get_mad_addr_ptr(p_madw) );
+  if (p_req_physp == NULL)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_infr_rcv_process_get_method: ERR 4309: "
+             "Cannot find requester physical port\n" );
+    goto Exit;
+  }
+
+  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+    osm_dump_inform_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG );
+
+  cl_qlist_init( &rec_list );
+
+  context.p_rcvd_rec = p_rcvd_rec;
+  context.p_list = &rec_list;
+  context.comp_mask = p_rcvd_mad->comp_mask;
+  context.subscriber_gid = p_rcvd_rec->subscriber_gid;
+  context.subscriber_enum = p_rcvd_rec->subscriber_enum;
+  context.p_rcv = p_rcv;
+  context.p_req_physp = p_req_physp;
+
+  osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+           "osm_infr_rcv_process_get_method: "
+           "Query Subscriber GID:0x%016" PRIx64 " : 0x%016" PRIx64 "(%02X) Enum:0x%X(%02X)\n",
+           cl_ntoh64(p_rcvd_rec->subscriber_gid.unicast.prefix),
+           cl_ntoh64(p_rcvd_rec->subscriber_gid.unicast.interface_id),
+           (p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) != 0,
+           cl_ntoh16(p_rcvd_rec->subscriber_enum),
+           (p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_ENUM) != 0 );
+
+  /* Only Enum 0 is supported currently!!! */
+  if (((p_rcvd_mad->comp_mask & IB_IIR_COMPMASK_ENUM) == 0) || (p_rcvd_rec->subscriber_enum == 0))
+  {
+    cl_plock_acquire( p_rcv->p_lock );
+
+    cl_qlist_apply_func( &p_rcv->p_subn->sa_infr_list,
+                         __osm_sa_inform_info_rec_by_comp_mask_cb,
+                         &context );
+
+    cl_plock_release( p_rcv->p_lock );
+  }
+  else
+  {
+     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+              "osm_infr_rcv_process_get_method: "
+              "Non-zero Enum is not currently supported\n" );
+  }
+
+  num_rec = cl_qlist_count( &rec_list );
+
+  /*
+   * C15-0.1.30:
+   * If we do a SubnAdmGet and got more than one record it is an error !
+   */
+  if (p_rcvd_mad->method == IB_MAD_METHOD_GET)
+  {
+    if (num_rec == 0)
+    {
+      osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS );
+      goto Exit;
+    }
+    if (num_rec > 1)
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "osm_infr_rcv_process_get_method: ERR 430A: "
+               "More than one record for SubnAdmGet (%u)\n",
+               num_rec );
+      osm_sa_send_error( p_rcv->p_resp, p_madw,
+                         IB_SA_MAD_STATUS_TOO_MANY_RECORDS);
+
+      /* need to set the mem free ... */
+      p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list );
+      while( p_rec_item != (osm_iir_item_t*)cl_qlist_end( &rec_list ) )
+      {
+        cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+        p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list );
+      }
+
+      goto Exit;
+    }
+  }
+
+  pre_trim_num_rec = num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we limit the number of records to a single packet */
+  trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_inform_info_record_t);
+  if (trim_num_rec < num_rec)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
+             "osm_infr_rcv_process_get_method: "
+             "Number of records:%u trimmed to:%u to fit in one MAD\n",
+             num_rec, trim_num_rec );
+    num_rec = trim_num_rec;
+  }
+#endif
+
+  osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+           "osm_infr_rcv_process_get_method: "
+           "Returning %u records\n", num_rec );
+
+  /* 
+   * Get a MAD to reply. Address of Mad is in the received mad_wrapper
+   */
+  p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool,
+                                  p_madw->h_bind,
+                                  num_rec * sizeof(ib_inform_info_record_t) + IB_SA_MAD_HDR_SIZE,
+                                  &p_madw->mad_addr );
+
+  if( !p_resp_madw )
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_infr_rcv_process_get_method: ERR 430B: "
+            "osm_mad_pool_get failed\n" );
+
+    for( i = 0; i < pre_trim_num_rec; i++ )
+    {
+      p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list );
+      cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    }
+
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_NO_RESOURCES );
+
+    goto Exit;
+  }
+
+  p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw );
+
+  /*
+    Copy the MAD header back into the response mad.
+    Set the 'R' bit and the payload length,
+    Then copy all records from the list into the response payload.
+  */
+
+  memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE );
+  p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK;
+  /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
+  p_resp_sa_mad->sm_key = 0;
+  /* Fill in the offset (paylen will be done by the rmpp SAR) */
+  p_resp_sa_mad->attr_offset =
+    ib_get_attr_offset( sizeof(ib_inform_info_record_t) );
+
+  p_resp_rec = (ib_inform_info_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
+
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we support only one packet RMPP - so we will set the first and
+     last flags for gettable */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+  {
+    p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA;
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE;
+  }
+#else
+  /* forcefully define the packet as RMPP one */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE;
+#endif
+
+  for( i = 0; i < pre_trim_num_rec; i++ )
+  {
+    p_rec_item = (osm_iir_item_t*)cl_qlist_remove_head( &rec_list );
+    /* copy only if not trimmed */
+    if (i < num_rec)
+    {
+      *p_resp_rec = p_rec_item->rec;
+      /* clear reserved and pad fields in InformInfoRecord */
+      for (j = 0; j < 6; j++)
+        p_resp_rec->reserved[j] = 0;
+      for (j = 0; j < 4; j++)
+        p_resp_rec->pad[j] = 0;
+    }
+    cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    p_resp_rec++;
+  }
+
+  CL_ASSERT( cl_is_qlist_empty( &rec_list ) );
+
+  status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE );
+  if (status != IB_SUCCESS)
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_infr_rcv_process_get_method: ERR 430C: "
+            "osm_vendor_send status = %s\n",
+            ib_get_err_str(status));
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/*********************************************************************
 Received a Set(InformInfo) MAD
 **********************************************************************/
 static void
@@ -395,6 +753,12 @@ osm_infr_rcv_process_set_method(
     osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID );
     goto Exit;
   }
+osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+         "osm_infr_rcv_process_set_method: "
+         "LID 0x%04X GID 0x%016" PRIx64 " : 0x%016" PRIx64"\n",
+         cl_ntoh16(p_madw->mad_addr.dest_lid),
+         cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.prefix),
+         cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.interface_id));
 
   /*
    * MODIFICATIONS DONE ON INCOMING REQUEST:
@@ -472,7 +836,6 @@ osm_infr_rcv_process_set_method(
 
       /* Add this new osm_infr_t object to subnet object */
       osm_infr_insert_to_db( p_rcv->p_subn, p_rcv->p_log, p_infr );
-
     }
     else
     {
@@ -513,6 +876,8 @@ osm_infr_rcv_process_set_method(
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
+/*********************************************************************
+**********************************************************************/
 void
 osm_infr_rcv_process(
   IN osm_infr_rcv_t*       const p_rcv,
@@ -543,3 +908,37 @@ osm_infr_rcv_process(
  Exit:
   OSM_LOG_EXIT( p_rcv->p_log );
 }
+
+/*********************************************************************
+**********************************************************************/
+void
+osm_infir_rcv_process(
+  IN osm_infr_rcv_t*       const p_rcv,
+  IN const osm_madw_t*     const p_madw )
+{
+  ib_sa_mad_t *p_sa_mad;
+    
+  OSM_LOG_ENTER( p_rcv->p_log, osm_infir_rcv_process );
+
+  CL_ASSERT( p_madw );
+
+  p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw );
+
+  CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_INFORM_INFO_RECORD );
+
+  if ( (p_sa_mad->method != IB_MAD_METHOD_GET) &&
+       (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "osm_infir_rcv_process: "
+             "Unsupported Method (%s)\n",
+             ib_get_sa_method_str( p_sa_mad->method ) );
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR );
+    goto Exit;
+  }
+
+  osm_infr_rcv_process_get_method( p_rcv, p_madw );
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
diff --git a/osm/opensm/osm_sa_informinfo_ctrl.c b/osm/opensm/osm_sa_informinfo_ctrl.c
index 76fc402..1637155 100644
--- a/osm/opensm/osm_sa_informinfo_ctrl.c
+++ b/osm/opensm/osm_sa_informinfo_ctrl.c
@@ -33,7 +33,6 @@
  *
  */
 
-
 /*
  * Abstract:
  *    Implementation of osm_infr_rcv_ctrl_t.
@@ -68,12 +67,25 @@ __osm_infr_rcv_ctrl_disp_callback(
 
 /**********************************************************************
  **********************************************************************/
+static void
+__osm_infir_rcv_ctrl_disp_callback(
+  IN  void *context,
+  IN  void *p_data )
+{
+  /* ignore return status when invoked via the dispatcher */
+  osm_infir_rcv_process( ((osm_infr_rcv_ctrl_t*)context)->p_rcv,
+                         (osm_madw_t*)p_data );
+}
+
+/**********************************************************************
+ **********************************************************************/
 void
 osm_infr_rcv_ctrl_construct(
   IN osm_infr_rcv_ctrl_t* const p_ctrl )
 {
   memset( p_ctrl, 0, sizeof(*p_ctrl) );
   p_ctrl->h_disp = CL_DISP_INVALID_HANDLE;
+  p_ctrl->h_disp2 = CL_DISP_INVALID_HANDLE;
 }
 
 /**********************************************************************
@@ -83,6 +95,7 @@ osm_infr_rcv_ctrl_destroy(
   IN osm_infr_rcv_ctrl_t* const p_ctrl )
 {
   CL_ASSERT( p_ctrl );
+  cl_disp_unregister( p_ctrl->h_disp2 );
   cl_disp_unregister( p_ctrl->h_disp );
 }
 
@@ -119,6 +132,22 @@ osm_infr_rcv_ctrl_init(
     goto Exit;
   }
 
+  p_ctrl->h_disp2 = cl_disp_register(
+    p_disp,
+    OSM_MSG_MAD_INFORM_INFO_RECORD,
+    __osm_infir_rcv_ctrl_disp_callback,
+    p_ctrl );
+
+  if( p_ctrl->h_disp2 == CL_DISP_INVALID_HANDLE )
+  {
+    osm_log( p_log, OSM_LOG_ERROR,
+             "osm_infr_rcv_ctrl_init: ERR 1702: "
+             "Dispatcher registration failed\n" );
+    cl_disp_unregister( p_ctrl->h_disp );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
  Exit:
   OSM_LOG_EXIT( p_log );
   return( status );
diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c
index 56386b1..2605fbf 100644
--- a/osm/opensm/osm_sa_mad_ctrl.c
+++ b/osm/opensm/osm_sa_mad_ctrl.c
@@ -208,6 +208,10 @@ __osm_sa_mad_ctrl_process(
     msg_id = OSM_MSG_MAD_GUIDINFO_RECORD;
     break;
 
+  case IB_MAD_ATTR_INFORM_INFO_RECORD:
+    msg_id = OSM_MSG_MAD_INFORM_INFO_RECORD;
+    break;
+
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
   case IB_MAD_ATTR_MULTIPATH_RECORD:
     msg_id = OSM_MSG_MAD_MULTIPATH_RECORD;
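
Note on the component-mask handling in the query path: the mask selects
which fields of a stored record must match the query. A minimal,
self-contained sketch of that idea follows. It uses simplified stand-in
types and made-up mask bits (toy_rec_t, TOY_MASK_GID and TOY_MASK_ENUM are
hypothetical, not OpenSM definitions), and it applies a per-record
predicate rather than the patch's osm_infr_get_by_* lookups. With an empty
mask every record matches in this sketch, whereas the patch leaves that
case under "Implement any other needed search cases".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; not the real OpenSM types or mask values. */
#define TOY_MASK_GID  0x1u
#define TOY_MASK_ENUM 0x2u

typedef struct toy_rec {
  uint8_t  subscriber_gid[16];
  uint16_t subscriber_enum;
} toy_rec_t;

/* Return 1 if 'stored' satisfies every field selected by comp_mask. */
static int toy_rec_matches(const toy_rec_t *stored, const toy_rec_t *query,
                           uint32_t comp_mask)
{
  if ((comp_mask & TOY_MASK_GID) &&
      memcmp(stored->subscriber_gid, query->subscriber_gid,
             sizeof(stored->subscriber_gid)) != 0)
    return 0;
  if ((comp_mask & TOY_MASK_ENUM) &&
      stored->subscriber_enum != query->subscriber_enum)
    return 0;
  return 1;
}

/* Count the records in 'table' that match the query under comp_mask. */
static size_t toy_count_matches(const toy_rec_t *table, size_t n,
                                const toy_rec_t *query, uint32_t comp_mask)
{
  size_t i, hits = 0;

  assert(table != NULL || n == 0);
  for (i = 0; i < n; i++)
    hits += (size_t)toy_rec_matches(&table[i], query, comp_mask);
  return hits;
}
```

A caller would build the query record, pass the mask bits for the fields it
filled in, and treat more than one hit on a Get (as opposed to GetTable) as
an error, as the patch does.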

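Note on the single-packet limit applied when VENDOR_RMPP_SUPPORT is not
defined: the record count is capped at the number of whole records that fit
in one MAD after the SA header. A rough sketch of that arithmetic is below.
The sizes (256-byte MAD, 56-byte SA header) and the toy_ names are
assumptions for illustration, not values taken from this patch; check the
OpenSM headers for the real constants.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes for illustration only. */
#define TOY_MAD_BLOCK_SIZE   256u
#define TOY_SA_MAD_HDR_SIZE   56u

/* Cap num_rec at the number of whole records that fit in one MAD. */
static uint32_t toy_trim_num_rec(uint32_t num_rec, uint32_t rec_size)
{
  uint32_t max_rec;

  assert(rec_size > 0 &&
         rec_size <= TOY_MAD_BLOCK_SIZE - TOY_SA_MAD_HDR_SIZE);
  max_rec = (TOY_MAD_BLOCK_SIZE - TOY_SA_MAD_HDR_SIZE) / rec_size;
  return num_rec < max_rec ? num_rec : max_rec;
}
```

Under these assumed sizes a 64-byte record allows at most three records per
response; any records beyond the cap still have to be returned to the pool.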

From steve.apo at googlemail.com  Tue Dec  5 07:11:50 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Tue, 5 Dec 2006 15:11:50 +0000
Subject: [openib-general] [CM] ib_cm_send_req() returns -1. What could be
	wrong?
Message-ID: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>

Hi,

In my application I keep getting -1 returned by a call to ib_cm_send_req()
function. The cmpost example application works fine, so I can rule out
system set-up issues.

I could do with a clue as to what the -1 means and then hopefully correct my
application.
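
One way to narrow this down is to capture errno immediately after the
failing call. The sketch below assumes the call follows the common C
convention of returning -1 and setting errno, which this message does not
confirm for ib_cm_send_req(). toy_send_req() is a hypothetical stand-in for
the real call, and toy_describe_failure() is an illustrative helper, not
part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a call that fails with -1 and sets errno. */
static int toy_send_req(int simulate_errno)
{
  if (simulate_errno != 0) {
    errno = simulate_errno;
    return -1;
  }
  return 0;
}

/*
 * Format a one-line diagnostic into buf.
 * Returns the saved errno on failure, or 0 if ret indicates success.
 */
static int toy_describe_failure(const char *what, int ret,
                                char *buf, size_t len)
{
  int saved = errno;  /* save first: later library calls may change errno */

  assert(buf != NULL && len > 0);
  if (ret >= 0) {
    snprintf(buf, len, "%s: ok", what);
    return 0;
  }
  snprintf(buf, len, "%s failed: ret=%d errno=%d (%s)",
           what, ret, saved, strerror(saved));
  return saved;
}
```

Logging the result right at the call site, before any other library call,
keeps the errno value tied to the call that failed.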

Thanks,

Steve.

From tziporet at dev.mellanox.co.il  Tue Dec  5 07:13:09 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 05 Dec 2006 17:13:09 +0200
Subject: [openib-general] OFED release and Sonoma OFA developers workshop
Message-ID: <45758C85.4040903@dev.mellanox.co.il>

Hi Bill,

Since there is no Intel IDF in March 07, and in March we are going to be
in the middle of the OFED 1.2 release, I suggest delaying the developers'
conference to May.

It will also be very good to have the workshop *after* the release, since
it will enable us to understand what went well and what needs to be
improved in the process.

Any thoughts?

Tziporet


From swise at opengridcomputing.com  Tue Dec  5 07:14:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:14:36 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <aday7pmgbf6.fsf@cisco.com>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
	<20061205051657.GB26845@2ka.mipt.ru> <aday7pmgbf6.fsf@cisco.com>
Message-ID: <1165331676.16087.29.camel@stevo-desktop>

On Mon, 2006-12-04 at 21:27 -0800, Roland Dreier wrote:
>  > So will each new NIC implement some parts of TCP stack in theirs drivers?
> 
> I hope not.  The driver we merged (amso1100) did it completely in FW,
> with a separate MAC and IP interface for the RDMA connections.  I
> think we better understand the Chelsio driver pretty well and think it
> over carefully before we merge it.
> 

Chelsio doesn't implement a TCP stack in the driver.  Just like Ammasso,
it sends messages to the HW to set up connections.  It differs from
Ammasso in at least 2 ways:

1) Ammasso does the MPA negotiations in FW/HW.  Chelsio does it in the
RDMA driver.  So there is code in the Chelsio driver to handle MPA
startup negotiation (the exchange of 2 packets over the TCP connection
while it's still in streaming mode).  BTW: This code _could_ be moved
into the core IWCM if we find it could be used by other rnic devices
(don't know yet).

2) Ammasso implements a 100% deep adapter.  It does ARP, routing, IP,
TCP, and IWARP protocols all in firmware/hw.  It had 2 mac addresses
simulating 2 ethernet ports.  One exclusively for RDMA connections, and
one for host stack traffic.  Chelsio implements a shallower adapter that
only does TCP in HW.  ARP, for instance, is handled by the native stack
and the rdma driver uses netevents to maintain arp tables in the HW for
use by the offloaded TCP connections.
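
The MPA startup exchange mentioned in point 1 can be pictured as a small
state machine on the initiating side. This is a hedged sketch with
hypothetical names (toy_mpa_state, toy_mpa_event, toy_mpa_step); it is not
the Chelsio driver's code, and it models only the two-message
request/reply exchange, not the MPA frame contents.

```c
#include <assert.h>

typedef enum {
  TOY_MPA_STREAMING,  /* plain TCP streaming mode, nothing sent yet */
  TOY_MPA_REQ_SENT,   /* initiator has sent its MPA request */
  TOY_MPA_RDMA_MODE,  /* reply accepted; connection leaves streaming mode */
  TOY_MPA_ABORTED     /* reply rejected or event out of order */
} toy_mpa_state;

typedef enum {
  TOY_EV_SEND_REQUEST,
  TOY_EV_RECV_REPLY_ACCEPT,
  TOY_EV_RECV_REPLY_REJECT
} toy_mpa_event;

/* Advance the initiator-side negotiation by one event. */
static toy_mpa_state toy_mpa_step(toy_mpa_state s, toy_mpa_event ev)
{
  switch (s) {
  case TOY_MPA_STREAMING:
    return ev == TOY_EV_SEND_REQUEST ? TOY_MPA_REQ_SENT : TOY_MPA_ABORTED;
  case TOY_MPA_REQ_SENT:
    return ev == TOY_EV_RECV_REPLY_ACCEPT ? TOY_MPA_RDMA_MODE
                                          : TOY_MPA_ABORTED;
  case TOY_MPA_RDMA_MODE:
  case TOY_MPA_ABORTED:
    return s;  /* terminal states ignore further events */
  }
  assert(0 && "unreachable: unknown state");
  return TOY_MPA_ABORTED;
}
```

Keeping the transitions in one pure function like this is also what would
make the logic easy to share if it were ever moved into common code.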

Steve.


From johnpol at 2ka.mipt.ru  Tue Dec  5 07:19:06 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 18:19:06 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165330925.16087.13.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
Message-ID: <20061205151905.GA18275@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 09:02:05AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > >  > This and a lot of other changes in this driver definitely says you
> > >  > implement your own stack of protocols on top of infiniband hardware.
> > > 
> > > ...but I do know this driver is for 10-gig ethernet HW.
> > 
> > It is for iwarp/rdma from description.
> > If it is 10ge, then why does it parse incoming packet headers and
> > implement an initial tcp state machine?
> > 
> 
> It's not implementing the TCP state machine at all. It's implementing the
> MPA state machine (see the iWARP internet drafts).  These packets are
> TCP payload.  MPA is used to negotiate RDMA mode on a TCP connection.
> This entails an exchange of 2 messages on the TCP connection.  Once this
> is exchanged and both sides agree, the connection is bound to an RDMA QP
> and the connection is moved into RDMA mode.  From that point on, all IO
> is done via post_send() and post_recv().

And why does rdma require window scaling, keep alive, nagle and other
interesting options from the TCP spec?

This really looks like an initial implementation of TCP in hardware: you
set up flags much as you would with setsockopt(), and then the hardware
manages the flow the way a network stack manages TCP state machine changes.

According to draft-culley-iwarp-mpa-03.txt this layer can do a lot of
things with a valid TCP flow, like

   5.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
       Sender is MPA-aware, it segments the TCP stream in such a way
       that a TCP Segment boundary is also the boundary of an FPDU.  
       TCP then passes each segment to the IP layer for transmission.
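
To illustrate what the quoted item means by an MPA-aware sender, here is a
minimal sketch that packs whole FPDUs into segments so that every segment
boundary is also an FPDU boundary. The function name and parameters are
hypothetical, and real MPA framing (length field, padding, CRC, markers) is
not modeled; only the alignment rule is shown.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Greedily pack whole FPDUs into segments of at most 'mss' bytes.
 * fpdu_len[i] is the on-wire length of FPDU i; seg_len receives the byte
 * length of each resulting segment. Returns the number of segments, or 0
 * if an FPDU is empty or larger than mss, or if seg_len is too small.
 */
static size_t toy_fpdu_aligned_segments(const size_t *fpdu_len, size_t n_fpdu,
                                        size_t mss,
                                        size_t *seg_len, size_t max_seg)
{
  size_t i, n_seg = 0, cur = 0;

  assert(fpdu_len != NULL && seg_len != NULL);
  for (i = 0; i < n_fpdu; i++) {
    if (fpdu_len[i] == 0 || fpdu_len[i] > mss)
      return 0;                 /* cannot keep this FPDU in one segment */
    if (cur + fpdu_len[i] > mss) {
      if (n_seg == max_seg)
        return 0;
      seg_len[n_seg++] = cur;   /* close the segment on an FPDU boundary */
      cur = 0;
    }
    cur += fpdu_len[i];
  }
  if (cur > 0) {
    if (n_seg == max_seg)
      return 0;
    seg_len[n_seg++] = cur;     /* flush the final partial segment */
  }
  return n_seg;
}
```

A sender that is not MPA-aware would simply cut the byte stream at mss,
which is why an FPDU can then straddle two segments.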

Phrases like "MPA-aware TCP" raise a lot of questions. Briefly: the
hardware (even if it is called an ethernet driver) can create and work
with its own TCP flows, potentially modified in whatever way it likes, as
seen in the driver. According to the hardware descriptions, such flows will
likely not be seen by upper layers such as the OS network stack.

Is it correct?

> Steve. 

-- 
	Evgeniy Polyakov


From johnpol at 2ka.mipt.ru  Tue Dec  5 07:27:36 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 18:27:36 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165331676.16087.29.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
	<20061205051657.GB26845@2ka.mipt.ru> <aday7pmgbf6.fsf@cisco.com>
	<1165331676.16087.29.camel@stevo-desktop>
Message-ID: <20061205152736.GA2274@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 09:14:36AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> Chelsio doesn't implement TCP stack in the driver.  Just like Ammasso,
> it sends messages to the HW to setup connections.  It differs from
> Ammasso in at least 2 ways:
> 
> 1) Ammasso does the MPA negotiations in FW/HW.  Chelsio does it in the
> RDMA driver.  So there is code in the Chelsio driver to handle MPA
> startup negotiation (the exchange of 2 packets over the TCP connection
> while its still in streaming more).  BTW: This code _could_ be moved
> into the core IWCM if we find it could be used by other rnic devices
> (don't know yet).
> 
> 2) Ammasso implments a 100% deep adapter.  It does ARP, routing, IP,
> TCP, and IWARP protocols all in firmware/hw.  It had 2 mac addresses
> simulating 2 ethernet ports.  One exclusively for RDMA connections, and
> one for host stack traffic.  Chelsio implements a shallower adapter that
> only does TCP in HW.  ARP, for instance, is handled by the native stack
> and the rdma driver uses netevents to maintain arp tables in the HW for
> use by the offloaded TCP connections.

So, briefly: there is a TCP stack implementation (including ARP,
routing and other parts) in hardware/firmware/driver which is guaranteed
not to be visible to the host other than in the form of a high-level dataflow.
Am I right here?

> Steve.

-- 
	Evgeniy Polyakov


From swise at opengridcomputing.com  Tue Dec  5 07:39:58 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:39:58 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205151905.GA18275@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
Message-ID: <1165333198.16087.53.camel@stevo-desktop>

On Tue, 2006-12-05 at 18:19 +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 05, 2006 at 09:02:05AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > > >  > This and a lot of other changes in this driver definitely says you
> > > >  > implement your own stack of protocols on top of infiniband hardware.
> > > > 
> > > > ...but I do know this driver is for 10-gig ethernet HW.
> > > 
> > > It is for iwarp/rdma from description.
> > > If it is 10ge, then why does it parse incomping packet headers and
> > > implements initial tcp state machine?
> > > 
> > 
> > Its not implementing the TCP state machine at all. Its implementing the
> > MPA state machine (see the iWARP internet drafts).  These packets are
> > TCP payload.  MPA is used to negotiate RDMA mode on a TCP connection.
> > This entails an exchange of 2 messages on the TCP connection.  Once this
> > is exchanged and both side agree, the connection is bound to an RDMA QP
> > and the connection moved into RDMA mode.  From that point on, all IO is
> > done via the post_send() and post_recv().
> 
> And why does rdma require window scaling, keep alive, nagle and other
> interesting options from TCP spec?
> 

The connection setup messages sent to the hardware need to have these
parameters so the TCP engine on the HW knows how to do connection
options, windows, etc.

> This really looks like initial implementation of TCP in hardware - you
> setup flags like doing the same using setsockopt() and then hardware
> manages the flow like network stack manages TCP state machine changes.
> 
> According to draft-culley-iwarp-mpa-03.txt this layer can do a lot of
> things with valid TCP flow like
> 
>    5.  The TCP sender puts the FPDUs into the TCP stream.  If the TCP
>        Sender is MPA-aware, it segments the TCP stream in such a way
>        that a TCP Segment boundary is also the boundary of an FPDU.  
>        TCP then passes each segment to the IP layer for transmission.
> 
> Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying
> that hardware (even if it is called ethernet driver) can create and work
> with own TCP flows potentially modified in the way it likes which is seen 
> in driver. Likely such flows will not be seen by upper layers like OS 
> network stack according to hardware descriptions.
> 
> Is it correct?
> 

I don't quite get your point about the driver aspect of this?

The HW manages the iWARP connection including data flow.  It adheres to
the MPA, RDDP, and RDMAP protocol specifications (Internet-Drafts) from
the IETF.  The HW manages how data gets pushed out in the RDMA stream.
The RDMA driver just requests a TCP connection, does the MPA exchange,
and then tells the hardware to move the connection into RDMA mode.  From
that point on, the driver simply shuffles IO work requests from the
consumer application to the hardware and handles asynchronous events
while the connection is up and running.
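
As a rough illustration of that division of labour, here is a minimal
sketch of the active-side connection setup as a small state machine.
The state and event names are hypothetical and are not taken from the
Chelsio driver; the sketch only mirrors the sequence described above
(request TCP connection, do the MPA exchange, switch to RDMA mode, abort
on anything unexpected).

```c
/* Hypothetical states for one offloaded connection (illustrative names). */
enum conn_state {
	CONN_TCP_CONNECTING,	/* HW was asked to set up the TCP connection */
	CONN_MPA_REQ_SENT,	/* driver sent its MPA request in streaming mode */
	CONN_RDMA_MODE,		/* HW moved the connection into RDMA mode */
	CONN_ABORTED		/* negotiation failed, connection torn down */
};

/* Hypothetical events reported to the RDMA driver. */
enum conn_event {
	EV_TCP_ESTABLISHED,	/* HW reports the TCP connection is up */
	EV_MPA_REPLY_OK,	/* peer's MPA reply agreed to RDMA mode */
	EV_UNEXPECTED		/* anything else, e.g. a non-MPA packet */
};

/* Return the next state for the given state and event. */
static enum conn_state conn_next_state(enum conn_state s, enum conn_event ev)
{
	switch (s) {
	case CONN_TCP_CONNECTING:
		return ev == EV_TCP_ESTABLISHED ? CONN_MPA_REQ_SENT : CONN_ABORTED;
	case CONN_MPA_REQ_SENT:
		return ev == EV_MPA_REPLY_OK ? CONN_RDMA_MODE : CONN_ABORTED;
	case CONN_RDMA_MODE:
	case CONN_ABORTED:
		return s;	/* treated as terminal in this sketch */
	}
	return CONN_ABORTED;
}
```

Note that nothing in this sketch touches TCP sequence numbers, windows
or retransmits; in the design described here those stay in the HW.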

Steve.


From johann.george at qlogic.com  Tue Dec  5 07:43:50 2006
From: johann.george at qlogic.com (Johann George)
Date: Tue, 5 Dec 2006 07:43:50 -0800
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <45755DF1.5080208@dev.mellanox.co.il>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
	<45755DF1.5080208@dev.mellanox.co.il>
Message-ID: <20061205154350.GA11109@cuprite.pathscale.com>

> Who controls the DNS for openfabrics.org?

At the moment, I believe that Intel does.

> Could we get these names created?

Could you send me a list of the names you would like created and I will try
to initiate the process.

Johann


From swise at opengridcomputing.com  Tue Dec  5 07:46:18 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 09:46:18 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205152736.GA2274@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru> <ada3b7uhqlk.fsf@cisco.com>
	<20061205051657.GB26845@2ka.mipt.ru> <aday7pmgbf6.fsf@cisco.com>
	<1165331676.16087.29.camel@stevo-desktop>
	<20061205152736.GA2274@2ka.mipt.ru>
Message-ID: <1165333578.16087.60.camel@stevo-desktop>

On Tue, 2006-12-05 at 18:27 +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 05, 2006 at 09:14:36AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > Chelsio doesn't implement TCP stack in the driver.  Just like Ammasso,
> > it sends messages to the HW to setup connections.  It differs from
> > Ammasso in at least 2 ways:
> > 
> > 1) Ammasso does the MPA negotiations in FW/HW.  Chelsio does it in the
> > RDMA driver.  So there is code in the Chelsio driver to handle MPA
> > startup negotiation (the exchange of 2 packets over the TCP connection
> > while its still in streaming more).  BTW: This code _could_ be moved
> > into the core IWCM if we find it could be used by other rnic devices
> > (don't know yet).
> > 
> > 2) Ammasso implments a 100% deep adapter.  It does ARP, routing, IP,
> > TCP, and IWARP protocols all in firmware/hw.  It had 2 mac addresses
> > simulating 2 ethernet ports.  One exclusively for RDMA connections, and
> > one for host stack traffic.  Chelsio implements a shallower adapter that
> > only does TCP in HW.  ARP, for instance, is handled by the native stack
> > and the rdma driver uses netevents to maintain arp tables in the HW for
> > use by the offloaded TCP connections.
> 
> So breifly saying - there is TCP stack implementation (including ARP and
> routing and other parts) in hardware/firmware/driver which is guaranteed
> to not be visible to host other than in form of high-level dataflow.
> Am I right here?

For Ammasso, yes.  


From bugzilla-daemon at openib.org  Tue Dec  5 07:47:38 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue,  5 Dec 2006 07:47:38 -0800 (PST)
Subject: [openib-general] [Bug 308] New: IPOIB HA Failed - ping does not
	reach to destination
Message-ID: <20061205154738.DB4A82283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=308

           Summary: IPOIB HA Failed - ping does not reach to destination
           Product: OpenFabrics Linux
           Version: gen2
          Platform: Other
        OS/Version: RHEL 4
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: IPoIB
        AssignedTo: bugzilla at openib.org
        ReportedBy: yohadd at mellanox.co.il


IPOIB HA Failed - ping does not reach to destination.

Failure flow:
1) set HA up on host1. primary=ib0, secondary=ib1.
2) run opensm on host2.
3) run ping to the ip that associated with host1 ib0. - ping succeed.
4) set the port that associated with ib0 on host1 down. - ping starts to fail.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From jsquyres at cisco.com  Tue Dec  5 07:57:03 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 5 Dec 2006 10:57:03 -0500
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <20061205154350.GA11109@cuprite.pathscale.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
	<45755DF1.5080208@dev.mellanox.co.il>
	<20061205154350.GA11109@cuprite.pathscale.com>
Message-ID: <1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com>

How about the following:

git.openfabrics.org
wiki.openfabrics.org
trac.openfabrics.org
ssh.openfabrics.org

I'm assuming that these can all be CNAMEs to the main name.

(Since Intel is maintaining this, should we be bugging someone else  
instead of you?)


On Dec 5, 2006, at 10:43 AM, Johann George wrote:

>> Who controls the DNS for openfabrics.org?
>
> At the moment, I believe that Intel does.
>
>> Could we get these names created?
>
> Could you send me a list of the names you would like created and I  
> will try to initiate the process.
>
> Johann


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From swise at opengridcomputing.com  Tue Dec  5 08:02:09 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 10:02:09 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <45754DE3.1020505@ens-lyon.org>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
	<45754DE3.1020505@ens-lyon.org>
Message-ID: <1165334529.16087.69.camel@stevo-desktop>

On Tue, 2006-12-05 at 11:45 +0100, Brice Goglin wrote:
> Steve Wise wrote:
> > There is no SW TCP stack in this driver.  The HW supports RDMA over
> > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
> > (aka iWARP).  The device is a 10 GbE device, not Infiniband.
> 
> Then, I wonder why the driver goes in drivers/infiniband/ :)

drivers/infiniband supports both IB and iWARP transports.

> Is there really no way to only keep the actual hw infiniband there, move
> iwarp/rdma drivers in drivers/net/something/ and the core stuff in
> net/something/ ?
> 

Sure, this _could_ be done, but what I think you're missing is that
applications use the interface exported by drivers/infiniband over both
IB -and- IWARP transports.  The application can be written to not care
which transport is used.   Examples of apps that can run over both
transports using the same common interface: 

user mode: MVAPICH2, OMPI, IMPI, HPMPI, 
kernel mode: NFS-RDMA, iSER.  

Note that the include directory used by drivers/infiniband is now
include/rdma.  Perhaps drivers/infiniband should be renamed to
drivers/rdma as well at some point...


Steve.


From johnpol at 2ka.mipt.ru  Tue Dec  5 07:59:32 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 18:59:32 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165333198.16087.53.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
Message-ID: <20061205155932.GA32380@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying
> > that hardware (even if it is called ethernet driver) can create and work
> > with own TCP flows potentially modified in the way it likes which is seen 
> > in driver. Likely such flows will not be seen by upper layers like OS 
> > network stack according to hardware descriptions.
> > 
> > Is it correct?
> > 
> 
> I don't quite get your point about the driver aspect of this?
> 
> The HW manages the iWARP connection including data flow.  It adheres to
> the MPA, RDDP, and RDMAP protocol specification IDs from the IETF.  The
> HW manages how data gets pushed out in the RDMA stream.   The RDMA
> Driver just requests a TCP connection and does the MPA exchange.  Then
> tells the hardware to move the connection into RDMA mode.  From that
> point on, the driver simply suffles IO work requests from the consumer
> application to the hardware and handles asynchronous events while the
> connection is up and running.

My main concern about this is the fact that protocol handling is
split into SW and HW parts, and until negotiation is completed those
parts are completely unrelated to each other, so the requested TCP
connection can leak into the main stack, and the main stack can send
packets which could be taken for MPA negotiation.

> Steve.

-- 
	Evgeniy Polyakov


From johann.george at qlogic.com  Tue Dec  5 08:07:52 2006
From: johann.george at qlogic.com (Johann George)
Date: Tue, 5 Dec 2006 08:07:52 -0800
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
	<45755DF1.5080208@dev.mellanox.co.il>
	<20061205154350.GA11109@cuprite.pathscale.com>
	<1C4EC796-9CD7-4962-BC4E-F76B5443E624@cisco.com>
Message-ID: <20061205160752.GA11809@cuprite.pathscale.com>

> git.openfabrics.org
> wiki.openfabrics.org
> trac.openfabrics.org
> ssh.openfabrics.org

Sounds good.

> (Since Intel is maintaining this, should we be bugging someone else  
> instead of you?)

Ideally, yes; but I will be happy to initiate it.  Also, we probably should
move control of the domain name to OpenFabrics.

Johann


From swise at opengridcomputing.com  Tue Dec  5 08:12:42 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 10:12:42 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205155932.GA32380@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
Message-ID: <1165335162.16087.79.camel@stevo-desktop>

On Tue, 2006-12-05 at 18:59 +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying
> > > that hardware (even if it is called ethernet driver) can create and work
> > > with own TCP flows potentially modified in the way it likes which is seen 
> > > in driver. Likely such flows will not be seen by upper layers like OS 
> > > network stack according to hardware descriptions.
> > > 
> > > Is it correct?
> > > 
> > 
> > I don't quite get your point about the driver aspect of this?
> > 
> > The HW manages the iWARP connection including data flow.  It adheres to
> > the MPA, RDDP, and RDMAP protocol specification IDs from the IETF.  The
> > HW manages how data gets pushed out in the RDMA stream.   The RDMA
> > Driver just requests a TCP connection and does the MPA exchange.  Then
> > tells the hardware to move the connection into RDMA mode.  From that
> > point on, the driver simply suffles IO work requests from the consumer
> > application to the hardware and handles asynchronous events while the
> > connection is up and running.
> 
> My main concern about this is the fact, that protocol handling is
> splitted into SF and HW parts, and actually until negotiation is
> completed those parts are completely unrelated to each other, so
> requested TCP connection can leak into main stack and main stack can
> send some packets which can be considered as MPA negotiation.
> 

Ah.  Data from an offloaded connection cannot leak into the main stack
nor vice versa.  We can take an active RDMA connection establishment as
an example if you want:  Once the message is sent to the HW to "set up a
TCP connection from addr/port a.b to addr/port c.d", then packets on
that connection (that 4-tuple) will always be delivered to the RDMA
driver, not the native stack.  If the packet received after the
connection is set up is -not- an MPA reply (in this example), then the
connection is aborted.  Once the connection is aborted.  So no leaking
can happen.
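
To make the 4-tuple argument concrete, here is a minimal sketch of the
kind of lookup that decides where a packet goes.  The table layout and
all names are hypothetical (the real lookup is done by the HW, not by
code like this); the only point is that the decision is made purely on
the 4-tuple, and that removing the entry on abort frees the tuple again.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One TCP connection identity: source/destination address and port. */
struct four_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

#define MAX_OFFLOADED 16	/* arbitrary size for this sketch */

/* Hypothetical table of connections currently owned by the RDMA side. */
struct offload_table {
	struct four_tuple conns[MAX_OFFLOADED];
	size_t count;
};

static bool tuple_equal(const struct four_tuple *a, const struct four_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/* Packets on an offloaded 4-tuple go to the RDMA driver, all others
 * to the native stack. */
static bool goes_to_rdma_driver(const struct offload_table *t,
				const struct four_tuple *pkt)
{
	for (size_t i = 0; i < t->count; i++)
		if (tuple_equal(&t->conns[i], pkt))
			return true;
	return false;
}

/* Called when the HW is asked to set up the connection. */
static bool offload_add(struct offload_table *t, struct four_tuple ft)
{
	if (t->count == MAX_OFFLOADED || goes_to_rdma_driver(t, &ft))
		return false;
	t->conns[t->count++] = ft;
	return true;
}

/* Called on abort or close; the 4-tuple may be reused afterwards. */
static void offload_remove(struct offload_table *t, const struct four_tuple *ft)
{
	for (size_t i = 0; i < t->count; i++)
		if (tuple_equal(&t->conns[i], ft)) {
			t->conns[i] = t->conns[--t->count];
			return;
		}
}
```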


From tziporet at dev.mellanox.co.il  Tue Dec  5 08:17:16 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 05 Dec 2006 18:17:16 +0200
Subject: [openib-general] OFED 1.2 features update
Message-ID: <45759B8C.8010408@dev.mellanox.co.il>

Hi,
In the OFED meeting yesterday the following decisions were taken:

1.  We agreed to have two types of features

    * Must have features - will delay the release if not ready
    * Desirable features - will be included only if they are ready on
      time according to OFED requirements.

2. The following features are added to OFED 1.2 as desired:

   1. iWARP - someone from iWARP company should be the owner
   2. VNIC - Madhue

    OFED 1.2 plan was updated on the Wiki: 
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features

3. NFSoverRDMA:

    Will probably not be part of OFED 1.2 since it requires kernel patches.
    Tom Tucker will prepare a package that will be installed over OFED 1.2

4. Sean should prepare patches or git tree for kernel code that is not 
upstream (e.g. SA cache)
5. Hal will take care of git-commit mails
6. Tziporet should send an explanation of OFED inclusion requirements 
(backport patches, install scripts, etc.)

Tziporet


From swise at opengridcomputing.com  Tue Dec  5 08:17:43 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 10:17:43 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165335162.16087.79.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
Message-ID: <1165335463.16087.83.camel@stevo-desktop>

On Tue, 2006-12-05 at 10:12 -0600, Steve Wise wrote:
> On Tue, 2006-12-05 at 18:59 +0300, Evgeniy Polyakov wrote:
> > On Tue, Dec 05, 2006 at 09:39:58AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > > > Phrases like "MPA-aware TCP" rises a lot of questions - briefly saying
> > > > that hardware (even if it is called ethernet driver) can create and work
> > > > with own TCP flows potentially modified in the way it likes which is seen 
> > > > in driver. Likely such flows will not be seen by upper layers like OS 
> > > > network stack according to hardware descriptions.
> > > > 
> > > > Is it correct?
> > > > 
> > > 
> > > I don't quite get your point about the driver aspect of this?
> > > 
> > > The HW manages the iWARP connection including data flow.  It adheres to
> > > the MPA, RDDP, and RDMAP protocol specification IDs from the IETF.  The
> > > HW manages how data gets pushed out in the RDMA stream.   The RDMA
> > > Driver just requests a TCP connection and does the MPA exchange.  Then
> > > tells the hardware to move the connection into RDMA mode.  From that
> > > point on, the driver simply suffles IO work requests from the consumer
> > > application to the hardware and handles asynchronous events while the
> > > connection is up and running.
> > 
> > My main concern about this is the fact, that protocol handling is
> > splitted into SF and HW parts, and actually until negotiation is
> > completed those parts are completely unrelated to each other, so
> > requested TCP connection can leak into main stack and main stack can
> > send some packets which can be considered as MPA negotiation.
> > 
> 
> Ah.  Data from an offloaded connection cannot leak into the main stack
> nor vice-verse.  We can take an active RDMA connection establishment as
> an example if you want:  Once the message is sent to the HW to "setup a
> TCP connection from addr/port a.b to addr/port c.d", then packets on
> that connection (that 4-tuple) will always be delivered to the RDMA
> driver, not the native stack.  If the the packet received after the
> connection is setup is -not- an MPA reply (in this example), then the
> connection is aborted.  Once the connection is aborted.  
                                                       ^ the 4 tuple can
then be reused for rdma or native stack tcp connections.


From mst at mellanox.co.il  Tue Dec  5 08:19:44 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 5 Dec 2006 18:19:44 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <20061129140016.GO5061@mellanox.co.il>
References: <20061129140016.GO5061@mellanox.co.il>
Message-ID: <20061205161944.GD30209@mellanox.co.il>

The following patch adds experimental support for IPoIB connected mode.
The idea is to increase performance by increasing the MTU
from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD.
With this code, I'm able to get 800MByte/sec or more with netperf
without options on a Mellanox 4x back-to-back DDR system.

Please review.

I labeled CM support as experimental, although it's been very stable for me,
mostly because there are still some things to be addressed before it's as usable
as IPoIB UD. I am very interested in getting this code in shape for merging as
early as possible, as opposed to maintaining it out of tree until it's fully
mature, and I tried to split the CM code in a separate file to make this
feasible.

Let me know whether this was a good idea, or whether more needs to be done
in this direction.

Note that the connected mode support adds very little overhead when not activated
at run time, and zero data-path overhead when not activated at compile time.
Here's a short description of what the patch does:

a. The code's here:
git://staging.openfabrics.org/~mst/linux-2.6/.git ipoib_cm_branch
This is based on 2.6.19, so
~>git diff v2.6.19..ipoib_cm_branch
will show what I have done so far.

b. How to activate:
Server:
#modprobe ib_ipoib
#/sbin/ifconfig ib0 mtu 65520
#./netperf-2.4.2/src/netserver

Client:
#modprobe ib_ipoib
#/sbin/ifconfig ib0 mtu 65520
#./netperf-2.4.2/src/netperf -H 11.4.3.68 -f M
	TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68)
	port 0 AF_INET : demo
	Recv   Send    Send
	Socket Socket  Message  Elapsed
	Size   Size    Size     Time     Throughput
	bytes  bytes   bytes    secs.    MBytes/sec

	87380  16384  16384    10.01     891.21
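
For reference, the MTU of 65520 used in the commands above follows from
the constants the patch adds to ipoib.h.  The arithmetic, as a small
sketch in plain C (the shortened names and the cm_clamp_mtu() helper are
only for illustration and are not part of the patch):

```c
enum {
	ENCAP_LEN   = 4,		/* IPOIB_ENCAP_LEN */
	CM_MTU      = 0x10000 - 0x10,	/* IPOIB_CM_MTU: 64K minus 16 bytes */
	CM_BUF_SIZE = CM_MTU + ENCAP_LEN /* IPOIB_CM_BUF_SIZE */
};

/* Illustrative helper: limit a requested MTU to what CM can carry. */
static unsigned int cm_clamp_mtu(unsigned int requested)
{
	return requested > CM_MTU ? CM_MTU : requested;
}
```

So 0x10000 - 0x10 = 65536 - 16 = 65520, and each receive buffer is
65520 + 4 = 65524 bytes once the IPoIB encapsulation header is added.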

c. TODO list
1. Clean up stale connections
4. (Optional) S/G support
5. (Optional) Make CM use same CQ IPoIB uses for UD

d. Limitations
UDP multicast and UDP connections to IPoIB UD mode
currently don't work since we get packets that are too large to
send over a UD QP.
As a workaround, one can now create separate interfaces
for use with CM and UD mode.

e. Some notes on code
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction,
   so each connection is either active(TX) or passive (RX).
   2 sides that want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
   this keeps the code simple without CQ resize and other tricks
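
One more detail that may help when reading the patch: whether a
neighbour is reached over CM is decided from the first bytes of its
hardware address plus the interface MTU.  A user-space sketch of that
check follows; it mirrors IPOIB_FLAGS_RC, IPOIB_QPN() and
ipoib_cm_enabled() from the patch, but the function names here are made
up, and the value 2048 for IPOIB_PACKET_SIZE is an assumption based on
the "2K" comment in the code.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAGS_RC    0x80	/* IPOIB_FLAGS_RC: peer accepts RC connections */
#define ENCAP_LEN   4		/* IPOIB_ENCAP_LEN */
#define PACKET_SIZE 2048	/* assumed value of IPOIB_PACKET_SIZE */

/* QPN: low 24 bits of the first big-endian 32-bit word of the address;
 * the high byte carries the flags and is masked off. */
static uint32_t ha_qpn(const uint8_t *ha)
{
	return ((uint32_t)ha[1] << 16) | ((uint32_t)ha[2] << 8) | (uint32_t)ha[3];
}

/* True if the peer advertises RC support in the flags byte. */
static bool ha_rc_capable(const uint8_t *ha)
{
	return (ha[0] & FLAGS_RC) != 0;
}

/* The heuristic from the patch: RC-capable peer and MTU above 2K. */
static bool use_connected_mode(const uint8_t *ha, unsigned int mtu)
{
	return ha_rc_capable(ha) && mtu > PACKET_SIZE - ENCAP_LEN;
}
```

With the default UD MTU the heuristic stays false, which is why the
interface MTU has to be raised before any connections are created.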

Regarding the limitations in (d): I'm looking at ways to limit the path MTU
for these connections, to make it work.


Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig
index c75322d..7aa3a25 100644
--- a/drivers/infiniband/ulp/ipoib/Kconfig
+++ b/drivers/infiniband/ulp/ipoib/Kconfig
@@ -8,6 +8,15 @@ config INFINIBAND_IPOIB
 
 	  See Documentation/infiniband/ipoib.txt for more information
 
+config INFINIBAND_IPOIB_CM
+	bool "IP-over-InfiniBand Connected Mode support"
+	depends on INFINIBAND_IPOIB && EXPERIMENTAL
+	default n
+	---help---
+	  This option enables experimental support for IPoIB connected mode.
+	  After enabling this option, you need to increase the interface MTU
+	  with e.g. ifconfig ib0 mtu 65520 to actually create connections.
+
 config INFINIBAND_IPOIB_DEBUG
 	bool "IP-over-InfiniBand debugging" if EMBEDDED
 	depends on INFINIBAND_IPOIB
diff --git a/drivers/infiniband/ulp/ipoib/Makefile b/drivers/infiniband/ulp/ipoib/Makefile
index 8935e74..f01a24b 100644
--- a/drivers/infiniband/ulp/ipoib/Makefile
+++ b/drivers/infiniband/ulp/ipoib/Makefile
@@ -6,4 +6,5 @@ ib_ipoib-y					:= ipoib_main.o \
 						   ipoib_verbs.o \
 						   ipoib_vlan.o
 ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG)	+= ipoib_fs.o
+ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM)		+= ipoib_cm.o
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 0b8a79d..545cdae 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -62,6 +62,9 @@ enum {
 
 	IPOIB_ENCAP_LEN 	  = 4,
 
+	IPOIB_CM_MTU              = 0x10000 - 0x10, /* padding to align header to 16 */
+	IPOIB_CM_BUF_SIZE         = IPOIB_CM_MTU  + IPOIB_ENCAP_LEN,
+
 	IPOIB_RX_RING_SIZE 	  = 128,
 	IPOIB_TX_RING_SIZE 	  = 64,
 	IPOIB_MAX_QUEUE_SIZE	  = 8192,
@@ -81,6 +84,7 @@ enum {
 	IPOIB_MCAST_RUN 	  = 6,
 	IPOIB_STOP_REAPER         = 7,
 	IPOIB_MCAST_STARTED       = 8,
+	IPOIB_FLAG_NETIF_STOPPED  = 9,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -113,6 +117,49 @@ struct ipoib_tx_buf {
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 };
 
+struct ib_cm_id;
+
+struct ipoib_cm_data {
+	__be32 qpn; /* High byte MUST be ignored on receive */
+	__be32 mtu;
+};
+
+struct ipoib_cm_rx {
+	struct ib_cm_id     *id;
+	struct ib_qp        *qp;
+	struct list_head     list;
+	struct net_device   *dev;
+};
+
+struct ipoib_cm_tx {
+	struct ib_cm_id     *id;
+	struct ib_cq        *cq;
+	struct ib_qp        *qp;
+	struct list_head     list;
+	struct net_device   *dev;
+	struct ipoib_neigh  *neigh;
+	struct ipoib_path   *path;
+	struct ipoib_tx_buf *tx_ring;
+	unsigned             tx_head;
+	unsigned             tx_tail;
+	unsigned long        flags;
+	u32                  mtu;
+	struct ib_wc         ibwc[IPOIB_NUM_WC];
+};
+
+struct ipoib_cm_dev_priv {
+	struct ib_cq  	    *cq;
+	struct ib_srq  	    *srq;
+	struct ipoib_rx_buf *srq_ring;
+	struct ib_cm_id     *id;
+	struct list_head     passive_ids;
+	struct work_struct   start_task;
+	struct work_struct   reap_task;
+	struct list_head     start_list;
+	struct list_head     reap_list;
+	struct ib_wc         ibwc[IPOIB_NUM_WC];
+};
+
 /*
  * Device private locking: tx_lock protects members used in TX fast
  * path (and we use LLTX so upper layers don't do extra locking).
@@ -179,6 +226,8 @@ struct ipoib_dev_priv {
 	struct list_head child_intfs;
 	struct list_head list;
 
+	struct ipoib_cm_dev_priv cm;
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 	struct list_head fs_list;
 	struct dentry *mcg_dentry;
@@ -212,6 +261,7 @@ struct ipoib_path {
 
 struct ipoib_neigh {
 	struct ipoib_ah    *ah;
+	struct ipoib_cm_tx *cm;
 	union ib_gid        dgid;
 	struct sk_buff_head queue;
 
@@ -315,6 +365,93 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 void ipoib_pkey_poll(void *dev);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
 
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+
+#define IPOIB_FLAGS_RC          0x80
+#define IPOIB_FLAGS_UC          0x40
+
+#define IPOIB_CM_ENABLED(ha)   (ha[0] & IPOIB_FLAGS_RC)
+
+static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n)
+{
+	/* Simple heuristic: dev->mtu > 2K ==> connected mode */
+	return (IPOIB_CM_ENABLED(n->ha) &&
+		dev->mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN);
+}
+
+static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh)
+{
+	return neigh->cm;
+}
+
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx);
+int ipoib_cm_dev_open(struct net_device *dev);
+void ipoib_cm_dev_stop(struct net_device *dev);
+int ipoib_cm_dev_init(struct net_device *dev);
+void ipoib_cm_dev_cleanup(struct net_device *dev);
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				    struct ipoib_neigh *neigh);
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
+#else
+
+#define IPOIB_CM_ENABLED(ha)   (0)
+
+static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n)
+
+{
+	return 0;
+}
+
+static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh)
+{
+	return NULL;
+}
+
+static inline
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
+{
+	return;
+}
+
+static inline
+int ipoib_cm_dev_open(struct net_device *dev)
+{
+	return 0;
+}
+
+static inline
+void ipoib_cm_dev_stop(struct net_device *dev)
+{
+	return;
+}
+
+static inline
+int ipoib_cm_dev_init(struct net_device *dev)
+{
+	return 0;
+}
+
+static inline
+void ipoib_cm_dev_cleanup(struct net_device *dev)
+{
+	return;
+}
+
+static inline
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				    struct ipoib_neigh *neigh)
+{
+	return NULL;
+}
+
+static inline
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
+{
+	return;
+}
+
+#endif
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 void ipoib_create_debug_files(struct net_device *dev);
 void ipoib_delete_debug_files(struct net_device *dev);
@@ -392,4 +529,7 @@ extern int ipoib_debug_level;
 
 #define IPOIB_GID_ARG(gid)	IPOIB_GID_RAW_ARG((gid).raw)
 
+#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
+
+
 #endif /* _IPOIB_H */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
new file mode 100644
index 0000000..a40eb4c
--- /dev/null
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -0,0 +1,1043 @@
+/*
+ * Copyright (c) 2006 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+#include <rdma/ib_cm.h>
+#include <rdma/ib_cache.h>
+
+#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
+static int data_debug_level;
+
+module_param_named(cm_data_debug_level, data_debug_level, int, 0644);
+MODULE_PARM_DESC(cm_data_debug_level,
+		 "Enable data path debug tracing for connected mode if > 0");
+#endif
+
+#include "ipoib.h"
+
+#define IPOIB_CM_IETF_ID 0x1000000000000000ULL
+
+#define	IPOIB_OP_SRQ	(1ul << 30)
+
+struct ipoib_cm_id {
+	struct ib_cm_id *id;
+	int flags;
+	u32 remote_qpn;
+	u32 remote_mtu;
+};
+
+int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+static int ipoib_cm_post_receive(struct net_device *dev, int id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sge list;
+	struct ib_recv_wr param;
+	struct ib_recv_wr *bad_wr;
+	int ret;
+
+	list.addr     = priv->cm.srq_ring[id].mapping;
+	list.length   = IPOIB_CM_BUF_SIZE;
+	list.lkey     = priv->mr->lkey;
+
+	param.next    = NULL;
+	param.wr_id   = id | IPOIB_OP_SRQ;
+	param.sg_list = &list;
+	param.num_sge = 1;
+
+	ret = ib_post_srq_recv(priv->cm.srq, &param, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
+		dma_unmap_single(priv->ca->dma_device,
+				 priv->cm.srq_ring[id].mapping,
+				 IPOIB_CM_BUF_SIZE, DMA_FROM_DEVICE);
+		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
+		priv->cm.srq_ring[id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	dma_addr_t addr;
+
+	skb = dev_alloc_skb(IPOIB_CM_BUF_SIZE + 12);
+	if (!skb)
+		return -ENOMEM;
+
+	/*
+	 * IPoIB adds a 4 byte header. So we need 12 more bytes to align the
+	 * IP header to a multiple of 16.
+	 */
+	skb_reserve(skb, 12);
+
+	addr = dma_map_single(priv->ca->dma_device,
+			      skb->data, IPOIB_CM_BUF_SIZE,
+			      DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(addr))) {
+		dev_kfree_skb_any(skb);
+		return -EIO;
+	}
+
+	priv->cm.srq_ring[id].skb     = skb;
+	priv->cm.srq_ring[id].mapping = addr;
+
+	return 0;
+}
+
+static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr attr = {
+		.send_cq = priv->cm.cq, /* does not matter, we never send anything */
+		.recv_cq = priv->cm.cq,
+		.srq = priv->cm.srq,
+		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type = IB_QPT_RC,
+	};
+	return ib_create_qp(priv->pd, &attr);
+}
+
+static int ipoib_cm_modify_rx_rts(struct net_device *dev,
+				  struct ib_cm_id *cm_id, struct ib_qp *qp)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for INIT: %d\n", ret);
+		return ret;
+	}
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to INIT: %d\n", ret);
+		return ret;
+	}
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret);
+		return ret;
+	}
+	qp_attr.rq_psn = 0 /* FIXME */;
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id,
+			     struct ib_qp *qp, struct ib_cm_req_event_param *req)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_data data = {};
+	struct ib_cm_rep_param rep = {};
+
+	data.qpn = cpu_to_be32(priv->qp->qp_num);
+	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+
+	rep.private_data = &data;
+	rep.private_data_len = sizeof data;
+	rep.flow_control = 0;
+	rep.rnr_retry_count = req->rnr_retry_count;
+	rep.target_ack_delay = 20; /* FIXME */
+	rep.srq = 1;
+	rep.qp_num = qp->qp_num;
+	rep.starting_psn = 0 /* FIXME */;
+	return ib_send_cm_rep(cm_id, &rep);
+}
+
+static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_rx *p;
+	unsigned long flags;
+	int ret;
+
+	ipoib_dbg(priv, "REQ arrived\n");
+	p = kzalloc(sizeof *p, GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+	p->dev = dev;
+	p->id = cm_id;
+	p->qp = ipoib_cm_create_rx_qp(dev);
+	if (IS_ERR(p->qp)) {
+		ret = PTR_ERR(p->qp);
+		goto err_qp;
+	}
+
+	ret = ipoib_cm_modify_rx_rts(dev, cm_id, p->qp);
+	if (ret)
+		goto err_modify;
+
+	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd);
+	if (ret) {
+		ipoib_warn(priv, "failed to send REP: %d\n", ret);
+		goto err_rep;
+	}
+
+	cm_id->context = p;
+	spin_lock_irqsave(&priv->lock, flags);
+	list_add(&p->list, &priv->cm.passive_ids);
+	spin_unlock_irqrestore(&priv->lock, flags);
+	return 0;
+
+err_rep:
+err_modify:
+	ib_destroy_qp(p->qp);
+err_qp:
+	kfree(p);
+	return ret;
+}
+
+int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_rx *p;
+	struct ipoib_dev_priv *priv;
+	unsigned long flags;
+	int ret;
+
+	switch (event->event) {
+	case IB_CM_REQ_RECEIVED:
+		return ipoib_cm_req_handler(cm_id, event);
+	case IB_CM_DREQ_RECEIVED:
+		p = cm_id->context;
+		ib_send_cm_drep(cm_id, NULL, 0);
+		/* Fall through */
+	case IB_CM_REJ_RECEIVED:
+		p = cm_id->context;
+		priv = netdev_priv(p->dev);
+		spin_lock_irqsave(&priv->lock, flags);
+		if (list_empty(&p->list))
+			ret = 0; /* Connection is going away already. */
+		else {
+			list_del(&p->list);
+			ret = -ECONNRESET;
+		}
+		spin_unlock_irqrestore(&priv->lock, flags);
+		if (ret) {
+			ib_destroy_qp(p->qp);
+			kfree(p);
+			return ret;
+		}
+		return 0;
+	default:
+		return 0;
+	}
+}
+
+static void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_SRQ;
+	struct sk_buff *skb;
+	dma_addr_t addr;
+
+	ipoib_dbg_data(priv, "cm recv completion: id %d, op %d, status: %d\n",
+		       wr_id, wc->opcode, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+			   wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	skb  = priv->cm.srq_ring[wr_id].skb;
+	addr = priv->cm.srq_ring[wr_id].mapping;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		++priv->stats.rx_dropped;
+		goto repost;
+	}
+
+	/*
+	 * If we can't allocate a new RX buffer, dump
+	 * this packet and reuse the old buffer.
+	 */
+	if (unlikely(ipoib_cm_alloc_rx_skb(dev, wr_id))) {
+		++priv->stats.rx_dropped;
+		goto repost;
+	}
+
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	dma_unmap_single(priv->ca->dma_device, addr,
+			 IPOIB_CM_BUF_SIZE, DMA_FROM_DEVICE);
+
+	skb_put(skb, wc->byte_len);
+
+	if (wc->slid != priv->local_lid ||
+	    wc->src_qp != priv->qp->qp_num) {
+		skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+		skb->mac.raw = skb->data;
+		skb_pull(skb, IPOIB_ENCAP_LEN);
+
+		dev->last_rx = jiffies;
+		++priv->stats.rx_packets;
+		priv->stats.rx_bytes += skb->len;
+
+		skb->dev = dev;
+		/* XXX get correct PACKET_ type here */
+		skb->pkt_type = PACKET_HOST;
+		netif_rx_ni(skb);
+	} else {
+		ipoib_dbg_data(priv, "dropping loopback packet\n");
+		dev_kfree_skb_any(skb);
+	}
+
+repost:
+	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
+		ipoib_warn(priv, "ipoib_cm_post_receive failed "
+			   "for buf %d\n", wr_id);
+}
+
+void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr)
+{
+	struct net_device *dev = (struct net_device *) dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int n, i;
+
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	do {
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->cm.ibwc);
+		for (i = 0; i < n; ++i)
+			ipoib_cm_handle_rx_wc(dev, priv->cm.ibwc + i);
+	} while (n == IPOIB_NUM_WC);
+}
+
+static inline int post_send(struct ipoib_dev_priv *priv,
+			    struct ipoib_cm_tx *tx,
+			    unsigned int wr_id,
+			    dma_addr_t addr, int len)
+{
+	struct ib_send_wr *bad_wr;
+
+	priv->tx_sge.addr             = addr;
+	priv->tx_sge.length           = len;
+
+	priv->tx_wr.wr_id	      = wr_id;
+
+	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+}
+
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_tx_buf *tx_req;
+	dma_addr_t addr;
+
+	if (unlikely(skb->len > tx->mtu)) {
+		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+			   skb->len, tx->mtu);
+		++priv->stats.tx_dropped;
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
+	ipoib_dbg_data(priv, "sending packet %p, head %d length=%d connection=%p\n",
+		       skb, tx->tx_head, skb->len, tx);
+
+	/*
+	 * We put the skb into the tx_ring _before_ we call post_send()
+	 * because it's entirely possible that the completion handler will
+	 * run before we execute anything after the post_send().  That
+	 * means we have to make sure everything is properly recorded and
+	 * our state is consistent before we call post_send().
+	 */
+	tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)];
+	tx_req->skb = skb;
+	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
+			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(addr))) {
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+	pci_unmap_addr_set(tx_req, mapping, addr);
+
+	if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1),
+			        addr, skb->len))) {
+		ipoib_warn(priv, "post_send failed\n");
+		++priv->stats.tx_errors;
+		dma_unmap_single(priv->ca->dma_device, addr, skb->len,
+				 DMA_TO_DEVICE);
+		dev_kfree_skb_any(skb);
+	} else {
+		dev->trans_start = jiffies;
+		++tx->tx_head;
+
+		if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) {
+			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+			netif_stop_queue(dev);
+			set_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags);
+		}
+	}
+}
+
+static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx,
+				  struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned int wr_id = wc->wr_id;
+	struct ipoib_tx_buf *tx_req;
+	unsigned long flags;
+
+	ipoib_dbg_data(priv, "cm send completion: id %d, op %d, status: %d\n",
+		       wr_id, wc->opcode, wc->status);
+
+	if (unlikely(wr_id >= ipoib_sendq_size)) {
+		ipoib_warn(priv, "cm send completion event with wrid %d (> %d)\n",
+			   wr_id, ipoib_sendq_size);
+		return;
+	}
+
+	tx_req = &tx->tx_ring[wr_id];
+
+	dma_unmap_single(priv->ca->dma_device,
+			 pci_unmap_addr(tx_req, mapping),
+			 tx_req->skb->len,
+			 DMA_TO_DEVICE);
+
+	/* FIXME: is this right? Shouldn't we only increment on success? */
+	++priv->stats.tx_packets;
+	priv->stats.tx_bytes += tx_req->skb->len;
+
+	dev_kfree_skb_any(tx_req->skb);
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	++tx->tx_tail;
+	if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags) &&
+	    tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1) {
+		netif_wake_queue(dev);
+	}
+
+	if (wc->status != IB_WC_SUCCESS &&
+	    wc->status != IB_WC_WR_FLUSH_ERR) {
+		struct ipoib_neigh *neigh;
+
+		ipoib_dbg(priv, "failed cm send event "
+			   "(status=%d, wrid=%d vend_err %x)\n",
+			   wc->status, wr_id, wc->vendor_err);
+
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+
+		if (neigh) {
+			neigh->cm = NULL;
+			list_del(&neigh->list);
+			if (neigh->ah)
+				ipoib_put_ah(neigh->ah);
+			ipoib_neigh_free(neigh);
+
+			tx->neigh = NULL;
+		}
+		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+			list_move(&tx->list, &priv->cm.reap_list);
+			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		}
+
+		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
+
+		spin_unlock(&priv->lock);
+	}
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
+{
+	struct ipoib_cm_tx *tx = tx_ptr;
+	int n, i;
+
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	do {
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
+		for (i = 0; i < n; ++i)
+			ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
+	} while (n == IPOIB_NUM_WC);
+}
+
+int ipoib_cm_dev_open(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+
+	if (!IPOIB_CM_ENABLED(dev->dev_addr))
+		return 0;
+
+	priv->cm.cq = ib_create_cq(priv->ca, ipoib_cm_rx_completion, NULL, dev,
+				   ipoib_recvq_size + 1);
+	if (IS_ERR(priv->cm.cq)) {
+		printk(KERN_WARNING "%s: failed to create CQ\n", priv->ca->name);
+		return PTR_ERR(priv->cm.cq);
+	}
+
+	ib_req_notify_cq(priv->cm.cq, IB_CQ_NEXT_COMP);
+
+	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
+	if (IS_ERR(priv->cm.id)) {
+		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
+		ib_destroy_cq(priv->cm.cq);
+		return PTR_ERR(priv->cm.id);
+	}
+
+	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
+			   0, NULL);
+	if (ret) {
+		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
+		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
+		ib_destroy_cm_id(priv->cm.id);
+		ib_destroy_cq(priv->cm.cq);
+		return ret;
+	}
+	return 0;
+}
+
+void ipoib_cm_dev_stop(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_rx *p;
+	unsigned long flags;
+
+	if (!IPOIB_CM_ENABLED(dev->dev_addr))
+		return;
+
+	ib_destroy_cm_id(priv->cm.id);
+	spin_lock_irqsave(&priv->lock, flags);
+	while (!list_empty(&priv->cm.passive_ids)) {
+		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		list_del_init(&p->list);
+		spin_unlock_irqrestore(&priv->lock, flags);
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+		spin_lock_irqsave(&priv->lock, flags);
+	}
+	spin_unlock_irqrestore(&priv->lock, flags);
+	ib_destroy_cq(priv->cm.cq);
+}
+
+static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_tx *p = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	struct ipoib_cm_data *data = event->private_data;
+	struct sk_buff_head skqueue;
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	p->mtu = be32_to_cpu(data->mtu);
+
+	if (p->mtu < priv->dev->mtu + IPOIB_ENCAP_LEN) {
+		ipoib_warn(priv, "Rejecting connection: mtu %d < device mtu %d + 4\n",
+			   p->mtu, priv->dev->mtu);
+		return -EINVAL;
+	}
+
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret);
+		return ret;
+	}
+
+	qp_attr.rq_psn = 0 /* FIXME */;
+	ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret);
+		return ret;
+	}
+	ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret);
+		return ret;
+	}
+
+	skb_queue_head_init(&skqueue);
+
+	spin_lock_irqsave(&priv->lock, flags);
+	set_bit(IPOIB_FLAG_OPER_UP, &p->flags);
+	if (p->neigh)
+		while ((skb = __skb_dequeue(&p->neigh->queue)))
+			__skb_queue_tail(&skqueue, skb);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	while ((skb = __skb_dequeue(&skqueue))) {
+		skb->dev = p->dev;
+		if (dev_queue_xmit(skb))
+			ipoib_warn(priv, "dev_queue_xmit failed "
+				   "to requeue packet\n");
+	}
+
+	ret = ib_send_cm_rtu(cm_id, NULL, 0);
+	if (ret) {
+		ipoib_warn(priv, "failed to send RTU: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr attr = {};
+	attr.recv_cq = priv->cm.cq;
+	attr.srq = priv->cm.srq;
+	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_send_sge = 1;
+	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
+	attr.qp_type = IB_QPT_RC;
+	attr.send_cq = cq;
+	return ib_create_qp(priv->pd, &attr);
+}
+
+static int ipoib_cm_send_req(struct net_device *dev,
+			     struct ib_cm_id *id, struct ib_qp *qp,
+			     u32 qpn,
+			     struct ib_sa_path_rec *pathrec)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_data data = {};
+	struct ib_cm_req_param req = {};
+
+	data.qpn = cpu_to_be32(priv->qp->qp_num);
+	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+
+	req.primary_path 	      = pathrec;
+	req.alternate_path 	      = NULL;
+	req.service_id                = cpu_to_be64(IPOIB_CM_IETF_ID | qpn);
+	req.qp_num 		      = qp->qp_num;
+	req.qp_type 		      = qp->qp_type;
+	req.private_data 	      = &data;
+	req.private_data_len 	      = sizeof data;
+	req.flow_control 	      = 0;
+
+	req.starting_psn              = 0; /* FIXME */
+
+	/*
+	 * Pick some arbitrary defaults here; we could make these
+	 * module parameters if anyone cared about setting them.
+	 */
+	req.responder_resources	      = 4;
+	req.remote_cm_response_timeout = 20;
+	req.local_cm_response_timeout  = 20;
+	req.retry_count 	      = 0; /* RFC draft warns against retries */
+	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
+	req.max_cm_retries 	      = 15;
+	req.srq 	              = 1;
+	return ib_send_cm_req(id, &req);
+}
+
+static int ipoib_cm_modify_tx_init(struct net_device *dev,
+				  struct ib_cm_id *cm_id, struct ib_qp *qp)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	if (ret) {
+		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+	qp_attr.port_num = priv->port;
+	qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT;
+
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify tx QP to INIT: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	int ret;
+
+	ipoib_dbg(priv, "Request connection %p for gid " IPOIB_GID_FMT " qpn 0x%x\n",
+		  p, IPOIB_GID_ARG(pathrec->dgid), qpn);
+
+	p->tx_ring = kzalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+				GFP_KERNEL);
+	if (!p->tx_ring) {
+		ipoib_warn(priv, "failed to allocate tx ring\n");
+		ret = -ENOMEM;
+		goto err_tx;
+	}
+
+	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
+			     ipoib_sendq_size + 1);
+	if (IS_ERR(p->cq)) {
+		ret = PTR_ERR(p->cq);
+		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
+		goto err_cq;
+	}
+
+	ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP);
+	if (ret) {
+		ipoib_warn(priv, "failed to request completion notification: %d\n", ret);
+		goto err_req_notify;
+	}
+
+	p->qp = ipoib_cm_create_tx_qp(p->dev, p->cq);
+	if (IS_ERR(p->qp)) {
+		ret = PTR_ERR(p->qp);
+		ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret);
+		goto err_qp;
+	}
+
+	p->id = ib_create_cm_id(priv->ca, ipoib_cm_tx_handler, p);
+	if (IS_ERR(p->id)) {
+		ret = PTR_ERR(p->id);
+		ipoib_warn(priv, "failed to create tx cm id: %d\n", ret);
+		goto err_id;
+	}
+
+	ret = ipoib_cm_modify_tx_init(p->dev, p->id,  p->qp);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify tx qp to init: %d\n", ret);
+		goto err_modify;
+	}
+
+	ret = ipoib_cm_send_req(p->dev, p->id, p->qp, qpn, pathrec);
+	if (ret) {
+		ipoib_warn(priv, "failed to send cm req: %d\n", ret);
+		goto err_send_cm;
+	}
+	return 0;
+
+err_send_cm:
+err_modify:
+	ib_destroy_cm_id(p->id);
+err_id:
+	p->id = NULL;
+	ib_destroy_qp(p->qp);
+err_req_notify:
+err_qp:
+	p->qp = NULL;
+	ib_destroy_cq(p->cq);
+err_cq:
+	p->cq = NULL;
+err_tx:
+	return ret;
+}
+
+void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	struct ipoib_tx_buf *tx_req;
+
+	ipoib_dbg(priv, "Destroy active connection %p. head 0x%x tail 0x%x\n",
+		  p, p->tx_head, p->tx_tail);
+
+	if (p->id)
+		ib_destroy_cm_id(p->id);
+
+	if (p->qp)
+		ib_destroy_qp(p->qp);
+
+	if (p->cq)
+		ib_destroy_cq(p->cq);
+
+	if (p->tx_ring) {
+		while ((int) p->tx_tail - (int) p->tx_head < 0) {
+			tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
+			dma_unmap_single(priv->ca->dma_device,
+					 pci_unmap_addr(tx_req, mapping),
+					 tx_req->skb->len,
+					 DMA_TO_DEVICE);
+			dev_kfree_skb_any(tx_req->skb);
+			++p->tx_tail;
+		}
+
+		kfree(p->tx_ring);
+	}
+
+	kfree(p);
+}
+
+int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_tx *tx = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
+	struct ipoib_neigh *neigh;
+	unsigned long flags;
+	int ret;
+
+	switch (event->event) {
+	case IB_CM_DREQ_RECEIVED:
+		ipoib_dbg(priv, "DREQ received.\n");
+		ib_send_cm_drep(cm_id, NULL, 0);
+		break;
+	case IB_CM_REP_RECEIVED:
+		ipoib_dbg(priv, "REP received.\n");
+		ret = ipoib_cm_rep_handler(cm_id, event);
+		if (ret)
+			ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+				       NULL, 0, NULL, 0);
+		break;
+	case IB_CM_REQ_ERROR:
+	case IB_CM_REJ_RECEIVED:
+	case IB_CM_TIMEWAIT_EXIT:
+		ipoib_dbg(priv, "CM error %d.\n", event->event);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+
+		if (neigh) {
+			neigh->cm = NULL;
+			list_del(&neigh->list);
+			if (neigh->ah)
+				ipoib_put_ah(neigh->ah);
+			ipoib_neigh_free(neigh);
+
+			tx->neigh = NULL;
+		}
+
+		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+			list_move(&tx->list, &priv->cm.reap_list);
+			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		}
+
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				       struct ipoib_neigh *neigh)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_tx *tx;
+
+	tx = kzalloc(sizeof *tx, GFP_ATOMIC);
+	if (!tx)
+		return NULL;
+
+	neigh->cm = tx;
+	tx->neigh = neigh;
+	tx->path = path;
+	tx->dev = dev;
+	list_add(&tx->list, &priv->cm.start_list);
+	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
+	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	return tx;
+}
+
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
+	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+		list_move(&tx->list, &priv->cm.reap_list);
+		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		ipoib_dbg(priv, "Reap connection for gid " IPOIB_GID_FMT "\n",
+			  IPOIB_GID_ARG(tx->neigh->dgid));
+		tx->neigh = NULL;
+	}
+}
+
+void ipoib_cm_tx_start(void *dev_ptr)
+{
+	struct net_device *dev = dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh *neigh;
+	struct ipoib_cm_tx *p;
+	unsigned long flags;
+	int ret;
+
+	struct ib_sa_path_rec pathrec;
+	u32 qpn;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
+	while (!list_empty(&priv->cm.start_list)) {
+		p = list_entry(priv->cm.start_list.next, typeof(*p), list);
+		list_del_init(&p->list);
+		neigh = p->neigh;
+		qpn = IPOIB_QPN(neigh->neighbour->ha);
+		memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		ret = ipoib_cm_tx_init(p, qpn, &pathrec);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+		if (ret) {
+			neigh = p->neigh;
+			if (neigh) {
+				neigh->cm = NULL;
+				list_del(&neigh->list);
+				if (neigh->ah)
+					ipoib_put_ah(neigh->ah);
+				ipoib_neigh_free(neigh);
+			}
+			list_del(&p->list);
+			kfree(p);
+		}
+	}
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+void ipoib_cm_tx_reap(void *dev_ptr)
+{
+	struct net_device *dev = dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_tx *p;
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
+	while (!list_empty(&priv->cm.reap_list)) {
+		p = list_entry(priv->cm.reap_list.next, typeof(*p), list);
+		list_del(&p->list);
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		ipoib_cm_tx_destroy(p);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+	}
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+int ipoib_cm_dev_init(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_srq_init_attr srq_init_attr = {
+		.attr = {
+			.max_wr  = ipoib_recvq_size,
+			.max_sge = 1
+		}
+	};
+	int ret, i;
+
+	INIT_LIST_HEAD(&priv->cm.passive_ids);
+	INIT_LIST_HEAD(&priv->cm.reap_list);
+	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start, dev);
+	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap, dev);
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
+				    GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
+		       priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (ipoib_cm_alloc_rx_skb(dev, i)) {
+			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			return -ENOMEM;
+		}
+		if (ipoib_cm_post_receive(dev, i)) {
+			ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			return -EIO;
+		}
+	}
+
+	priv->dev->dev_addr[0] = IPOIB_FLAGS_RC;
+
+	return 0;
+}
+
+void ipoib_cm_dev_cleanup(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, ret;
+
+	ipoib_dbg(priv, "Cleanup ipoib connected mode data.\n");
+
+	if (!priv->cm.srq)
+		return;
+	ret = ib_destroy_srq(priv->cm.srq);
+	if (ret)
+		ipoib_warn(priv, "ib_destroy_srq failed: %d\n", ret);
+
+	priv->cm.srq = NULL;
+	if (!priv->cm.srq_ring)
+		return;
+	for (i = 0; i < ipoib_recvq_size; ++i)
+		if (priv->cm.srq_ring[i].skb) {
+			dma_unmap_single(priv->ca->dma_device,
+					 pci_unmap_addr(&priv->cm.srq_ring[i],
+							mapping),
+					 IPOIB_CM_BUF_SIZE,
+					 DMA_FROM_DEVICE);
+			dev_kfree_skb_any(priv->cm.srq_ring[i].skb);
+			priv->cm.srq_ring[i].skb = NULL;
+		}
+	kfree(priv->cm.srq_ring);
+	priv->cm.srq_ring = NULL;
+}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8bf5e9e..a4b2d21 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -119,6 +119,7 @@ static int ipoib_ib_post_receive(struct net_device *dev, int id)
 	return ret;
 }
 
+
 static int ipoib_alloc_rx_skb(struct net_device *dev, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -273,10 +274,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
 	++priv->tx_tail;
-	if (netif_queue_stopped(dev) &&
-	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) &&
-	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
+	if (test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags) &&
+	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1) {
 		netif_wake_queue(dev);
+	}
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -378,6 +379,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
 			netif_stop_queue(dev);
+			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		}
 	}
 }
@@ -429,6 +431,13 @@ int ipoib_ib_dev_open(struct net_device *dev)
 		return -1;
 	}
 
+	ret = ipoib_cm_dev_open(dev);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret);
+		ipoib_ib_dev_stop(dev);
+		return -1;
+	}
+
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
 	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
 
@@ -514,6 +523,8 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
 
+	ipoib_cm_dev_stop(dev);
+
 	/*
 	 * Move our QP to the error state and then reinitialize in
 	 * when all work requests have completed or have been flushed.
@@ -603,6 +614,8 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		return -ENODEV;
 	}
 
+	ipoib_cm_dev_init(dev);
+
 	if (dev->flags & IFF_UP) {
 		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
@@ -659,6 +672,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev, 1);
 	ipoib_mcast_dev_flush(dev);
 
+	ipoib_cm_dev_cleanup(dev);
 	ipoib_transport_dev_cleanup(dev);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 85522da..282c5ea 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -49,8 +49,6 @@
 
 #include <net/dst.h>
 
-#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
-
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -145,6 +143,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
+	clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+
 	/*
 	 * Now flush workqueue to make sure a scheduled task doesn't
 	 * bring our internal state back up.
@@ -177,14 +177,27 @@ static int ipoib_stop(struct net_device *dev)
 static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int old_mtu = dev->mtu;
+
+	/* Simple heuristic: dev->mtu > 2K ==> connected mode */
+	/* flush paths if we switch modes so that connections are restarted */
+	if (IPOIB_CM_ENABLED(dev->dev_addr) &&
+	    new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN &&
+	    new_mtu <= IPOIB_CM_MTU) {
+		dev->mtu = new_mtu;
+		if (old_mtu <= IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+			ipoib_flush_paths(dev);
+		return 0;
+	}
 
 	if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
-		return -EINVAL;
+		    return -EINVAL;
 
 	priv->admin_mtu = new_mtu;
-
 	dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
 
+	if (old_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+		ipoib_flush_paths(dev);
 	return 0;
 }
 
@@ -414,6 +427,18 @@ static void path_rec_completion(int status,
 			memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 			       sizeof(union ib_gid));
 
+			if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+				if (!neigh->cm)
+					neigh->cm = ipoib_cm_create_tx(dev, path, neigh);
+				if (!neigh->cm) {
+					list_del(&neigh->list);
+					if (neigh->ah)
+						ipoib_put_ah(neigh->ah);
+					ipoib_neigh_free(neigh);
+					continue;
+				}
+			}
+
 			while ((skb = __skb_dequeue(&neigh->queue)))
 				__skb_queue_tail(&skqueue, skb);
 		}
@@ -522,7 +547,22 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
 		memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 		       sizeof(union ib_gid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+			if (!neigh->cm)
+				neigh->cm = ipoib_cm_create_tx(dev, path, neigh);
+			if (!neigh->cm) {
+				list_del(&neigh->list);
+				if (neigh->ah)
+					ipoib_put_ah(neigh->ah);
+				ipoib_neigh_free(neigh);
+				goto err_drop;
+			}
+			if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
+				__skb_queue_tail(&neigh->queue, skb);
+			else
+				goto err_drop;
+		} else
+			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
 	} else {
 		neigh->ah  = NULL;
 		__skb_queue_tail(&neigh->queue, skb);
@@ -539,6 +579,7 @@ err_list:
 
 err_path:
 	ipoib_neigh_free(neigh);
+err_drop:
 	++priv->stats.tx_dropped;
 	dev_kfree_skb_any(skb);
 
@@ -641,7 +682,12 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (likely(neigh->ah)) {
+		if (ipoib_cm_get(neigh)) {
+			if (test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags)) {
+				ipoib_cm_send(dev, skb, neigh->cm);
+				goto out;
+			}
+		} else if (neigh->ah) {
 			if (unlikely(memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,
 					    sizeof(union ib_gid)))) {
@@ -805,6 +851,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 
 	neigh->neighbour = neighbour;
 	*to_ipoib_neigh(neighbour) = neigh;
+	neigh->cm = NULL;
 
 	return neigh;
 }
@@ -812,6 +859,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 void ipoib_neigh_free(struct ipoib_neigh *neigh)
 {
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
+	if (neigh->cm)
+		ipoib_cm_destroy_tx(neigh->cm);
 	kfree(neigh);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3faa182..14337e9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -594,7 +594,11 @@ void ipoib_mcast_join_task(void *dev_ptr)
 
 	priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) -
 		IPOIB_ENCAP_LEN;
-	dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
+
+	/* Simple heuristic: dev->mtu > 2K ==> connected mode.
+	 * In this case do not touch dev->mtu. */
+	if (dev->mtu <= IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+		dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 

-- 
MST
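
For readers who want to follow the MTU heuristic in ipoib_change_mtu()
without the rest of the driver, here is a minimal user-space sketch of the
same decision logic.  Everything prefixed with sketch_/SKETCH_ is
hypothetical: the constant values are assumptions standing in for
IPOIB_PACKET_SIZE, IPOIB_ENCAP_LEN and IPOIB_CM_MTU, and the flushes counter
merely records where the patch would call ipoib_flush_paths().

```c
#include <stdbool.h>

/* Assumed values for illustration only; the real ones live in ipoib.h. */
enum {
	SKETCH_PACKET_SIZE = 2048,   /* stands in for IPOIB_PACKET_SIZE */
	SKETCH_ENCAP_LEN   = 4,      /* stands in for IPOIB_ENCAP_LEN */
	SKETCH_CM_MTU      = 65520,  /* stands in for IPOIB_CM_MTU */
	SKETCH_UD_MTU_MAX  = SKETCH_PACKET_SIZE - SKETCH_ENCAP_LEN
};

struct sketch_dev {
	int  mtu;         /* current device MTU */
	int  admin_mtu;   /* MTU requested by the administrator */
	int  mcast_mtu;   /* MTU of the broadcast group */
	bool cm_enabled;  /* stands in for IPOIB_CM_ENABLED(dev->dev_addr) */
	int  flushes;     /* counts where ipoib_flush_paths() would run */
};

/* Returns 0 on success and -1 where the patch returns -EINVAL. */
static int sketch_change_mtu(struct sketch_dev *dev, int new_mtu)
{
	int old_mtu = dev->mtu;

	/* MTU above the datagram limit: only valid in connected mode. */
	if (dev->cm_enabled &&
	    new_mtu > SKETCH_UD_MTU_MAX && new_mtu <= SKETCH_CM_MTU) {
		dev->mtu = new_mtu;
		/* Switching from datagram to connected mode: flush paths. */
		if (old_mtu <= SKETCH_UD_MTU_MAX)
			dev->flushes++;
		return 0;
	}

	if (new_mtu > SKETCH_UD_MTU_MAX)
		return -1;

	dev->admin_mtu = new_mtu;
	dev->mtu = dev->mcast_mtu < dev->admin_mtu ?
		   dev->mcast_mtu : dev->admin_mtu;

	/* Switching from connected back to datagram mode: flush paths. */
	if (old_mtu > SKETCH_UD_MTU_MAX)
		dev->flushes++;
	return 0;
}
```

With these assumed constants, going from an MTU of 2044 to 65520 flushes the
paths once, a later change to 32768 does not, and dropping back to 1500
flushes them again.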


From swise at opengridcomputing.com  Tue Dec  5 08:27:12 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 10:27:12 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165334529.16087.69.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
	<45754DE3.1020505@ens-lyon.org>
	<1165334529.16087.69.camel@stevo-desktop>
Message-ID: <1165336032.16087.89.camel@stevo-desktop>

On Tue, 2006-12-05 at 10:02 -0600, Steve Wise wrote:
> On Tue, 2006-12-05 at 11:45 +0100, Brice Goglin wrote:
> > Steve Wise wrote:
> > > There is no SW TCP stack in this driver.  The HW supports RDMA over
> > > TCP/IP/10GbE in HW and this is required for zero-copy RDMA over Ethernet
> > > (aka iWARP).  The device is a 10 GbE device, not Infiniband.
> > 
> > Then, I wonder why the driver goes in drivers/infiniband/ :)
> 
> drivers/infiniband support both IB and IWARP transports.
> 
> > Is there really no way to only keep the actual hw infiniband there, move
> > iwarp/rdma drivers in drivers/net/something/ and the core stuff in
> > net/something/ ?
> > 
> 
> Sure, this _could_ be done, but what I think you're missing is that
> applications use the interface exported by drivers/infiniband over both
> IB -and- IWARP transports.  The application can be written to not care
> which transport is used.   Examples of apps that can run over both
> transports using the same common interface: 
> 
> user mode: MVAPICH2, OMPI, IMPI, HPMPI, 
> kernel mode: NFS-RDMA, iSER.  
> 
> Note that the include directory used by drivers/infiniband is now
> include/rdma.  Perhaps drivers/infiniband should be renamed to
> drivers/rdma as well at some point...


By the way, FYI:  The Chelsio T3 device support is split into 2 driver
modules: the Ethernet driver and the RDMA driver.  The Ethernet driver
lives in drivers/net/cxgb3 while the RDMA driver lives in
drivers/infiniband/hw/cxgb3.  The Ethernet driver can be used
stand-alone as a 10GbE high-performance NIC driver.  The RDMA driver has
a config-time dependency on the Ethernet driver.

The 2nd version of the Ethernet driver was posted yesterday.  See:

http://www.spinics.net/lists/netdev/msg20464.html


Steve.


From johnpol at 2ka.mipt.ru  Tue Dec  5 08:31:33 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 19:31:33 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165335162.16087.79.camel@stevo-desktop>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
Message-ID: <20061205163008.GA30211@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 10:12:42AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> Ah.  Data from an offloaded connection cannot leak into the main stack
> nor vice versa.  We can take an active RDMA connection establishment as
> an example if you want:  Once the message is sent to the HW to "set up a
> TCP connection from addr/port a.b to addr/port c.d", then packets on
> that connection (that 4-tuple) will always be delivered to the RDMA
> driver, not the native stack.  If the packet received after the
> connection is set up is -not- an MPA reply (in this example), then the
> connection is aborted.  Once the connection is aborted.  So no leaking
> can happen.
 
And if there was already a dataflow between addr/port a.b and addr/port c.d,
will it be terminated as well?

Consider the following sequence:
handlers->t3c_handlers->sched()->work_queue->work_handlers()->for
example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses the incoming
skb and sets port/addr/route and other fields to be used as a base for the rdma
connection. What if it is just a usual network packet from kernelspace or
userspace with the same payload as would be sent by a remote rdma system?

-- 
	Evgeniy Polyakov


From swise at opengridcomputing.com  Tue Dec  5 08:47:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 10:47:25 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205163008.GA30211@2ka.mipt.ru>
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
Message-ID: <1165337245.16087.95.camel@stevo-desktop>

On Tue, 2006-12-05 at 19:31 +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 05, 2006 at 10:12:42AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > Ah.  Data from an offloaded connection cannot leak into the main stack
> > nor vice versa.  We can take an active RDMA connection establishment as
> > an example if you want:  Once the message is sent to the HW to "set up a
> > TCP connection from addr/port a.b to addr/port c.d", then packets on
> > that connection (that 4-tuple) will always be delivered to the RDMA
> > driver, not the native stack.  If the packet received after the
> > connection is set up is -not- an MPA reply (in this example), then the
> > connection is aborted.  Once the connection is aborted.  So no leaking
> > can happen.
>  
> And if there were a dataflow between addr/port a.b to addr/port c.d
> already, it will either terminated?
> 
> Considering the following sequence:
> handlers->t3c_handlers->sched()->work_queue->work_handlers()->for
> example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming
> skb and sets port/addr/route and other fields to be used as a base for rdma
> connection. What if it just a usual network packet from kernelspace or 
> userspace with the same payload as should be sent by remote rdma system?
> 

That skb isn't a network packet.  It's a CPL_PASS_ACCEPT_REQ message (see
struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h).  If the
RDMA driver hadn't registered to listen on that addr/port, it would
never get this skb.  Once a connection is established, the MPA messages
(and any TCP payload data) are delivered to the RDMA driver in the form
of skb's containing struct cpl_rx_data.  So these skbs aren't just TCP
packets at all.  They are either control messages or TCP payload. Either way
they are encapsulated in CPL message structures.
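
As a rough illustration of that encapsulation, here is a user-space sketch
of classifying a buffer by a leading opcode.  It is not the driver's code:
the sketch_ types, the one-byte header and the opcode values are all
hypothetical, and the real message layouts are the ones in t3_cpl.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode values; the real ones are defined in t3_cpl.h. */
enum sketch_cpl_opcode {
	SKETCH_CPL_PASS_ACCEPT_REQ = 1,  /* control: incoming connection request */
	SKETCH_CPL_RX_DATA         = 2   /* TCP payload of an offloaded connection */
};

/* Hypothetical common header assumed to start every message. */
struct sketch_cpl_hdr {
	uint8_t opcode;
};

enum sketch_verdict {
	SKETCH_CONTROL,  /* connection setup/teardown message */
	SKETCH_PAYLOAD,  /* TCP payload wrapped in a message structure */
	SKETCH_DROP      /* too short, or an opcode nobody registered for */
};

/* Decide what a received buffer is by looking only at its header. */
static enum sketch_verdict sketch_cpl_classify(const void *buf, size_t len)
{
	const struct sketch_cpl_hdr *hdr = buf;

	if (buf == NULL || len < sizeof(*hdr))
		return SKETCH_DROP;

	switch (hdr->opcode) {
	case SKETCH_CPL_PASS_ACCEPT_REQ:
		return SKETCH_CONTROL;
	case SKETCH_CPL_RX_DATA:
		return SKETCH_PAYLOAD;
	default:
		return SKETCH_DROP;
	}
}
```

The point of the sketch is only that the receiver dispatches on a message
type chosen by the hardware, never on raw bytes taken off the wire.
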

Does this make sense?


From rdreier at cisco.com  Tue Dec  5 09:14:06 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 09:14:06 -0800
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <45754DE3.1020505@ens-lyon.org> (Brice Goglin's message of
	"Tue, 05 Dec 2006 11:45:55 +0100")
References: <20061202224917.27014.15424.stgit@dell3.ogc.int>
	<20061202224958.27014.65970.stgit@dell3.ogc.int>
	<20061204110825.GA26251@2ka.mipt.ru> <ada8xhnk6kv.fsf@cisco.com>
	<1165249251.32724.26.camel@stevo-desktop>
	<45754DE3.1020505@ens-lyon.org>
Message-ID: <adaslfufeox.fsf@cisco.com>

 > Is there really no way to only keep the actual hw infiniband there, move
 > iwarp/rdma drivers in drivers/net/something/ and the core stuff in
 > net/something/ ?

It's definitely possible, but rearranging the source tree hasn't been
a high priority (for me at least).

 - R.


From johnpol at 2ka.mipt.ru  Tue Dec  5 09:32:22 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 20:32:22 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205172649.GA20229@2ka.mipt.ru>
References: <ada8xhnk6kv.fsf@cisco.com> <20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
	<1165337245.16087.95.camel@stevo-desktop>
	<20061205172649.GA20229@2ka.mipt.ru>
Message-ID: <20061205173221.GB24149@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 08:26:49PM +0300, Evgeniy Polyakov (johnpol at 2ka.mipt.ru) wrote:
> On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > > And if there were a dataflow between addr/port a.b to addr/port c.d
> > > already, it will either terminated?
> > > 
> > > Considering the following sequence:
> > > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for
> > > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming
> > > skb and sets port/addr/route and other fields to be used as a base for rdma
> > > connection. What if it just a usual network packet from kernelspace or 
> > > userspace with the same payload as should be sent by remote rdma system?
> > > 
> > 
> > That skb isn't a network packet.  Its a CPL_PASS_ACCEPT_REQ message (see
> > struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h).  If the
> > RDMA driver hadn't registered to listen on that addr/port, it would
> > never get this skb.  Once a connection is established, the MPA messages
> > (and any TCP payload data) is delivered to the RDMA driver in the form
> > of skb's containing struct cpl_rx_data.  So these skbs aren't just TCP
> > packets at all.  They either control messages or TCP payload. Either way
> > they are encapsulated in CPL message structures.
> > 
> > Does this make sense?
>  
> Almost - except the case about where those skbs are coming from?
> It looks like they are obtained from network, since it is ethernet
> driver, and if they match some set of rules, they are considered as valid 
> MPA negotiation protocol.
> 
> If it is correct, it means that any packet in the network can be
> potentially 'stolen' by rdma hardware, although it was part of the usual
> dataflow. 
> If that packets are not from ethernet network, but from different
> low-level, then there is a question (besides why this driver is called
> ethernet if it manages different hardware) about how connection over
> that different media is being setup and since packets contain perfectly
> valid IP addresses and ports.

It looks like I've answered myself - it is _not_ an ethernet driver, but
an rdma one, and although it gets all data through skbs from the ethernet
driver, the latter does not get them from the ethernet network.
And thus addresses and ports and all other information cannot be mixed
between the two.

-- 
	Evgeniy Polyakov


From halr at voltaire.com  Tue Dec  5 09:26:56 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 12:26:56 -0500
Subject: [openib-general] [PATCH 2/5] opensm: trivial indentation fixes
 in osm_switch.h
In-Reply-To: <11645802143335-git-send-email-sashak@voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802143335-git-send-email-sashak@voltaire.com>
Message-ID: <1165339569.25587.73006.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> Couple of trivial indentation fixes in osm_switch.h.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From johnpol at 2ka.mipt.ru  Tue Dec  5 09:26:50 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 20:26:50 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165337245.16087.95.camel@stevo-desktop>
References: <20061204110825.GA26251@2ka.mipt.ru>
	<ada8xhnk6kv.fsf@cisco.com> <20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
	<1165337245.16087.95.camel@stevo-desktop>
Message-ID: <20061205172649.GA20229@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > And if there were a dataflow between addr/port a.b to addr/port c.d
> > already, it will either terminated?
> > 
> > Considering the following sequence:
> > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for
> > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming
> > skb and sets port/addr/route and other fields to be used as a base for rdma
> > connection. What if it just a usual network packet from kernelspace or 
> > userspace with the same payload as should be sent by remote rdma system?
> > 
> 
> That skb isn't a network packet.  Its a CPL_PASS_ACCEPT_REQ message (see
> struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h).  If the
> RDMA driver hadn't registered to listen on that addr/port, it would
> never get this skb.  Once a connection is established, the MPA messages
> (and any TCP payload data) is delivered to the RDMA driver in the form
> of skb's containing struct cpl_rx_data.  So these skbs aren't just TCP
> packets at all.  They either control messages or TCP payload. Either way
> they are encapsulated in CPL message structures.
> 
> Does this make sense?
 
Almost - except for the question of where those skbs are coming from.
It looks like they are obtained from the network, since it is an ethernet
driver, and if they match some set of rules, they are considered valid
MPA negotiation messages.

If that is correct, it means that any packet in the network can
potentially be 'stolen' by the rdma hardware, although it was part of the usual
dataflow.
If those packets are not from the ethernet network, but from a different
low-level medium, then there is a question (besides why this driver is called
ethernet if it manages different hardware) about how a connection over
that different medium is being set up, since the packets contain perfectly
valid IP addresses and ports.

And, btw, an unrelated question - does postponing the whole skb multiplexing
to a work queue result in lower latency and/or higher speed?
A lot of tricks have been introduced to minimize the gap between
interrupt/napi polling and protocol processing, so such a huge postponement
with a whole context switch looks strange.

-- 
	Evgeniy Polyakov


From halr at voltaire.com  Tue Dec  5 09:12:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 12:12:19 -0500
Subject: [openib-general] [PATCH 1/5] opensm: eliminate global variable
	osm in updn
In-Reply-To: <11645802093253-git-send-email-sashak@voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802093253-git-send-email-sashak@voltaire.com>
Message-ID: <1165338719.25587.72460.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> Routing engine setup function for up/down already gets reference to osm
> object as parameter - we can keep this reference as part of updn_t
> structure rather than to use global variable for referencing osm object.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From swise at opengridcomputing.com  Tue Dec  5 09:51:40 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 11:51:40 -0600
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205172649.GA20229@2ka.mipt.ru>
References: <20061204110825.GA26251@2ka.mipt.ru>
	<ada8xhnk6kv.fsf@cisco.com> <20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
	<1165337245.16087.95.camel@stevo-desktop>
	<20061205172649.GA20229@2ka.mipt.ru>
Message-ID: <1165341100.16087.109.camel@stevo-desktop>

On Tue, 2006-12-05 at 20:26 +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 05, 2006 at 10:47:25AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > > And if there were a dataflow between addr/port a.b to addr/port c.d
> > > already, it will either terminated?
> > > 
> > > Considering the following sequence:
> > > handlers->t3c_handlers->sched()->work_queue->work_handlers()->for
> > > example CPL_PASS_ACCEPT_REQ->pass_accept_req() - it just parses incoming
> > > skb and sets port/addr/route and other fields to be used as a base for rdma
> > > connection. What if it just a usual network packet from kernelspace or 
> > > userspace with the same payload as should be sent by remote rdma system?
> > > 
> > 
> > That skb isn't a network packet.  Its a CPL_PASS_ACCEPT_REQ message (see
> > struct cpl_pass_accept_req in the Ethernet driver t3_cpl.h).  If the
> > RDMA driver hadn't registered to listen on that addr/port, it would
> > never get this skb.  Once a connection is established, the MPA messages
> > (and any TCP payload data) is delivered to the RDMA driver in the form
> > of skb's containing struct cpl_rx_data.  So these skbs aren't just TCP
> > packets at all.  They either control messages or TCP payload. Either way
> > they are encapsulated in CPL message structures.
> > 
> > Does this make sense?
>  
> Almost - except the case about where those skbs are coming from?
> It looks like they are obtained from network, since it is ethernet
> driver, and if they match some set of rules, they are considered as valid 
> MPA negotiation protocol.

They come from the Ethernet driver, but that driver manages multiple HW
queues and these packets come from an offload queue, not the NIC queue.
So the HW demultiplexes.
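
To make "the HW demultiplexes" a bit more concrete, here is a software
sketch of the idea, assuming the decision is keyed on the connection
4-tuple.  I don't know how the T3 actually implements this, so treat the
table lookup and all sketch_ names as hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Connection 4-tuple: source/destination address and port. */
struct sketch_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

enum sketch_queue {
	SKETCH_NIC_QUEUE,     /* ordinary Ethernet frames for the native stack */
	SKETCH_OFFLOAD_QUEUE  /* traffic that belongs to an offloaded connection */
};

static bool sketch_tuple_eq(const struct sketch_tuple *a,
			    const struct sketch_tuple *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->sport == b->sport && a->dport == b->dport;
}

/*
 * Pick a queue for a packet: only 4-tuples that were explicitly offloaded
 * go to the offload queue, everything else stays on the NIC queue.
 */
static enum sketch_queue sketch_demux(const struct sketch_tuple *offloaded,
				      size_t n,
				      const struct sketch_tuple *pkt)
{
	for (size_t i = 0; i < n; i++)
		if (sketch_tuple_eq(&offloaded[i], pkt))
			return SKETCH_OFFLOAD_QUEUE;
	return SKETCH_NIC_QUEUE;
}
```

A packet that differs in even one field of the 4-tuple falls through to the
NIC queue, which is why the two kinds of traffic do not mix.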

Perhaps Divy or Felix from Chelsio can expand on how the Ethernet driver
manages this?

> 
> If it is correct, it means that any packet in the network can be
> potentially 'stolen' by rdma hardware, although it was part of the usual
> dataflow. 
> If that packets are not from ethernet network, but from different
> low-level, then there is a question (besides why this driver is called
> ethernet if it manages different hardware) about how connection over
> that different media is being setup and since packets contain perfectly
> valid IP addresses and ports.

The HW has different queues for offload vs native Ethernet frames.  I'm
not an expert on the Ethernet driver, so you'll have to consult that
code and ask questions of Divy and/or Felix.

> And, btw, not related question - does postponing the whole skb multiplexing 
> to work queue result in lower latency and/or higher speed?
> Since there are a lot of tricks introduced to minimize gap between
> interrupt/napi polling and protocol processing, so such huge postponing
> with the whole context switch looks strange.
> 

Neither.   The work queue makes the RDMA driver's life easier because it
has context to allocate skbs, for instance.  Note all the work queue
stuff is done _only_ for RDMA connection setup and teardown.  Once the
connection is in RDMA mode, there are no work queues at all for IO, and CQ
notifications happen in interrupt context.  RDMA operations are
submitted to the hardware via iwch_post_send().  Completion notification
is done in the interrupt context via iwch_ev_dispatch().  And completion
entries are reaped by the consumer application via iwch_poll_cq().
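
The split between "completion is signalled" and "completion is reaped" can
be sketched as a small ring buffer.  This is a single-threaded illustration
only, with no locking or memory barriers; the sketch_ names are hypothetical
and merely mirror the roles of the notification path and iwch_poll_cq(),
not their implementation.

```c
#include <stddef.h>

#define SKETCH_CQ_DEPTH 8  /* assumed queue depth, power of two */

struct sketch_cq {
	int entries[SKETCH_CQ_DEPTH];  /* completed work request ids */
	unsigned head;                 /* next slot the producer writes */
	unsigned tail;                 /* next slot the consumer reads */
};

/* Producer side: record one completion.  Returns -1 if the queue is full. */
static int sketch_cq_complete(struct sketch_cq *cq, int wr_id)
{
	if (cq->head - cq->tail == SKETCH_CQ_DEPTH)
		return -1;
	cq->entries[cq->head % SKETCH_CQ_DEPTH] = wr_id;
	cq->head++;
	return 0;
}

/* Consumer side: reap up to max completions, return how many were copied. */
static int sketch_cq_poll(struct sketch_cq *cq, int max, int *out)
{
	int n = 0;

	while (n < max && cq->tail != cq->head) {
		out[n++] = cq->entries[cq->tail % SKETCH_CQ_DEPTH];
		cq->tail++;
	}
	return n;
}
```

Polling an empty queue simply returns 0, so the consumer never has to block
or be scheduled on a work queue in the data path.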


Steve.


From halr at voltaire.com  Tue Dec  5 10:07:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 13:07:40 -0500
Subject: [openib-general] [PATCH 3/5] opensm: routing engine improvements
In-Reply-To: <1164580219695-git-send-email-sashak@voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<1164580219695-git-send-email-sashak@voltaire.com>
Message-ID: <1165342043.25587.74435.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> This prevents lid matrix rebuilding with up/down algorithm when it is
> not required (a.e. when root nodes are specified by user), consolidates
> routing engine methods and simplifies default LFT creation flow.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From johnpol at 2ka.mipt.ru  Tue Dec  5 10:09:39 2006
From: johnpol at 2ka.mipt.ru (Evgeniy Polyakov)
Date: Tue, 5 Dec 2006 21:09:39 +0300
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <1165341100.16087.109.camel@stevo-desktop>
References: <20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
	<1165337245.16087.95.camel@stevo-desktop>
	<20061205172649.GA20229@2ka.mipt.ru>
	<1165341100.16087.109.camel@stevo-desktop>
Message-ID: <20061205180939.GA26384@2ka.mipt.ru>

On Tue, Dec 05, 2006 at 11:51:40AM -0600, Steve Wise (swise at opengridcomputing.com) wrote:
> > Almost - except the case about where those skbs are coming from?
> > It looks like they are obtained from network, since it is ethernet
> > driver, and if they match some set of rules, they are considered as valid 
> > MPA negotiation protocol.
> 
> They come from the Ethernet driver, but that driver manages multiple HW
> queues and these packets come from an offload queue, not the NIC queue.
> So the HW demultiplexes.

Ok, thanks for the explanation.

-- 
	Evgeniy Polyakov


From xma at us.ibm.com  Tue Dec  5 10:11:19 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Tue, 5 Dec 2006 10:11:19 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <20061205161944.GD30209@mellanox.co.il>
Message-ID: <OF09AC5817.ACA3A1EE-ON8725723B.0063B936-8825723B.0063E81E@us.ibm.com>


Michael,

>The idea is to increase performance by increasing the MTU
>from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD.
>With this code, I'm able to get 800MByte/sec or more with netperf
>without options on a Mellanox 4x back-to-back DDR system.

What about CPU utilization?

Thanks
Shirley Ma

From mshefty at ichips.intel.com  Tue Dec  5 09:57:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 09:57:39 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <457582E8.8030705@dev.mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
Message-ID: <4575B313.1010604@ichips.intel.com>

Dotan Barak wrote:
> This test does the following scenario:
>    restart the driver
>    start a user level application that allocate N multicast groups (it 
> is being executed in the background)

How does the application allocate the multicast groups?  Does this involve the 
kernel multicast module?  The multicast module expects that all join/leave 
operations go through it.  Can you produce this crash using only the multicast 
code and ipoib?

- Sean


From rdreier at cisco.com  Tue Dec  5 10:26:16 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 10:26:16 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <4575B313.1010604@ichips.intel.com> (Sean Hefty's message
	of "Tue, 05 Dec 2006 09:57:39 -0800")
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com>
Message-ID: <adairgqfbcn.fsf@cisco.com>

 > How does the application allocate the multicast groups?  Does this involve the 
 > kernel multicast module?  The multicast module expects that all join/leave 
 > operations go through it.  Can you produce this crash using only the multicast 
 > code and ipoib?

Dotan attached the full source of the test.

It just seems to attach local QPs to MCGs without talking to the SA at
all.  So it's not using the multicast module, but on the other hand I
can't see why what it does would have any relevance to the crash.

 - R.


From steve.apo at googlemail.com  Tue Dec  5 10:27:27 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Tue, 5 Dec 2006 18:27:27 +0000
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could
	be wrong?
In-Reply-To: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
Message-ID: <2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com>

Hi again,

OK, so I've narrowed it down to the write() function returning -1,
indicating an error. The value of errno I get is EINVAL, which seems to
indicate the file descriptor is not valid. However, I've checked the file
descriptor value, it's listed in the lsof output, and all looks fine.
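
One way to separate a bad descriptor from a rejected request is to capture
errno right after the failing write().  On POSIX systems, writing to a
descriptor that is not open fails with EBADF rather than EINVAL, so EINVAL
suggests the descriptor itself is fine and something about the write was
refused.  The helper below is a hypothetical sketch and not part of the CM
library.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Attempt a write and report the failure reason.
 * Returns 0 if write() did not fail, otherwise the errno value it left.
 * A short write is treated as success here to keep the sketch small.
 */
static int sketch_write_errno(int fd, const void *buf, size_t len)
{
	errno = 0;
	if (write(fd, buf, len) < 0)
		return errno;
	return 0;
}
```

Calling it with a descriptor that was never opened, such as -1, returns
EBADF, which is a different value from the EINVAL seen here.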

Just looking at the code in cm.c, how does the CM_CREATE_MSG_CMD macro work?
I can't seem to see where the "msg" parameter gets to point to the "cmd"
parameter. Just curious, as I know that the cmpost example application works
fine.

Any ideas?

By the way, I'm using OFED 1.1

Thanks,

Steve.


On 05/12/06, Steven Wooding <steve.apo at googlemail.com> wrote:
>
> Hi,
>
> In my application I keep getting -1 returned by a call to ib_cm_send_req()
> function. The cmpost example application works fine, so I can rule out
> system set-up issues.
>
> I could do with a clue as to what the -1 means and then hopefully correct
> my application.
>
> Thanks,
>
> Steve.
>

From dotanb at dev.mellanox.co.il  Tue Dec  5 10:55:29 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Tue, 5 Dec 2006 20:55:29 +0200 (IST)
Subject: [openib-general] oops with multicast patches
In-Reply-To: <adairgqfbcn.fsf@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
Message-ID: <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il>

>
> Dotan attached the full source of the test.
>
> It just seems to attach local QPs to MCGs without talking to the SA at
> all.  So it's not using the multicast module, but on the other hand I
> can't see why what it does would have any relevance to the crash.
>
>  - R.

Roland is right: this test only attaches QPs to the mcg using a verbs call
(ibv_attach_mcast).

I hope to improve this test during the next week(s) and add support for the
multicast module (or library, if available).

Dotan


From rdreier at cisco.com  Tue Dec  5 10:57:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 10:57:38 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il>
	(dotanb@dev.mellanox.co.il's message of
	"Tue, 5 Dec 2006 20:55:29 +0200 (IST)")
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
	<4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il>
Message-ID: <adaejref9wd.fsf@cisco.com>

 > Roland is right: this test only attaches QPs to the mcg using verbs call
 > (ibv_attach_mcast).

What I don't understand is why the test has any effect on the kernel
at all.  How could creating QPs and attaching them to MCGs with verbs
calls cause the crash??

 - R.


From halr at voltaire.com  Tue Dec  5 10:58:24 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 13:58:24 -0500
Subject: [openib-general] [PATCH 4/5] opensm: clean non used LFT entries,
 update only changed blocks
In-Reply-To: <11645802241342-git-send-email-sashak@voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802241342-git-send-email-sashak@voltaire.com>
Message-ID: <1165345060.25587.76545.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> This uses temporary buffer (one per OpenSM) for LFT entries generation.
> In this way old (actually "invalid") LFT entries are not preserved
> anymore and we can send update requests for only changed LFT blocks
> rather than whole table rewriting.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From rdreier at cisco.com  Tue Dec  5 11:00:36 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 11:00:36 -0800
Subject: [openib-general] [PATCH] IB/ipath: Remove unused "write-only"
	variables
Message-ID: <adaac22f9rf.fsf@cisco.com>

Remove variables that are set but then never looked at in the ipath
driver.  These cleanups came from David Binderman's list of "set but
never used" warnings from icc.

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
Bryan, does this look OK to merge?

 drivers/infiniband/hw/ipath/ipath_driver.c    |    4 +---
 drivers/infiniband/hw/ipath/ipath_file_ops.c  |    5 ++---
 drivers/infiniband/hw/ipath/ipath_iba6110.c   |    3 +--
 drivers/infiniband/hw/ipath/ipath_iba6120.c   |    6 +++---
 drivers/infiniband/hw/ipath/ipath_init_chip.c |    3 +--
 drivers/infiniband/hw/ipath/ipath_intr.c      |    3 +--
 drivers/infiniband/hw/ipath/ipath_sysfs.c     |    3 ---
 7 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/ipath/ipath_driver.c b/drivers/infiniband/hw/ipath/ipath_driver.c
index 1aeddb4..ae7f21a 100644
--- a/drivers/infiniband/hw/ipath/ipath_driver.c
+++ b/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -1825,8 +1825,6 @@ void ipath_write_kreg_port(const struct
  */
 void ipath_shutdown_device(struct ipath_devdata *dd)
 {
-	u64 val;
-
 	ipath_dbg("Shutting down the device\n");
 
 	dd->ipath_flags |= IPATH_LINKUNK;
@@ -1849,7 +1847,7 @@ void ipath_shutdown_device(struct ipath_
 	 */
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_sendctrl, 0ULL);
 	/* flush it */
-	val = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
+	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	/*
 	 * enough for anything that's going to trickle out to have actually
 	 * done so.
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index a9ddc69..ddbcabd 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -699,7 +699,6 @@ static int ipath_manage_rcvq(struct ipat
 			     int start_stop)
 {
 	struct ipath_devdata *dd = pd->port_dd;
-	u64 tval;
 
 	ipath_cdbg(PROC, "%sabling rcv for unit %u port %u:%u\n",
 		   start_stop ? "en" : "dis", dd->ipath_unit,
@@ -729,7 +728,7 @@ static int ipath_manage_rcvq(struct ipat
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl,
 			 dd->ipath_rcvctrl);
 	/* now be sure chip saw it before we return */
-	tval = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
+	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	if (start_stop) {
 		/*
 		 * And try to be sure that tail reg update has happened too.
@@ -738,7 +737,7 @@ static int ipath_manage_rcvq(struct ipat
 		 * in memory copy, since we could overwrite an update by the
 		 * chip if we did.
 		 */
-		tval = ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port);
+		ipath_read_ureg32(dd, ur_rcvhdrtail, pd->port_port);
 	}
 	/* always; new head should be equal to new tail; see above */
 bail:
diff --git a/drivers/infiniband/hw/ipath/ipath_iba6110.c b/drivers/infiniband/hw/ipath/ipath_iba6110.c
index e57c7a3..7468477 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6110.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6110.c
@@ -1447,7 +1447,7 @@ static void ipath_ht_tidtemplate(struct
 static int ipath_ht_early_init(struct ipath_devdata *dd)
 {
 	u32 __iomem *piobuf;
-	u32 pioincr, val32, egrsize;
+	u32 pioincr, val32;
 	int i;
 
 	/*
@@ -1467,7 +1467,6 @@ static int ipath_ht_early_init(struct ip
 	 * errors interrupts if we ever see one).
 	 */
 	dd->ipath_rcvegrbufsize = dd->ipath_piosize2k;
-	egrsize = dd->ipath_rcvegrbufsize;
 
 	/*
 	 * the min() check here is currently a nop, but it may not
diff --git a/drivers/infiniband/hw/ipath/ipath_iba6120.c b/drivers/infiniband/hw/ipath/ipath_iba6120.c
index 6af8968..397da34 100644
--- a/drivers/infiniband/hw/ipath/ipath_iba6120.c
+++ b/drivers/infiniband/hw/ipath/ipath_iba6120.c
@@ -602,7 +602,7 @@ static void ipath_pe_init_hwerrors(struc
  */
 static int ipath_pe_bringup_serdes(struct ipath_devdata *dd)
 {
-	u64 val, tmp, config1, prev_val;
+	u64 val, config1, prev_val;
 	int ret = 0;
 
 	ipath_dbg("Trying to bringup serdes\n");
@@ -633,7 +633,7 @@ static int ipath_pe_bringup_serdes(struc
 		| INFINIPATH_SERDC0_L1PWR_DN;
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val);
 	/* be sure chip saw it */
-	tmp = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
+	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	udelay(5);		/* need pll reset set at least for a bit */
 	/*
 	 * after PLL is reset, set the per-lane Resets and TxIdle and
@@ -647,7 +647,7 @@ static int ipath_pe_bringup_serdes(struc
 		   "and txidle (%llx)\n", (unsigned long long) val);
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_serdesconfig0, val);
 	/* be sure chip saw it */
-	tmp = ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
+	ipath_read_kreg64(dd, dd->ipath_kregs->kr_scratch);
 	/* need PLL reset clear for at least 11 usec before lane
 	 * resets cleared; give it a few more to be sure */
 	udelay(15);
diff --git a/drivers/infiniband/hw/ipath/ipath_init_chip.c b/drivers/infiniband/hw/ipath/ipath_init_chip.c
index d819cca..d4f6b52 100644
--- a/drivers/infiniband/hw/ipath/ipath_init_chip.c
+++ b/drivers/infiniband/hw/ipath/ipath_init_chip.c
@@ -347,10 +347,9 @@ done:
 static int init_chip_reset(struct ipath_devdata *dd,
 			   struct ipath_portdata **pdp)
 {
-	struct ipath_portdata *pd;
 	u32 rtmp;
 
-	*pdp = pd = dd->ipath_pd[0];
+	*pdp = dd->ipath_pd[0];
 	/* ensure chip does no sends or receives while we re-initialize */
 	dd->ipath_control = dd->ipath_sendctrl = dd->ipath_rcvctrl = 0U;
 	ipath_write_kreg(dd, dd->ipath_kregs->kr_rcvctrl, 0);
diff --git a/drivers/infiniband/hw/ipath/ipath_intr.c b/drivers/infiniband/hw/ipath/ipath_intr.c
index 5652a55..72b9e27 100644
--- a/drivers/infiniband/hw/ipath/ipath_intr.c
+++ b/drivers/infiniband/hw/ipath/ipath_intr.c
@@ -598,10 +598,9 @@ static int handle_errors(struct ipath_de
 	 * on close
 	 */
 	if (errs & INFINIPATH_E_RRCVHDRFULL) {
-		int any;
 		u32 hd, tl;
 		ipath_stats.sps_hdrqfull++;
-		for (any = i = 0; i < dd->ipath_cfgports; i++) {
+		for (i = 0; i < dd->ipath_cfgports; i++) {
 			struct ipath_portdata *pd = dd->ipath_pd[i];
 			if (i == 0) {
 				hd = dd->ipath_port0head;
diff --git a/drivers/infiniband/hw/ipath/ipath_sysfs.c b/drivers/infiniband/hw/ipath/ipath_sysfs.c
index 182de34..ffa6318 100644
--- a/drivers/infiniband/hw/ipath/ipath_sysfs.c
+++ b/drivers/infiniband/hw/ipath/ipath_sysfs.c
@@ -215,7 +215,6 @@ static ssize_t store_mlid(struct device
 			  size_t count)
 {
 	struct ipath_devdata *dd = dev_get_drvdata(dev);
-	int unit;
 	u16 mlid;
 	int ret;
 
@@ -223,8 +222,6 @@ static ssize_t store_mlid(struct device
 	if (ret < 0 || mlid < IPATH_MULTICAST_LID_BASE)
 		goto invalid;
 
-	unit = dd->ipath_unit;
-
 	dd->ipath_mlid = mlid;
 
 	goto bail;
-- 
1.4.3.2
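
The pattern in this patch, where a register read is kept but the variable that held its result is dropped, could be sketched in plain C roughly as follows. This is only an illustration: `scratch_reg`, `read_reg64` and `flush_posted_write` are hypothetical stand-ins, not ipath driver code, and the read counter exists purely for the demo.

```c
#include <stdint.h>

static volatile uint64_t scratch_reg;	/* stand-in for a memory-mapped register */
static unsigned int reg_reads;		/* demo counter: how many reads were issued */

/* Hypothetical register accessor; real drivers use their own MMIO helpers. */
static uint64_t read_reg64(volatile uint64_t *reg)
{
	reg_reads++;
	return *reg;
}

/* The read is kept for its side effect; its value is deliberately discarded. */
static void flush_posted_write(void)
{
	(void)read_reg64(&scratch_reg);
}
```

The read still happens on every call, so nothing changes for the hardware; only the "set but never used" variable is gone.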


From rdreier at cisco.com  Tue Dec  5 11:10:07 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 11:10:07 -0800
Subject: [openib-general] [PATCH] IB/iser: Remove unused "write-only"
	variables
Message-ID: <ada64cqf9bk.fsf@cisco.com>

Remove variables that are set but then never looked at in the iSER
initiator.  These cleanups came from David Binderman's list of "set
but never used" warnings from icc.

Signed-off-by: Roland Dreier <rolandd at cisco.com>
---
Erez, does this look OK to merge?

 drivers/infiniband/ulp/iser/iser_initiator.c |    4 ----
 drivers/infiniband/ulp/iser/iser_memory.c    |    3 +--
 2 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 9b3d79c..e73c87b 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -487,10 +487,8 @@ int iser_send_control(struct iscsi_conn
 	struct iscsi_iser_conn *iser_conn = conn->dd_data;
 	struct iser_desc *mdesc = mtask->dd_data;
 	struct iser_dto *send_dto = NULL;
-	unsigned int itt;
 	unsigned long data_seg_len;
 	int err = 0;
-	unsigned char opcode;
 	struct iser_regd_buf *regd_buf;
 	struct iser_device *device;
 
@@ -512,8 +510,6 @@ int iser_send_control(struct iscsi_conn
 
 	iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE);
 
-	itt = ntohl(mtask->hdr->itt);
-	opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK;
 	data_seg_len = ntoh24(mtask->hdr->dlength);
 
 	if (data_seg_len > 0) {
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index 0606744..e5a1091 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -234,7 +234,7 @@ static int iser_sg_to_page_vec(struct is
 {
 	struct scatterlist *sg = (struct scatterlist *)data->buf;
 	dma_addr_t first_addr, last_addr, page;
-	int start_aligned, end_aligned;
+	int end_aligned;
 	unsigned int cur_page = 0;
 	unsigned long total_sz = 0;
 	int i;
@@ -248,7 +248,6 @@ static int iser_sg_to_page_vec(struct is
 		first_addr = sg_dma_address(&sg[i]);
 		last_addr  = first_addr + sg_dma_len(&sg[i]);
 
-		start_aligned = !(first_addr & ~MASK_4K);
 		end_aligned   = !(last_addr  & ~MASK_4K);
 
 		/* continue to collect page fragments till aligned or SG ends */
-- 
1.4.3.2


From mshefty at ichips.intel.com  Tue Dec  5 10:26:19 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 10:26:19 -0800
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could
 be wrong?
In-Reply-To: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
Message-ID: <4575B9CB.5070507@ichips.intel.com>

> In my application I keep getting -1 returned by a call to 
> ib_cm_send_req() function. The cmpost example application works fine, so 
> I can rule out system set-up issues.

This is probably an error being returned from the kernel.  Does errno give any 
more insight?
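
For anyone following along, a minimal sketch of capturing errno right after a failing call is below. It assumes only POSIX write() and strerror(); the helper name `try_write` is hypothetical and is not part of libibcm.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 on success, or the errno value
 * captured immediately after write() fails. */
static int try_write(int fd, const void *buf, size_t len)
{
	ssize_t n = write(fd, buf, len);

	if (n < 0) {
		int saved = errno;	/* save before any other call can clobber it */

		fprintf(stderr, "write(fd=%d) failed: %s (errno %d)\n",
			fd, strerror(saved), saved);
		return saved;
	}
	return 0;
}
```

Saving errno into a local first matters because later library calls, including fprintf(), are allowed to change it.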

- Sean


From mshefty at ichips.intel.com  Tue Dec  5 10:31:33 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 10:31:33 -0800
Subject: [openib-general] OFED 1.2 features update
In-Reply-To: <45759B8C.8010408@dev.mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il>
Message-ID: <4575BB05.7040106@ichips.intel.com>

> 4. Sean should prepare patches or git tree for kernel code that is not 
> upstream (e.g. SA cache)

I created a kernel git tree with branches for most of the code that was in svn, 
but not upstream.  (The SA cache is the last missing piece that needs to be 
added.)  Branches were made for the rdma_ucm, multicast support, utilities such 
as madeye, and kernel test apps.  Branches were also added to the librdmacm to 
match with the rdma_ucm and multicast branches.

- Sean


From rdreier at cisco.com  Tue Dec  5 11:12:48 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 11:12:48 -0800
Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in
 c2_qp_modify.
In-Reply-To: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>
	(Krishna Kumar's message of "Mon, 04 Dec 2006 09:14:57 +0530")
References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <ada1wnef973.fsf@cisco.com>

Looks right to me.  Tom/Steve, should I merge this?

 > --- org/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-15 12:40:04.000000000 +0530
 > +++ new/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-16 18:10:03.000000000 +0530
 > @@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
 >  
 >  	if (attr_mask & IB_QP_STATE) {
 >  		/* Ensure the state is valid */
 > -		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR)
 > -			return -EINVAL;
 > +		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) {
 > +			err = -EINVAL;
 > +			goto bail0;
 > +		}
 >  
 >  		wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state));
 >  
 > @@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
 >  		if (attr->cur_qp_state != IB_QPS_RTR &&
 >  		    attr->cur_qp_state != IB_QPS_RTS &&
 >  		    attr->cur_qp_state != IB_QPS_SQD &&
 > -		    attr->cur_qp_state != IB_QPS_SQE)
 > -			return -EINVAL;
 > -		else
 > +		    attr->cur_qp_state != IB_QPS_SQE) {
 > +			err = -EINVAL;
 > +			goto bail0;
 > +		} else
 >  			wr.next_qp_state =
 >  			    cpu_to_be32(to_c2_state(attr->cur_qp_state));
 >  
 > 


From dotanb at dev.mellanox.co.il  Tue Dec  5 11:15:03 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Tue, 5 Dec 2006 21:15:03 +0200 (IST)
Subject: [openib-general] oops with multicast patches
In-Reply-To: <adaejref9wd.fsf@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
	<4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il>
	<adaejref9wd.fsf@cisco.com>
Message-ID: <41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il>

>  > Roland is right: this test only attaches QPs to the mcg using verbs
> call
>  > (ibv_attach_mcast).
>
> What I don't understand is why the test has any effect on the kernel
> at all.  How could creating QPs and attaching them to MCGs with verbs
> calls cause the crash??

Every time IPoIB finds out that there is a new SM (through a client
reregister or LID change event), it tries to join several (3-4) mcgs.

The user level application uses almost all of the multicast groups in the
machine. Only 1 to 4 mcgs (depending on the test case) are left for IPoIB
to use: it starts to attach (and maybe join) all the mcgs that it needs
and fails when it reaches the HCA mcg limit.
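
A toy model of this scenario, written as ordinary user-space C, might look like the sketch below. Everything in it is an assumption made for illustration: `MCG_LIMIT`, `mcg_attach`, `mcg_detach` and `attach_all` are hypothetical names, the limit of 4 is arbitrary, and -ENOMEM is just a plausible error code. It is not the IPoIB or multicast module code.

```c
#include <errno.h>

#define MCG_LIMIT 4	/* assumed toy limit; a real HCA reports its own */

static int mcg_in_use;	/* groups currently attached, by anyone */

/* Hypothetical attach: fails once the shared limit is reached. */
static int mcg_attach(void)
{
	if (mcg_in_use >= MCG_LIMIT)
		return -ENOMEM;
	mcg_in_use++;
	return 0;
}

static void mcg_detach(void)
{
	if (mcg_in_use > 0)
		mcg_in_use--;
}

/* Attach to `want` groups; on failure undo the partial work so the
 * caller is left in a consistent state. */
static int attach_all(int want)
{
	int done, err = 0;

	for (done = 0; done < want; done++) {
		err = mcg_attach();
		if (err)
			break;
	}
	if (err)
		while (done-- > 0)
			mcg_detach();
	return err;
}
```

The point of the model is the failure path: when only part of the attaches succeed, whatever was attached has to be released again, and that unwinding is where bugs tend to hide.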

Maybe there is a problem with the multicast module when the attach to
multicast group verb fails?

Dotan


From swise at opengridcomputing.com  Tue Dec  5 11:17:17 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 05 Dec 2006 13:17:17 -0600
Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in
 c2_qp_modify.
In-Reply-To: <ada1wnef973.fsf@cisco.com>
References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>
	<ada1wnef973.fsf@cisco.com>
Message-ID: <1165346238.16087.128.camel@stevo-desktop>

Yes, this looks correct.

Sorry I missed this, or I would have acked it earlier...

Steve.

On Tue, 2006-12-05 at 11:12 -0800, Roland Dreier wrote:
> Looks right to me.  Tom/Steve, should I merge this?
> 
>  > --- org/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-15 12:40:04.000000000 +0530
>  > +++ new/drivers/infiniband/hw/amso1100/c2_qp.c	2006-11-16 18:10:03.000000000 +0530
>  > @@ -161,8 +161,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
>  >  
>  >  	if (attr_mask & IB_QP_STATE) {
>  >  		/* Ensure the state is valid */
>  > -		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR)
>  > -			return -EINVAL;
>  > +		if (attr->qp_state < 0 || attr->qp_state > IB_QPS_ERR) {
>  > +			err = -EINVAL;
>  > +			goto bail0;
>  > +		}
>  >  
>  >  		wr.next_qp_state = cpu_to_be32(to_c2_state(attr->qp_state));
>  >  
>  > @@ -184,9 +186,10 @@ int c2_qp_modify(struct c2_dev *c2dev, s
>  >  		if (attr->cur_qp_state != IB_QPS_RTR &&
>  >  		    attr->cur_qp_state != IB_QPS_RTS &&
>  >  		    attr->cur_qp_state != IB_QPS_SQD &&
>  > -		    attr->cur_qp_state != IB_QPS_SQE)
>  > -			return -EINVAL;
>  > -		else
>  > +		    attr->cur_qp_state != IB_QPS_SQE) {
>  > +			err = -EINVAL;
>  > +			goto bail0;
>  > +		} else
>  >  			wr.next_qp_state =
>  >  			    cpu_to_be32(to_c2_state(attr->cur_qp_state));
>  >  
>  > 


From rdreier at cisco.com  Tue Dec  5 11:18:08 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 11:18:08 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il>
	(dotanb@dev.mellanox.co.il's message of
	"Tue, 5 Dec 2006 21:15:03 +0200 (IST)")
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
	<4441.85.65.223.155.1165344929.squirrel@dev.mellanox.co.il>
	<adaejref9wd.fsf@cisco.com>
	<41181.194.90.237.34.1165346103.squirrel@dev.mellanox.co.il>
Message-ID: <adawt56dudr.fsf@cisco.com>

 > The user level application uses almost all of the multicast groups in the
 > machine. Only 1,2,3,4 mcgs (depends on the test case) are available for
 > the IPoIB to use: it start to attach (and maybe join) all the mcgs that it
 > needs and fails when it reaches the HCA mcgs limit.

Ohh... I see.

 > Maybe there is a problem with the multicast module when the attach to
 > multicast group verb fails?

I guess it would be in the ipoib driver error handling or how it
interacts with the multicast module, since the attach to multicast
happens in the ipoib driver, not the multicast module.

Sean, does this give you any ideas of what to look at?

 - R.


From bos at pathscale.com  Tue Dec  5 11:29:15 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Tue, 05 Dec 2006 11:29:15 -0800
Subject: [openib-general] [PATCH/RFC] busted request IRQ for PCIe ipath
	HCAs
In-Reply-To: <adafybvgevn.fsf@cisco.com>
References: <adafybvgevn.fsf@cisco.com>
Message-ID: <4575C88B.9000207@pathscale.com>

Roland Dreier wrote:
> Bryan/anyone at Qlogic, does this look right?  It worked for me, so if
> this is what was intended, I will queue the patch for 2.6.20 and
> submit to stable at kernel.org for 2.6.19.x.
>   
Yes, this looks correct to me.

    <b


From steve.apo at googlemail.com  Tue Dec  5 11:28:41 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Tue, 5 Dec 2006 19:28:41 +0000
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could
 be wrong?
In-Reply-To: <4575B9CB.5070507@ichips.intel.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
	<4575B9CB.5070507@ichips.intel.com>
Message-ID: <2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com>

Hi Sean,

Yeah, in my second post I said that errno was EINVAL just after the
ib_cm_send_req() call, which I assume was from the write() call. Or did you
mean something else?

Steve.

On 05/12/06, Sean Hefty <mshefty at ichips.intel.com> wrote:
>
> > In my application I keep getting -1 returned by a call to
> > ib_cm_send_req() function. The cmpost example application works fine, so
> > I can rule out system set-up issues.
>
> This is probably an error being returned from the kernel.  Does errno give
> any
> more insight?
>
> - Sean
>

From bos at pathscale.com  Tue Dec  5 11:31:15 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Tue, 05 Dec 2006 11:31:15 -0800
Subject: [openib-general] [PATCH] IB/ipath: Remove unused "write-only"
	variables
In-Reply-To: <adaac22f9rf.fsf@cisco.com>
References: <adaac22f9rf.fsf@cisco.com>
Message-ID: <4575C903.5010605@pathscale.com>

Roland Dreier wrote:
> Remove variables that are set but then never looked at in the ipath
> driver.  These cleanups came from David Binderman's list of "set but
> never used" warnings from icc.
> 
> Signed-off-by: Roland Dreier <rolandd at cisco.com>

Acked-by: Bryan O'Sullivan <bryan.osullivan at qlogic.com>

	<b


From ramachandra.kuchimanchi at qlogic.com  Tue Dec  5 11:33:28 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi)
Date: Tue, 5 Dec 2006 13:33:28 -0600
Subject: [openib-general] ib_send_cm_dreq() and cm_id doubt
Message-ID: <C07C40DB2364324799506DE8FF12F8D817C642@EPEXCH1.qlogic.org>


After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy the cm_id
without waiting for a DREP ? This is of course assuming that we are not really
concerned if the DREQ reached the other end or not.

Regards,
Ram

From sean.hefty at intel.com  Tue Dec  5 11:33:25 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Tue, 5 Dec 2006 11:33:25 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <adawt56dudr.fsf@cisco.com>
Message-ID: <000001c718a4$3c313730$8698070a@amr.corp.intel.com>

> > Maybe there is a problem with the multicast module when the attach to
> > multicast group verb fails?
>
>I guess it would be in the ipoib driver error handling or how it
>interacts with the multicast module, since the attach to multicast
>happens in the ipoib driver, not the multicast module.
>
>Sean, does this give you any ideas of what to look at?

I think so.  Thanks for the feedback.  Hopefully I can reproduce this now.


From halr at voltaire.com  Tue Dec  5 11:38:04 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 14:38:04 -0500
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down
 routing engines
In-Reply-To: <11645802302048-git-send-email-sashak@voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802302048-git-send-email-sashak@voltaire.com>
Message-ID: <1165347459.25587.78224.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> This updates "file" and "updn" (up/down) routing engines which should
> work properly now with changed LFT setup mechanism.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From tziporet at dev.mellanox.co.il  Tue Dec  5 11:50:44 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 05 Dec 2006 21:50:44 +0200
Subject: [openib-general] OFED 1.2 features update
In-Reply-To: <4575BB05.7040106@ichips.intel.com>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
Message-ID: <4575CD94.8070608@dev.mellanox.co.il>

Sean Hefty wrote:
>
> I created a kernel git tree with branches for most of the code that 
> was in svn, but not upstream.  (The SA cache is the last missing piece 
> that needs to be added.)  Branches were made for the rdma_ucm, 
> multicast support, utilities such as madeye, and kernel test apps.  
> Branches were also added to the librdmacm to match with the rdma_ucm 
> and multicast branches.
>
> - Sean

great - we will work to integrate them.
BTW - where are those trees located?

Tziporet


From mshefty at ichips.intel.com  Tue Dec  5 10:48:51 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 10:48:51 -0800
Subject: [openib-general] oops with multicast patches
In-Reply-To: <adairgqfbcn.fsf@cisco.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
Message-ID: <4575BF13.5010607@ichips.intel.com>

Roland Dreier wrote:
> It just seems to attach local QPs to MCGs without talking to the SA at
> all.  So it's not using the multicast module, but on the other hand I
> can't see why what it does would have any relevance to the crash.

 From his scenario, I thought that there were two applications running, but I 
could be off.  From looking at the attached file, I didn't see a relation 
between that code and any crash.  Dotan, can you clarify what you meant by 
"allocate N multicast groups" and "later application get the mcgs"?

 From your description, does this crash (always?) occur on the node that's 
running the SM?  I have not run this code in the SM node, so I will try this.

- Sean


From mshefty at ichips.intel.com  Tue Dec  5 11:22:14 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 11:22:14 -0800
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could
 be wrong?
In-Reply-To: <2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
	<2cfcf21e0612051027s2c1d45cbk134d0a6ac94f480@mail.gmail.com>
Message-ID: <4575C6E6.8060000@ichips.intel.com>

> OK, so I've narrowed it down to the write() function returning the -1,
> indicating an error. The value of errno I get is EINVAL, which would
> indicate that the file descriptor is not valid. However, I've checked the
> file descriptor value and its listing in the lsof output, and all looks fine.

My guess is that one of the values set in ib_cm_req_param is off.  It could be a 
byte-ordering issue, or maybe the path record has invalid fields.  Posting your 
cm_req_param values might help identify the problem.
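
As a generic illustration of the byte-ordering point (not the libibcm API), a 64-bit value can be put into big-endian (network) order independently of the host's endianness as sketched below. The helper name `host_to_be64` is hypothetical, and which request fields expect network order is something to check against the headers rather than take from this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return v laid out in big-endian (network) byte
 * order, independent of the host's endianness. */
static uint64_t host_to_be64(uint64_t v)
{
	unsigned char b[8];
	uint64_t out;
	int i;

	/* Fill from the least significant byte backwards, so b[0] ends up
	 * holding the most significant byte. */
	for (i = 7; i >= 0; i--) {
		b[i] = (unsigned char)(v & 0xff);
		v >>= 8;
	}
	memcpy(&out, b, sizeof(out));
	return out;
}
```

Dumping the raw bytes of each field before the call is a quick way to spot a value that was left in host order.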

- Sean


From dotanb at dev.mellanox.co.il  Tue Dec  5 12:23:40 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Tue, 5 Dec 2006 22:23:40 +0200 (IST)
Subject: [openib-general] oops with multicast patches
In-Reply-To: <4575BF13.5010607@ichips.intel.com>
References: <6C2C79E72C305246B504CBA17B5500C9076BD6@mtlexch01.mtl.com>
	<20061204142214.GA5426@mellanox.co.il>
	<20061204152624.GA8269@mellanox.co.il>
	<45746375.5010107@ichips.intel.com>
	<45756002.3030806@dev.mellanox.co.il>
	<457582E8.8030705@dev.mellanox.co.il>
	<4575B313.1010604@ichips.intel.com> <adairgqfbcn.fsf@cisco.com>
	<4575BF13.5010607@ichips.intel.com>
Message-ID: <46814.194.90.237.34.1165350220.squirrel@dev.mellanox.co.il>

> Roland Dreier wrote:
>> It just seems to attach local QPs to MCGs without talking to the SA at
>> all.  So it's not using the multicast module, but on the other hand I
>> can't see why what it does would have any relevance to the crash.
>
>  From his scenario, I thought that there were two applications running,
> but I
> could be off.  From looking at the attached file, I didn't see a relation
> between that code and any crash.  Dotan, can you clarify what you meant by
> "allocate N multicast groups" and "later application get the mcgs"?
No, there is only one application and a script that executes the SM and
the application (but it takes some time until this application allocates
all the mcgs it will use).

>
>  From your description, does this crash (always?) occur on the node that's
> running the SM?  I have not run this code in the SM node, so I will try
> this.
No.

I executed this test in a back-to-back configuration and both machines
hung. However, most of the time (4/5) the machine that executed the SM
got the crash ...

Dotan


From mshefty at ichips.intel.com  Tue Dec  5 12:02:41 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 12:02:41 -0800
Subject: [openib-general] ib_send_cm_dreq() and cm_id doubt
In-Reply-To: <C07C40DB2364324799506DE8FF12F8D817C642@EPEXCH1.qlogic.org>
References: <C07C40DB2364324799506DE8FF12F8D817C642@EPEXCH1.qlogic.org>
Message-ID: <4575D061.3010808@ichips.intel.com>

> After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy the cm_id
>  without waiting for a DREP ? This is of course assuming that we are not 
> really concerned if the DREQ reached the other end or not.

Yes - you can even destroy the cm_id before calling ib_send_cm_dreq(), which 
will result in sending a DREQ if the cm_id is still connected.

- Sean


From mshefty at ichips.intel.com  Tue Dec  5 12:03:52 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 05 Dec 2006 12:03:52 -0800
Subject: [openib-general] OFED 1.2 features update
In-Reply-To: <4575CD94.8070608@dev.mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
Message-ID: <4575D0A8.7080501@ichips.intel.com>

> BTW - where are those trees located?

My trees are available from the staging.openfabrics.org/git site.  I called the 
kernel tree rdma-dev.

- Sean


From rdreier at cisco.com  Tue Dec  5 12:27:56 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 12:27:56 -0800
Subject: [openib-general] [PATCH] RDMA/amso1100: Fix memory leak in
 c2_qp_modify.
In-Reply-To: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>
	(Krishna Kumar's message of "Mon, 04 Dec 2006 09:14:57 +0530")
References: <20061204034457.5175.59086.sendpatchset@K50wks273871wss.in.ibm.com>
Message-ID: <adak616dr5f.fsf@cisco.com>

OK, applied for 2.6.20
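
The shape of the fix, replacing an early return with a jump to a shared cleanup label, can be sketched in user-space C as below. This is a minimal sketch under stated assumptions: `modify_state` and `live_allocs` are hypothetical, malloc()/free() stand in for whatever the driver allocates before the check, and only the `bail0` label name is borrowed from the patch.

```c
#include <errno.h>
#include <stdlib.h>

static int live_allocs;	/* counts outstanding allocations for the demo */

/* Hypothetical stand-in for a function that allocates a request up front
 * and must release it on every exit path. */
static int modify_state(int new_state, int max_state)
{
	int err = 0;
	void *req = malloc(64);

	if (!req)
		return -ENOMEM;
	live_allocs++;

	if (new_state < 0 || new_state > max_state) {
		err = -EINVAL;	/* a bare "return -EINVAL" here would leak req */
		goto bail0;
	}

	/* ... use req to carry out the state change ... */

bail0:
	free(req);
	live_allocs--;
	return err;
}
```

With a single exit label, any validation check added later gets the cleanup for free as long as it jumps to the label instead of returning directly.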


From halr at voltaire.com  Tue Dec  5 12:31:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 05 Dec 2006 15:31:18 -0500
Subject: [openib-general] IPoIB and MC Group leaving
In-Reply-To: <1165248082.25587.8839.camel@hal.voltaire.com>
References: <1165243803.25587.5906.camel@hal.voltaire.com>
	<ada4psbk6f3.fsf@cisco.com>
	<1165248082.25587.8839.camel@hal.voltaire.com>
Message-ID: <1165350656.25587.80533.camel@hal.voltaire.com>

On Mon, 2006-12-04 at 11:01, Hal Rosenstock wrote:
> On Mon, 2006-12-04 at 10:49, Roland Dreier wrote:
> >  > This is to make sure node is not registered in any groups. This leave
> >  > may not be successful. Failure is "normal" when the subnet is starting
> >  > up "fresh". There are other cases where the failure is indeed a failure.
> > 
> > As far as I know, IPoIB will not leave a group unless it thinks it has
> > joined the group.  What is the code path for a "preemptive" leave?
> 
> OK maybe I have that part wrong but what about the other part:

Roland,

> The fact that a leave doesn't wait for the response and then a join is
> issued. I think there is a race condition here perhaps triggered by
> client reregistration.

Are the leave and join for the same port and group done by the same or
different threads within IPoIB ? Is there any way they can be reordered
so that the join occurs before the leave rather than the other way
around ? It does appear that the leave is sent once and only once (it is
not retried as far as I can tell).

-- Hal

> -- Hal
> 
> >  - R.
> 
> 


From rdreier at cisco.com  Tue Dec  5 12:37:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 12:37:12 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	(Ralph Campbell's message of "Fri, 1 Dec 2006 18:08:42 -0800 (PST)")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
Message-ID: <adad56ydqpz.fsf@cisco.com>

I think this seems reasonable.  And I think it also provides a way to
address some hypothetical future situation where lowmem pages don't
have a kernel virtual address -- you would just have to use this
type of cookie implementation everywhere.

(Although I don't think using kmap()/kunmap() is really the right
approach -- you should probably just do kmap_atomic()/kunmap_atomic()
only while you are actually using the page.  But the basic approach of
using the dma address as a cookie into a mapping table seems sound to
me -- you are basically doing a real sw iotlb)

Or -- does this seem reasonable to you?

 - R.


From rdreier at cisco.com  Tue Dec  5 12:40:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 12:40:03 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	(Ralph Campbell's message of "Fri, 1 Dec 2006 18:08:42 -0800 (PST)")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
Message-ID: <ada8xhmdql8.fsf@cisco.com>

Something weird happened with your mail setup -- your email seemed to
come from ralph.campbel at qlogic.com (only one "L" in your last name).
Anyway I assume you saw my response on the email list.

Also I forgot to mention one more thing: if you repost your patch set
with the sync_single fix that Or found then I am inclined to merge
this for 2.6.20 unless Or or someone else objects.

 - R.


From rdreier at cisco.com  Tue Dec  5 12:47:14 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 12:47:14 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <20061205161944.GD30209@mellanox.co.il> (Michael S.
	Tsirkin's message of "Tue, 5 Dec 2006 18:19:44 +0200")
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
Message-ID: <ada1wnedq99.fsf@cisco.com>

OK, just a very quick scan through:

 > +ib_ipoib-$(INFINIBAND_IPOIB_CM)			+= ipoib_cm.o

Does this actually work in the Makefile without the CONFIG_ prefix?  I
don't think it's intended anyway...

 > +#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
 > +
 > +

trim one of these blank lines...

 > +	IPOIB_CM_MTU              = 0x10000 - 0x10, /* padding to align header to 16 */
 > +	IPOIB_CM_BUF_SIZE         = IPOIB_CM_MTU  + IPOIB_ENCAP_LEN,

 > +	skb = dev_alloc_skb(IPOIB_CM_BUF_SIZE + 12);

This means every RX buffer is an order-4 allocation (with 4K pages).
I think that has to be fixed for us to consider this, or else
connected mode is basically useless on a loaded system.
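
The arithmetic behind the order-4 figure can be checked with a small
user-space sketch. It is only an illustration: the sketch_ names are
hypothetical, IPOIB_ENCAP_LEN is assumed to be 4, and the extra headroom
and bookkeeping that the skb allocator adds on top of the requested size
is ignored (it can only make the real allocation larger).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Constants mirror the quoted patch. SKETCH_ENCAP_LEN = 4 is an
 * assumption standing in for IPOIB_ENCAP_LEN.
 */
enum {
	SKETCH_PAGE_SIZE   = 4096,             /* 4K pages */
	SKETCH_ENCAP_LEN   = 4,
	SKETCH_CM_MTU      = 0x10000 - 0x10,
	SKETCH_CM_BUF_SIZE = SKETCH_CM_MTU + SKETCH_ENCAP_LEN,
};

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
static unsigned int sketch_alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}
```

Under these assumptions the requested size, SKETCH_CM_BUF_SIZE + 12, is
exactly 65536 bytes, so sketch_alloc_order() gives 4: sixteen physically
contiguous 4K pages for every receive buffer.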

 > +	IPOIB_FLAG_NETIF_STOPPED  = 9,

I can't follow what this is used for.  Can you explain in small words?

Why is this:

 > +struct ipoib_cm_dev_priv {
 > +	struct ib_cq  	    *cq;
 > +	struct ib_srq  	    *srq;
 > +	struct ipoib_rx_buf *srq_ring;
 > +	struct ib_cm_id     *id;
 > +	struct list_head     passive_ids;
 > +	struct work_struct   start_task;
 > +	struct work_struct   reap_task;
 > +	struct list_head     start_list;
 > +	struct list_head     reap_list;
 > +	struct ib_wc         ibwc[IPOIB_NUM_WC];
 > +};
 > +
 >  /*
 >   * Device private locking: tx_lock protects members used in TX fast
 >   * path (and we use LLTX so upper layers don't do extra locking).
 > @@ -179,6 +226,8 @@ struct ipoib_dev_priv {
 >  	struct list_head child_intfs;
 >  	struct list_head list;
 >  
 > +	struct ipoib_cm_dev_priv cm;
 > +

outside of CONFIG_INFINIBAND_IPOIB_CM (so struct ipoib_dev_priv is
significantly larger even with CM off), but this:

 > +#ifdef CONFIG_INFINIBAND_IPOIB_CM
 > +
 > +#define IPOIB_FLAGS_RC          0x80
 > +#define IPOIB_FLAGS_UC          0x40

is inside?

 > +#define IPOIB_CM_ENABLED(ha)   (ha[0] & IPOIB_FLAGS_RC)

Should that be

+#define IPOIB_CM_ENABLED(ha)   (ha[0] & (IPOIB_FLAGS_RC | IPOIB_FLAGS_UC))

(I know you don't implement UC at all but if you're going to define
the flag, there's no point in setting a trap for the future...)
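
To make the difference concrete, here is a minimal user-space sketch of
the two variants. The flag values are the ones from the quoted patch; the
SKETCH_ macros and sketch_ helpers are hypothetical names used only for
this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define IPOIB_FLAGS_RC 0x80
#define IPOIB_FLAGS_UC 0x40

/* As posted: only the RC bit in the first address byte is tested. */
#define SKETCH_CM_ENABLED_RC_ONLY(ha) ((ha)[0] & IPOIB_FLAGS_RC)

/* As suggested: either connected-mode bit counts. */
#define SKETCH_CM_ENABLED_ANY(ha) \
	((ha)[0] & (IPOIB_FLAGS_RC | IPOIB_FLAGS_UC))

/* Thin wrappers that normalize the macro result to 0 or 1. */
static int sketch_rc_only(const unsigned char *ha)
{
	assert(ha != NULL);
	return SKETCH_CM_ENABLED_RC_ONLY(ha) != 0;
}

static int sketch_any(const unsigned char *ha)
{
	assert(ha != NULL);
	return SKETCH_CM_ENABLED_ANY(ha) != 0;
}
```

A hardware address that advertises only UC is reported as not
CM-capable by the first form and as CM-capable by the second, which is
the trap being pointed out.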

 - R.


From or.gerlitz at gmail.com  Tue Dec  5 13:09:15 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 5 Dec 2006 23:09:15 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1164910957.14800.71.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
Message-ID: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>

On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
> diff -r c76ed2f1387b include/rdma/ib_verbs.h
> --- a/include/rdma/ib_verbs.h   Wed Nov 29 13:28:14 2006 +0800
> +++ b/include/rdma/ib_verbs.h   Wed Nov 29 13:54:37 2006 -0800
> +struct ib_dma_mapping_ops {
> +       int             (*mapping_error)(struct ib_device *dev,
> +                                        u64 dma_addr);
> +       u64             (*map_single)(struct ib_device *dev,
> +                                     void *ptr, size_t size,
> +                                     enum dma_data_direction direction);
> +       void            (*unmap_single)(struct ib_device *dev,
> +                                       u64 addr, size_t size,
> +                                       enum dma_data_direction direction);
> +       u64             (*map_page)(struct ib_device *dev,
> +                                   struct page *page, unsigned long offset,
> +                                   size_t size,
> +                                   enum dma_data_direction direction);
> +       void            (*unmap_page)(struct ib_device *dev,
> +                                     u64 addr, size_t size,
> +                                     enum dma_data_direction direction);
> +       int             (*map_sg)(struct ib_device *dev,
> +                                 struct scatterlist *sg, int nents,
> +                                 enum dma_data_direction direction);
> +       void            (*unmap_sg)(struct ib_device *dev,
> +                                   struct scatterlist *sg, int nents,
> +                                   enum dma_data_direction direction);
> +       u64             (*dma_address)(struct ib_device *dev,
> +                                      struct scatterlist *sg);
> +       unsigned int    (*dma_len)(struct ib_device *dev,
> +                                  struct scatterlist *sg);
> +       void            (*sync_single_for_cpu)(struct ib_device *dev,
> +                                              u64 dma_handle,
> +                                              size_t size,
> +                                              enum dma_data_direction dir);
> +       void            (*sync_single_for_device)(struct ib_device *dev,
> +                                                 u64 dma_handle,
> +                                                 size_t size,
> +                                                 enum dma_data_direction dir);
>  };

This structure is missing some functions which are members of struct
dma_mapping_ops.

The most notable omission to me is dma_alloc/free_coherent. Please note
that an IB consumer can call dma_alloc_coherent and place the resulting
dma_addr_t within an SGE provided to ibv_post_send/recv; see the RDS
code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct
usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under
http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds

I also see in struct dma_mapping_ops something called
dma_map_simple; I am not sure what it does or who can use it.

Or.


From or.gerlitz at gmail.com  Tue Dec  5 13:21:29 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 5 Dec 2006 23:21:29 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adad56ydqpz.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
Message-ID: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>

On 12/5/06, Roland Dreier <rdreier at cisco.com> wrote:
> I think this seems reasonable.  And I think it also provides a way to
> address some hypothetical future situation where lowmem pages don't
> have a kernel virtual address -- you would just have to use this
> type of cookie implementation everywhere.

Such an approach would be much cleaner and would result in far fewer
(~zero) changes at the ULP level: just replace dma_map_xxx calls with
ib_dma_map_xxx calls.

A problem I see with the dma_addr_t being a cookie into a table of kv
addresses is that it's legal for a consumer to use a dma_addr_t with an
**offset**. So she gets addr y from ib_dma_map_xxx and then uses y +
offset in the SGE provided to ibv_post_send/recv or to the fmr map
function.

So this table is actually a search tree which allows you to match an
offset-ed dma_addr_t, returned by the dma_map_xxx called by ipath's
ib_dma_map_xxx, with its associated kvaddr.

I see now that I have managed to confuse myself, b/c as Roland wrote
below (and I agreed) we don't actually have the kv addr for an
unmapped page before the ipath driver maps it, i.e. when it attempts to
use the page... It is getting late here... am I inventing a nonexistent
problem with the offset?

> (Although I don't think using kmap()/kunmap() is really the right
> approach -- you should probably just do kmap_atomic()/kunmap_atomic()
> only while you are actually using the page.  But the basic approach of
> using the dma address as a cookie into a mapping table seems sound to
> me -- you are basically doing a real sw iotlb)
>
> Or -- does this seem reasonable to you?

I agree that care should be taken to do kmap_atomic/kunmap_atomic only
when the ipath driver actually needs to access the page.

Or.


From or.gerlitz at gmail.com  Tue Dec  5 13:24:55 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Tue, 5 Dec 2006 23:24:55 +0200
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
Message-ID: <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>

On 12/1/06, Sean Hefty <sean.hefty at intel.com> wrote:
> The following set of patches expand the rdma_cm support to include
> UDP port space, and expose the rdma_cm to userspace.  Multicast
> support has been removed from the patches until the ib_multicast
> module can be further debugged.
>
> Adding in multicast support later will result in new APIs and an
> ABI bump, but I do not anticipate multicast changing any of the
> existing interfaces.  (I'm also less confident that the multicast
> ABIs are correct.)
>
> Without the multicast interfaces, I believe what's left is ready to
> merge upstream.

Roland,

What's the status of this patchset? It would be very useful
to have rdma cm user space support enabled in 2.6.20, and without
the multicast code I don't see why it shouldn't be merged.

Or.


From rdreier at cisco.com  Tue Dec  5 13:28:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 13:28:20 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
	(Or Gerlitz's message of "Tue, 5 Dec 2006 23:09:15 +0200")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
Message-ID: <adamz62c9sb.fsf@cisco.com>

 > This structure misses some functions which are members of  struct
 > dma_mapping_ops.

I don't think we have to wrap every possible function if no IB
consumer uses it.

 > The most notable miss to me is dma_alloc/free_coherent, please note
 > that an IB consumer can call dma_alloc_coherent and place the resulted
 > dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS
 > code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct
 > usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under
 > under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds

Given that some use of the dma_alloc_coherent interface exists though,
I do think it makes sense to wrap it.  So Ralph can you please add
that to your resubmission too (in addition to fixing the sync_single
issue).

Any other issues Or?  (BTW thanks for helping review this and pointing
out some good issues)


From rdreier at cisco.com  Tue Dec  5 13:31:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 13:31:20 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	(Or Gerlitz's message of "Tue, 5 Dec 2006 23:21:29 +0200")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
Message-ID: <adairgqc9nb.fsf@cisco.com>

 > A problem  see with the dma_addr_t being a cookie into a table of kv
 > addresses is that its legal for a consumer to use dma_addr_t with an
 > **offset** . So she gets addr y from ib_dma_map_xxx and then uses y +
 > offset in the SGE provided to ibv_post_send/recv or to the fmr map
 > function.

Yes, that is a little bit of an issue.  But I think it just means the
ipath driver needs to keep page tables exactly the way an IOTLB would
-- ugly but not impossible to handle.
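
As a rough illustration of what "page tables the way an IOTLB would"
could look like, here is a user-space sketch in which the cookie carries
a table slot in its high bits and the byte offset in its low bits, so
that cookie + offset still resolves as long as the result stays within
the same page. Everything here (the sketch_ names, the fixed 64-slot
table, the 4K page size) is an assumption made for the example and is
not taken from the ipath driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)
#define SKETCH_SLOTS      64

struct sketch_iotlb {
	void *page[SKETCH_SLOTS];	/* NULL marks a free slot */
};

/* Record a page and return a cookie, or UINT64_MAX if the table is full. */
static uint64_t sketch_map_page(struct sketch_iotlb *t, void *page_addr,
				size_t offset)
{
	size_t slot;

	assert(page_addr != NULL);
	assert(offset < SKETCH_PAGE_SIZE);
	for (slot = 0; slot < SKETCH_SLOTS; slot++) {
		if (t->page[slot] == NULL) {
			t->page[slot] = page_addr;
			return ((uint64_t)slot << SKETCH_PAGE_SHIFT) | offset;
		}
	}
	return UINT64_MAX;
}

/* Translate a (possibly offset) cookie back to an address, or NULL. */
static void *sketch_lookup(const struct sketch_iotlb *t, uint64_t cookie)
{
	uint64_t slot = cookie >> SKETCH_PAGE_SHIFT;
	uint64_t off = cookie & (SKETCH_PAGE_SIZE - 1);

	if (slot >= SKETCH_SLOTS || t->page[slot] == NULL)
		return NULL;
	return (char *)t->page[slot] + off;
}

/* Release the slot that a cookie refers to. */
static void sketch_unmap_page(struct sketch_iotlb *t, uint64_t cookie)
{
	uint64_t slot = cookie >> SKETCH_PAGE_SHIFT;

	if (slot < SKETCH_SLOTS)
		t->page[slot] = NULL;
}
```

An offset that crosses a page boundary would land in the next slot, so a
multi-page mapping would need consecutive slots, which is exactly the
bookkeeping a real IOTLB has to do.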

 > I see now that i have managed to confuse myself b/c as Roland wrote
 > below and i have agreed we don't actually have the kv addr for and
 > unmapped page before the ipath driver maps it ie when it attempt to
 > use the page... It becomes late here... am i inventing a non existant
 > problem with the offset?

The dma address doesn't have to be a kvaddr -- it is purely an address
space defined by the low-level driver.

 - R.


From rdreier at cisco.com  Tue Dec  5 13:32:11 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 13:32:11 -0800
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>
	(Or Gerlitz's message of "Tue, 5 Dec 2006 23:24:55 +0200")
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>
Message-ID: <adaejrec9lw.fsf@cisco.com>

 > What's the status of this patchset? it would be somehow very usefull
 > to have rdma cm user space support enablement in 2.6.20 and without
 > the multicast code i don't see why not merging it.

I would like to merge it, but I need to find time to read it over
carefully.  Have you read this patch set over?  Do you have any
comments about anything?

 - R.


From rdreier at cisco.com  Tue Dec  5 13:52:14 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 13:52:14 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
Message-ID: <adavekqau41.fsf@cisco.com>

Reading a little more:

 > +	/* Simple heuristic: dev->mtu > 2K ==> connected mode */

I'm not sure this is such a good idea.  I think it's setting a trap
for people if we have magic behavior -- eg just imagine the questions
if changing the MTU makes multicast stop working.

 - R.


From ralph.campbell at qlogic.com  Tue Dec  5 13:58:16 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 05 Dec 2006 13:58:16 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adapsb3ky1r.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<adapsb3ky1r.fsf@cisco.com>
Message-ID: <1165355896.14800.185.camel@brick.pathscale.com>

On Fri, 2006-12-01 at 15:15 -0800, Roland Dreier wrote:
> Oh yeah, one other thing...
> 
> could you respin this so that all the new dma_xxx wrappers go into a
> new file like <rdma/ib_dma_mapping.h> (and include that from
> <rdma/ib_verbs.h>)?  ib_verbs.h is already too big I think.

I can move the definition for struct ib_dma_mapping_ops to a
separate header file but if I move the inline functions
and include the header file at the top of ib_verbs.h,
then the struct ib_device is not defined and the compiler
complains.  I could put the #include <rdma/ib_dma_mapping.h>
after the definition of struct ib_device but I'm not sure
how acceptable that is for coding style.

Do you still want me to make this change?


From ralph.campbell at qlogic.com  Tue Dec  5 14:20:52 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 05 Dec 2006 14:20:52 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
Message-ID: <1165357252.14800.192.camel@brick.pathscale.com>

On Tue, 2006-12-05 at 23:09 +0200, Or Gerlitz wrote:
> On 11/30/06, Ralph Campbell <ralph.campbell at qlogic.com> wrote:
> > diff -r c76ed2f1387b include/rdma/ib_verbs.h
> > --- a/include/rdma/ib_verbs.h   Wed Nov 29 13:28:14 2006 +0800
> > +++ b/include/rdma/ib_verbs.h   Wed Nov 29 13:54:37 2006 -0800
> > +struct ib_dma_mapping_ops {
> > +       int             (*mapping_error)(struct ib_device *dev,
> > +                                        u64 dma_addr);
> > +       u64             (*map_single)(struct ib_device *dev,
> > +                                     void *ptr, size_t size,
> > +                                     enum dma_data_direction direction);
> > +       void            (*unmap_single)(struct ib_device *dev,
> > +                                       u64 addr, size_t size,
> > +                                       enum dma_data_direction direction);
> > +       u64             (*map_page)(struct ib_device *dev,
> > +                                   struct page *page, unsigned long offset,
> > +                                   size_t size,
> > +                                   enum dma_data_direction direction);
> > +       void            (*unmap_page)(struct ib_device *dev,
> > +                                     u64 addr, size_t size,
> > +                                     enum dma_data_direction direction);
> > +       int             (*map_sg)(struct ib_device *dev,
> > +                                 struct scatterlist *sg, int nents,
> > +                                 enum dma_data_direction direction);
> > +       void            (*unmap_sg)(struct ib_device *dev,
> > +                                   struct scatterlist *sg, int nents,
> > +                                   enum dma_data_direction direction);
> > +       u64             (*dma_address)(struct ib_device *dev,
> > +                                      struct scatterlist *sg);
> > +       unsigned int    (*dma_len)(struct ib_device *dev,
> > +                                  struct scatterlist *sg);
> > +       void            (*sync_single_for_cpu)(struct ib_device *dev,
> > +                                              u64 dma_handle,
> > +                                              size_t size,
> > +                                              enum dma_data_direction dir);
> > +       void            (*sync_single_for_device)(struct ib_device *dev,
> > +                                                 u64 dma_handle,
> > +                                                 size_t size,
> > +                                                 enum dma_data_direction dir);
> >  };
> 
> This structure misses some functions which are members of  struct
> dma_mapping_ops.
> 
> The most notable miss to me is dma_alloc/free_coherent, please note
> that an IB consumer can call dma_alloc_coherent and place the resulted
> dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS
> code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct
> usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under
> under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds

This looks like a very different version of RDS from what was
in SVN a month ago.  The SVN version didn't call dma_alloc_coherent().

> Also I see in struct dma_mapping_ops also something called
> dma_map_simple not sure what it does and who can use it.
> 
> Or.

I don't see anything with "simple" in the name.
There is one call to dma_map_single() in the inline function
for ib_dma_map_single() if the ib_device.dma_ops is NULL.
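
The fallback described here can be sketched in user space roughly as
follows. This is an illustration only, not the code from the patch: the
sketch_ types and functions are hypothetical, the ops table is reduced to
a single map_single hook, and the "generic" path just adds a fixed offset
to stand in for a real bus address translation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BUS_OFFSET 0x80000000ull

struct sketch_device;

/* Cut-down stand-in for struct ib_dma_mapping_ops. */
struct sketch_dma_ops {
	uint64_t (*map_single)(struct sketch_device *dev,
			       void *ptr, size_t size);
};

struct sketch_device {
	const struct sketch_dma_ops *dma_ops;	/* NULL: use generic path */
};

/* Stand-in for the generic dma_map_single(). */
static uint64_t sketch_generic_map_single(void *ptr, size_t size)
{
	(void)size;
	return (uint64_t)(uintptr_t)ptr + SKETCH_BUS_OFFSET;
}

/* A driver override that hands back the kernel virtual address as-is. */
static uint64_t sketch_identity_map_single(struct sketch_device *dev,
					   void *ptr, size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)ptr;
}

/* Wrapper: use the driver hook if one is installed, else the generic call. */
static uint64_t sketch_ib_dma_map_single(struct sketch_device *dev,
					 void *ptr, size_t size)
{
	assert(dev != NULL);
	if (dev->dma_ops)
		return dev->dma_ops->map_single(dev, ptr, size);
	return sketch_generic_map_single(ptr, size);
}
```

A device that leaves dma_ops NULL goes through the generic path
unchanged, so only a driver that installs the table sees different
behaviour.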


From ralph.campbell at qlogic.com  Tue Dec  5 14:21:52 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 05 Dec 2006 14:21:52 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adamz62c9sb.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
	<adamz62c9sb.fsf@cisco.com>
Message-ID: <1165357312.14800.193.camel@brick.pathscale.com>

On Tue, 2006-12-05 at 13:28 -0800, Roland Dreier wrote:
>  > This structure misses some functions which are members of  struct
>  > dma_mapping_ops.
> 
> I don't think we have to wrap every possible function if no IB
> consumer uses it.
> 
>  > The most notable miss to me is dma_alloc/free_coherent, please note
>  > that an IB consumer can call dma_alloc_coherent and place the resulted
>  > dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS
>  > code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct
>  > usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under
>  > under http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds
> 
> Given that some use of the dma_alloc_coherent interface exists though,
> I do think it makes sense to wrap it.  So Ralph can you please add
> that to your resubmission too (in addition to fixing the sync_single
> issue).

OK.

> Any other issues Or?  (BTW thanks for helping review this and pointing
> out some good issues)


From bugzilla-daemon at openib.org  Tue Dec  5 14:58:18 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Tue,  5 Dec 2006 14:58:18 -0800 (PST)
Subject: [openib-general] [Bug 286] "ifconfig ib# down" hangs telnet
	connection-- NETDEV WATCHDOG: ib0: transmit timed out
Message-ID: <20061205225818.BC7B02283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=286


------- Comment #2 from amir.vetry at sun.com  2006-12-05 14:58 -------
This issue was also reproducible on another Sun platform (Andromeda blade)
with the OFED 1.1 driver.  The following system was used to reproduce this
problem:

   - Linux 2.6.5-7.244-smp #1 SMP Mon Dec 12 18:32:25 UTC 2005 x86_64 x86_64  
     x86_64 GNU/Linux
   - SUSE LINUX Enterprise Server 9 (x86_64)
      VERSION = 9, PATCHLEVEL = 3 

Error message in /var/log/message*
=================================
      Nov 29 04:23:27  kernel: NETDEV WATCHDOG: ib0: transmit timed out
      Nov 29 04:23:27  kernel: ib0: transmit timeout: latency 6290010 msecs
      Nov 29 04:23:27  kernel: ib0: queue stopped 1, tx_head 1433713921, tx_tail 1433713886

IB-HCA detail information:
==========================
hca_id: mthca0
        fw_ver:                         4.7.600
        node_guid:                      0002:c902:0021:83bc
        sys_image_guid:                 0003:ba00:0100:d050
        vendor_id:                      0x03ba
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       SUN0050000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ARMED (3)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 7
                        port_lid:               7
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 7
                        port_lid:               9
                        port_lmc:               0x00

hca_id: mthca1
        fw_ver:                         4.7.600
        node_guid:                      0002:c902:0040:0458
        sys_image_guid:                 0003:ba00:0100:d050
        vendor_id:                      0x03ba
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       SUN0050000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 7
                        port_lid:               2
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 7
                        port_lid:               3
                        port_lmc:               0x00


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From ralph.campbell at qlogic.com  Tue Dec  5 14:59:20 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 05 Dec 2006 14:59:20 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
Message-ID: <1165359560.14800.210.camel@brick.pathscale.com>

On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote:
> On 12/5/06, Roland Dreier <rdreier at cisco.com> wrote:
> > I think this seems reasonable.  And I think it also provides a way to
> > address some hypothetical future situation where lowmem pages don't
> > have a kernel virtual address -- you would just have to use this
> > type of cookie implementation everywhere.
> 
> Such an approach would be much more cleaner and result in much less
> (~zero changes) in the ulp level, just replace dma_map_xxx calls with
> ib_dma_map_xxx calls.
> 
> A problem  see with the dma_addr_t being a cookie into a table of kv
> addresses is that its legal for a consumer to use dma_addr_t with an
> **offset** . So she gets addr y from ib_dma_map_xxx and then uses y +
> offset in the SGE provided to ibv_post_send/recv or to the fmr map
> function.
> 
> So this table is actually a search tree which allows you to match an
> offset-ed dma_addr_t returned by dma_map_xxx called by ipath
> ib_dma_map_xxx with its associated kvaddr.
> 
> I see now that i have managed to confuse myself b/c as Roland wrote
> below and i have agreed we don't actually have the kv addr for and
> unmapped page before the ipath driver maps it ie when it attempt to
> use the page... It becomes late here... am i inventing a non existant
> problem with the offset?
> 
> > (Although I don't think using kmap()/kunmap() is really the right
> > approach -- you should probably just do kmap_atomic()/kunmap_atomic()
> > only while you are actually using the page.  But the basic approach of
> > using the dma address as a cookie into a mapping table seems sound to
> > me -- you are basically doing a real sw iotlb)
> >
> > Or -- does this seem reasonable to you?
> 
> I agree that care should be made to do kmap_atomic/kunmap_atomic only
> when there is actual need to access the page by the ipath driver.
> 
> Or.

I am not following what you two are saying.

The ib_dma_mapping_ops functions as implemented by ib_ipath,
are redefining dma_addr_t as a kernel virtual address.
When ib_dma_map_single() is called, this is a NOP.
When ib_dma_map_sg() is called, the dma_map_sg() replacement needs
to convert a struct page pointer into a kernel virtual address.
When CONFIG_HIGHMEM is defined, some pages may not be mapped
into the kernel virtual address space so the driver needs to
call kmap().  Since the driver can't use the struct scatterlist
to store the kmap() result, a separate table needs to be used
so the value can be returned by ib_sg_dma_address().

Doing kmap_atomic() at the point where the kernel virtual
address is used is not practical since the driver is not
mapping dma_addr_t to struct page * although it is
possible to write it that way.  It would mean that
ib_dma_map_single() would then be more complex in that a
kernel virtual address would need to be converted to a
struct page *.


From ralphc at pathscale.com  Tue Dec  5 15:10:56 2006
From: ralphc at pathscale.com (Ralph Campbell)
Date: Tue, 05 Dec 2006 15:10:56 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1165359560.14800.210.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	<1165359560.14800.210.camel@brick.pathscale.com>
Message-ID: <1165360256.14800.213.camel@brick.pathscale.com>

On Tue, 2006-12-05 at 14:59 -0800, Ralph Campbell wrote:

> The ib_dma_mapping_ops functions as implemented by ib_ipath,
> are redefining dma_addr_t as a kernel virtual address.
> When ib_dma_map_single() is called, this is a NOP.
> When ib_dma_map_sg() is called, the dma_map_sg() replacement needs
> to convert a struct page pointer into a kernel virtual address.
> When CONFIG_HIGHMEM is defined, some pages may not be mapped
> into the kernel virtual address space so the driver needs to
> call kmap().  Since the driver can't use the struct scatterlist
> to store the kmap() result, a separate table needs to be used
> so the value can be returned by ib_sg_dma_address().
> 
> Doing kmap_atomic() at the point where the kernel virtual
> address is used is not practical since the driver is not
> mapping dma_addr_t to struct page * although it is
> possible to write it that way.  It would mean that
> ib_dma_map_single() would then be more complex in that a
> kernel virtual address would need to be converted to a
> struct page *.

I forgot this last part.

Making dma_addr_t a kernel virtual address does allow
the result to be offset (at least within a page)
but making dma_addr_t a struct page pointer doesn't.
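
To illustrate the offset point, here is a minimal user-space sketch. It
is only a model under stated assumptions: model_dma_addr_t, model_page
and both helper functions are hypothetical names, not ib_ipath or
kernel code. When the cookie is the mapped address itself, an offset
within the page is plain arithmetic on the cookie.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u        /* assumed page size for the model */

typedef uint64_t model_dma_addr_t;   /* hypothetical stand-in for dma_addr_t */

/* Hypothetical stand-in for a page that is already mapped. */
struct model_page {
    unsigned char data[MODEL_PAGE_SIZE];
};

/* The cookie is the virtual address itself, so "cookie + n" is n bytes on. */
static model_dma_addr_t cookie_from_vaddr(struct model_page *pg,
                                          unsigned int off)
{
    assert(off < MODEL_PAGE_SIZE);
    return (model_dma_addr_t)(uintptr_t)&pg->data[off];
}

/* Convert the cookie back into a pointer the driver can dereference. */
static unsigned char *vaddr_from_cookie(model_dma_addr_t cookie)
{
    return (unsigned char *)(uintptr_t)cookie;
}
```

If the cookie were a struct page pointer instead, adding a byte offset
to it would not yield an address inside the page, so the offset would
have to be carried separately.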


From rdreier at cisco.com  Tue Dec  5 15:45:08 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 05 Dec 2006 15:45:08 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1165355896.14800.185.camel@brick.pathscale.com> (Ralph
	Campbell's message of "Tue, 05 Dec 2006 13:58:16 -0800")
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<adapsb3ky1r.fsf@cisco.com>
	<1165355896.14800.185.camel@brick.pathscale.com>
Message-ID: <adamz61c3gb.fsf@cisco.com>

 > I can move the definition for struct ib_dma_mapping_ops to a
 > separate header file but if I move the inline functions
 > and include the header file at the top of ib_verbs.h,
 > then the struct ib_device is not defined and the compiler
 > complains.  I could put the #include <rdma/ib_dma_mapping.h>
 > after the definition of struct ib_device but I'm not sure
 > how acceptable that is for coding style.
 > 
 > Do you still want me to make this change?

No, that's OK.

I have a few ideas, but let's merge this basically as is and then we
can play around with it further.

 - R.


From krause at cup.hp.com  Tue Dec  5 17:27:14 2006
From: krause at cup.hp.com (Michael Krause)
Date: Tue, 05 Dec 2006 17:27:14 -0800
Subject: [openib-general] [PATCH  v2 04/13] Connection Manager
In-Reply-To: <20061205180939.GA26384@2ka.mipt.ru>
References: <20061205050725.GA26033@2ka.mipt.ru>
	<1165330925.16087.13.camel@stevo-desktop>
	<20061205151905.GA18275@2ka.mipt.ru>
	<1165333198.16087.53.camel@stevo-desktop>
	<20061205155932.GA32380@2ka.mipt.ru>
	<1165335162.16087.79.camel@stevo-desktop>
	<20061205163008.GA30211@2ka.mipt.ru>
	<1165337245.16087.95.camel@stevo-desktop>
	<20061205172649.GA20229@2ka.mipt.ru>
	<1165341100.16087.109.camel@stevo-desktop>
	<20061205180939.GA26384@2ka.mipt.ru>
Message-ID: <6.2.0.14.2.20061205172536.086fa438@esmail.cup.hp.com>


If you require more details on how this all works - it was fully explored 
in the IETF RDDP workgroup - may I suggest a reading of the RDMA Security 
Considerations draft which goes through many of the issues on how one 
relates to a host stack.   This complements the MPA spec and supports much 
of what Steve has already responded to during this string of e-mails.  We 
took a great deal of time and debate to insure this can work efficiently 
and without confusion in terms of who owns what and when.

Mike


At 10:09 AM 12/5/2006, Evgeniy Polyakov wrote:
>On Tue, Dec 05, 2006 at 11:51:40AM -0600, Steve Wise 
>(swise at opengridcomputing.com) wrote:
> > > Almost - except the case about where those skbs are coming from?
> > > It looks like they are obtained from network, since it is ethernet
> > > driver, and if they match some set of rules, they are considered as 
> valid
> > > MPA negotiation protocol.
> >
> > They come from the Ethernet driver, but that driver manages multiple HW
> > queues and these packets come from an offload queue, not the NIC queue.
> > So the HW demultiplexes.
>
>Ok, thanks for the explanation.
>
>--
>         Evgeniy Polyakov


From mst at mellanox.co.il  Tue Dec  5 23:20:02 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 09:20:02 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <ada1wnedq99.fsf@cisco.com>
References: <ada1wnedq99.fsf@cisco.com>
Message-ID: <20061206072002.GB26787@mellanox.co.il>


Roland, thanks for the comments, I'll work on addressing them.
Regarding your question:

>  > +	IPOIB_FLAG_NETIF_STOPPED  = 9,
> 
> I can't follow what this is used for.  Can you explain in small words?

Send queue overrun prevention. The current code stops the interface if the send queue gets full,
and starts it again after a sufficient number of send completions arrive.
I generalized it to: stop interface if *some* send queue becomes full,
and start it again after send completions *for that send queue* arrive.

So when I get a send completion, I need to know that the interface was stopped because
*this* queue was full, and start the interface only in this case.
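
A minimal user-space sketch of that bookkeeping, for illustration only.
The struct and function names are hypothetical and are not taken from
the IPoIB patch, and the "restart at half depth" threshold is an
assumption standing in for "a sufficient number of completions".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one send queue. */
struct send_queue {
    unsigned int outstanding;  /* posted sends not yet completed */
    unsigned int depth;        /* queue capacity */
    bool stopped_netif;        /* set if *this* queue stopped the interface */
};

/* Hypothetical model of the network interface. */
struct netif {
    bool stopped;
};

/* Post one send; stop the interface if this queue just became full. */
static bool post_send(struct netif *dev, struct send_queue *q)
{
    assert(q->depth > 0);
    if (dev->stopped || q->outstanding == q->depth)
        return false;
    q->outstanding++;
    if (q->outstanding == q->depth) {
        q->stopped_netif = true;
        dev->stopped = true;
    }
    return true;
}

/* Handle one send completion; restart only if this queue caused the stop. */
static void send_completion(struct netif *dev, struct send_queue *q)
{
    assert(q->outstanding > 0);
    q->outstanding--;
    if (q->stopped_netif && q->outstanding <= q->depth / 2) {
        q->stopped_netif = false;
        dev->stopped = false;
    }
}
```

A completion on any other queue leaves the interface stopped, which is
the reason for tracking which queue caused the stop.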

-- 
MST


From mst at mellanox.co.il  Tue Dec  5 23:26:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 09:26:04 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <adavekqau41.fsf@cisco.com>
References: <adavekqau41.fsf@cisco.com>
Message-ID: <20061206072604.GC26787@mellanox.co.il>

> Reading a little more:
> 
>  > +	/* Simple heuristic: dev->mtu > 2K ==> connected mode */
> 
> I'm not sure this is such a good idea.  I think it's setting a trap
> for people if we have magic behavior -- eg just imagine the questions
> if changing the MTU makes multicast stop working.

I know. Still, this only happens if you enable CM. Maybe it will help
to mention this in the comment in Kconfig? Log a message as well?

What do you think?

I have a notion that once this code is upstream we can work on
ways to teach kernel about net devices where MTU changes dynamically.
Or possibly, some tricks with icmp can make it work.

-- 
MST


From mst at mellanox.co.il  Tue Dec  5 23:29:34 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 09:29:34 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <ada1wnedq99.fsf@cisco.com>
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il> <ada1wnedq99.fsf@cisco.com>
Message-ID: <20061206072934.GD26787@mellanox.co.il>

> Quoting r. Roland Dreier <rdreier at cisco.com>:
> Subject: Re: [PATCH] IPoIB CM Experimental support
> 
> OK, just a very quick scan through:
> 
>  > +ib_ipoib-$(INFINIBAND_IPOIB_CM)			+= ipoib_cm.o
> 
> Does this actually work in the Makefile without the CONFIG_ prefix?  I
> don't think it's intended anyway...

It does seem to work (try it :) ), but I agree this should be fixed.

-- 
MST


From ogerlitz at voltaire.com  Wed Dec  6 00:03:40 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 10:03:40 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1165357252.14800.192.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
	<1165357252.14800.192.camel@brick.pathscale.com>
Message-ID: <4576795C.2050903@voltaire.com>

Ralph Campbell wrote:
> On Tue, 2006-12-05 at 23:09 +0200, Or Gerlitz wrote:
>> The most notable miss to me is dma_alloc/free_coherent, please note
>> that an IB consumer can call dma_alloc_coherent and place the resulted
>> dma_addr_t within an SGE provided to ibv_post_send/recv, see the RDS
>> code doing the allocation at ib_cm.c :: rds_ib_setup_qp and the direct
>> usage of the dma_addr_t at ib_recv :: rds_ib_recv_init_ring under
>> http://oss.oracle.com/projects/rds/src/trunk/linux/net/rds
> 
> This looks like a very different version of RDS from what was
> in SVN a month ago.  The SVN version didn't call dma_alloc_coherent().

Two comments: a) the SVN kernel IB code is not something you should look 
at, it's unmaintained; b) the SVN RDS code is the 2nd generation RDS code 
and is dead. RDS is now developed within the Oracle Linux group, and 
the code is what is found under the oss.oracle.com pointer above.

>> Also I see in struct dma_mapping_ops also something called
>> dma_map_simple not sure what it does and who can use it.

> I don't see anything with "simple" in the name.
> There is one call to dma_map_single() in the inline function
> for ib_dma_map_single() if the ib_device.dma_ops is NULL.

I was looking in include/asm-x86_64/dma-mapping.h and there was this 
map_simple prototype... anyway forget about it.

Or.


From mst at mellanox.co.il  Wed Dec  6 00:03:17 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:03:17 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <000601c714e0$d955fef0$92cc180a@amr.corp.intel.com>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<000601c714e0$d955fef0$92cc180a@amr.corp.intel.com>
Message-ID: <20061206080317.GG26787@mellanox.co.il>

> To handle the case
> where the connection messages are lost, a new API is added that users
> may invoke to force a connection into the established state.

Just to clarify this point - what connection messages can be lost?
E.g. if the passive side does not get an RTU for a while, it will
retry the REP, won't it?  Diagram 12.9.6 seems to indicate so:
from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
Is this implemented?

-- 
MST


From mst at mellanox.co.il  Wed Dec  6 00:17:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:17:27 +0200
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
	support
In-Reply-To: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
Message-ID: <20061206081727.GH26787@mellanox.co.il>

> The following set of patches expand the rdma_cm support to include
> UDP port space, and expose the rdma_cm to userspace.  Multicast
> support has been removed from the patches until the ib_multicast
> module can be further debugged.
> 
> Adding in multicast support later will result in new APIs and an
> ABI bump, but I do not anticipate multicast changing any of the
> existing interfaces.  (I'm also less confident that the multicast
> ABIs are correct.)
> 
> Without the multicast interfaces, I believe what's left is ready to
> merge upstream.

I agree.

Further, limited UCMA testing done on a very similar codebase in OFED 1.1 did not
turn up any issues, and CMA updates address API issues we have seen with SDP.

Acked-by: Michael S. Tsirkin <mst at mellanox.co.il>

-- 
MST


From ogerlitz at voltaire.com  Wed Dec  6 00:21:14 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 10:21:14 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <20061206080317.GG26787@mellanox.co.il>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<000601c714e0$d955fef0$92cc180a@amr.corp.intel.com>
	<20061206080317.GG26787@mellanox.co.il>
Message-ID: <45767D7A.7040502@voltaire.com>

Michael S. Tsirkin wrote:
>> To handle the case
>> where the connection messages are lost, a new API is added that users
>> may invoke to force a connection into the established state.
> 
> Just to clarify this point - what connection messages can be lost?
> E.g. if the passive side does not get an RTU for a while, it will
> retry the REP, won't it?  Diagram 12.9.6 seems to indicate so:
> from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
> Is this implemented?

It handles the case where the first RX crosses the RTU which can happen 
when the RTU is lost but also without it being lost.

Indeed the passive side would resend the REP when a timeout expires but 
the patch allows the app to force the connection establishment **now** 
(ie have the CMA move the RC QP to RTS) and not go into queuing of RX-es 
etc till the RTU is lost, it also handles the case where all the RTUs 
are lost.

Or.


From mst at mellanox.co.il  Wed Dec  6 00:22:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:22:42 +0200
Subject: [openib-general] userspace git conversion status/cut over
In-Reply-To: <20061130191717.GJ18978@sashak.voltaire.com>
References: <1164897683.11808.129709.camel@hal.voltaire.com>
	<456F0AE3.4060209@ichips.intel.com>
	<20061130191717.GJ18978@sashak.voltaire.com>
Message-ID: <20061206082242.GI26787@mellanox.co.il>

> Other issue. There is /pub/scm/linux-2.6.18/.git tree, looks it was used
> for git installation testing or so.
> 
> Does somebody use it? Could this be (re)moved?

No one seemed to care, and 2.6.19 is out anyway :)
Let's kill it then.

-- 
MST


From mst at mellanox.co.il  Wed Dec  6 00:28:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:28:57 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <45767D7A.7040502@voltaire.com>
References: <45767D7A.7040502@voltaire.com>
Message-ID: <20061206082857.GJ26787@mellanox.co.il>

> Michael S. Tsirkin wrote:
> >> To handle the case
> >> where the connection messages are lost, a new API is added that users
> >> may invoke to force a connection into the established state.
> > 
> > Just to clarify this point - what connection messages can be lost?
> > E.g. if the passive side does not get an RTU for a while, it will
> > retry the REP, won't it?  Diagram 12.9.6 seems to indicate so:
> > from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
> > Is this implemented?
> 
> It handles the case where the first RX crosses the RTU which can happen 
> when the RTU is lost but also without it being lost.
> 
> Indeed the passive side would resend the REP when a timeout expires but 
> the patch allows the app to force the connection establishment **now** 
> (ie have the CMA move the RC QP to RTS) and not go into queuing of RX-es 
> etc till the RTU is lost, it also handles the case where all the RTUs 
> are lost.

I think we all already agreed we need the rdma_established call,
for reasons that you outline. So I am not arguing at all - I was just
checking that REP re-sends are implemented.

So, a slightly more exact description for the patch would be
"to handle the case where a data packet bypasses an RTU".
Is that right?

-- 
MST


From mst at mellanox.co.il  Wed Dec  6 00:34:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:34:27 +0200
Subject: [openib-general] OFED 1.2 features update
In-Reply-To: <4575D0A8.7080501@ichips.intel.com>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
Message-ID: <20061206083427.GL26787@mellanox.co.il>

> > BTW - where are those trees located?
> 
> My trees are available from the staging.openfabrics.org/git site.  I called the 
> kernel tree rdma-dev.

Thanks, Sean!
I gather the ucma bits are in rdma_ucm?

-- 
MST


From mst at mellanox.co.il  Wed Dec  6 00:49:02 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 10:49:02 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <OF09AC5817.ACA3A1EE-ON8725723B.0063B936-8825723B.0063E81E@us.ibm.com>
References: <OF09AC5817.ACA3A1EE-ON8725723B.0063B936-8825723B.0063E81E@us.ibm.com>
Message-ID: <20061206084902.GN26787@mellanox.co.il>

> >The idea is to increase performance by increasing the MTU
> >from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD.
> >With this code, I'm able to get 800MByte/sec or more with netperf
> >without options on a Mellanox 4x back-to-back DDR system.
> 
> What about CPU utilization?

Seems to be about the same (about 100% of a single CPU).

UD:

# /mswg/work/mst/netperf-2.4.2/src/netperf -H 11.4.3.69 -f M -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.69 (11.4.3.69) port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       276.80   27.98    25.55    3.948   3.606

RC:
# /mswg/work/mst/netperf-2.4.2/src/netperf -H 11.4.3.69 -f M -c -C
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.69 (11.4.3.69) port 0 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    MBytes  /s  % S      % S      us/KB   us/KB

 87380  16384  16384    10.00       907.68   25.08    24.43    1.079   1.052

-- 
MST


From ogerlitz at voltaire.com  Wed Dec  6 01:09:06 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 11:09:06 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <20061206082857.GJ26787@mellanox.co.il>
References: <45767D7A.7040502@voltaire.com>
	<20061206082857.GJ26787@mellanox.co.il>
Message-ID: <457688B2.8040704@voltaire.com>

Michael S. Tsirkin wrote:
> I think we all already agreed we need the rdma_established call,
> for reasons that you outline. So I am not arguing at all - I was just
> checking that REP re-sends are implemented.

Yes, and it's not "the rdma_established call" but "an rdma_established" 
call. Sean has changed the name to cm_notify and rdma_notify as it 
merges within the framework of other ULP to CMA/CM notifications eg 
those related to path migration.

> So, a slightly more exact description for the patch would be
> "to handle the case where a data packet bypasses an RTU".
> Is that right?

Yes.

Or.


From ogerlitz at voltaire.com  Wed Dec  6 01:52:37 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 11:52:37 +0200
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <adaejrec9lw.fsf@cisco.com>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>
	<adaejrec9lw.fsf@cisco.com>
Message-ID: <457692E5.2050800@voltaire.com>

Roland Dreier wrote:
>  > What's the status of this patchset? It would be very useful
>  > to have rdma cm user space support enablement in 2.6.20 and without
>  > the multicast code i don't see why not merging it.
> 
> I would like to merge it, but I need to find time to read it over
> carefully.  Have you read this patch set over?  Do you have any
> comments about anything?

+ 1/5 is a small fix discussed over the list

+ 2/5 provides a functionality needed by CMA consumer and does not have 
any impact on anything below the CMA

+ 3/5 is a solution for the IB race of data crossing the RTU and is the 
outcome of a very long discussion over the list. The approach taken is 
very clean and easy to integrate for CM/CMA consumers. A similar patch 
was integrated into OFED 1.1, so it closes a hole where passive-side 
CM/CMA consumers wanting to handle this case easily were not able to do 
so with the kernel CM/CMA code; we need it in 2.6.20 to close this 
gap.

+ 4/5 adds CMA "UD offload" support using SIDR REQ/REP to exchange the 
QP and Path information. I did not experiment much with the patch other 
than running the librdmacm udaddy test program, but I have reviewed it 
and have no special comments.

+ 5/5 is the CMA user space support. I only did a light review of it, but 
my understanding is that Sean used the in-kernel ib_ucm design/code as 
the baseline for this driver, so there should be no special issues here.
This driver has long been missing from the kernel IB offering, as it enables 
using the user space rdma cm (librdmacm), which is increasingly becoming a 
must-have in the IB package of today's distros - it had better go in now.

Actually I did most of my review and testing on the multicast code, which 
is not in this patch set. I have provided feedback over the list, which 
made its way into v2 of the patches, and more feedback one-on-one to Sean 
during the sc06 devcon.

Sean - given the stability issues reported by Mellanox, I understand you 
have decided not to push the multicast code for 2.6.20, and I see that 
the focus now is on finding the bug. Once this is solved I would like to 
provide more feedback before you publish v3 - does that make sense?

Or.


From muli at il.ibm.com  Wed Dec  6 01:52:55 2006
From: muli at il.ibm.com (Muli Ben-Yehuda)
Date: Wed, 6 Dec 2006 11:52:55 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<15ddcffd0612051309s1c2b4d3cm558ae4c6479c6503@mail.gmail.com>
Message-ID: <20061206095255.GA6564@rhun.haifa.ibm.com>

On Tue, Dec 05, 2006 at 11:09:15PM +0200, Or Gerlitz wrote:

> Also I see in struct dma_mapping_ops also something called
> dma_map_simple not sure what it does and who can use it.

It's an internal implementation detail of how the GART IOMMU uses the
DMA mapping ops which doesn't belong in the dma_ops structure.

Cheers,
Muli


From ogerlitz at voltaire.com  Wed Dec  6 01:58:43 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 11:58:43 +0200
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
	RDMA CM etc
In-Reply-To: <20061206083427.GL26787@mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il>
Message-ID: <45769453.3030509@voltaire.com>

Michael S. Tsirkin wrote:
>>> BTW - where are those trees located?
>> My trees are available from the staging.openfabrics.org/git site.  I called the 
>> kernel tree rdma-dev.
> 
> Thanks, Sean!
> I gather the ucma bits are in rdma_ucm?

We must make some plan here.

1st decide on which kernel OFED 1.2 is based 2.6.19 or 2.6.20.rc1

2nd see if the rdma_ucm bits make their way into 2.6.20

3rd have Sean decide how he wants the multicast support to be integrated 
into OFED 1.2, my guess would be as a patch set over the 
ib_sa/ipoib/rdma_cm and rdma_ucm, but it's left for him to decide

4th have a release of librdmacm that supports RC/UD/UD mcast "offload" -

Or.


From eitan at mellanox.co.il  Wed Dec  6 02:17:34 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 12:17:34 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or switch]
Message-ID: <457698BE.10907@mellanox.co.il>

Hi Hal

The simulated regression caught this:
The osm.mcfdbs file now has the format:
Switch 0x0002c90000000006
LID    : Out Port(s)
0xC000 : 0x003  0x004  0x005  0x006
0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 
:0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 
:0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B 
:0xC01C :0xC01D :0xC01E :0xC01F :

Which should probably just be:
Switch 0x0002c90000000006
LID    : Out Port(s)
0xC000 : 0x003  0x004  0x005  0x006

In addition, switches that do not have any MCG entry will no longer be
included in the dump file.

The following patch fixes that.

Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

Index: opensm/osm_mcast_mgr.c
===================================================================
--- opensm/osm_mcast_mgr.c    (revision 10188)
+++ opensm/osm_mcast_mgr.c    (working copy)
@@ -1389,10 +1389,13 @@ mcast_mgr_dump_sw_routes(
  int16_t               mlid_start_ho;
  uint8_t               position = 0;
  int16_t               block_num = 0;
-  boolean_t             print_lid;
+  boolean_t             first_mlid;
+  boolean_t             first_port;
  const osm_node_t*     p_node;
  uint16_t              i, j;
  uint16_t              mask_entry;
+  char                  sw_hdr[256];
+  char                  mlid_hdr[32];

  OSM_LOG_ENTER( p_mgr->p_log, mcast_mgr_dump_sw_routes );
 
@@ -1403,9 +1406,10 @@ mcast_mgr_dump_sw_routes(

  p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw );

-  fprintf( file, "\nSwitch 0x%016" PRIx64 "\n"
+  sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n"
           "LID    : Out Port(s)\n",
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); 
+           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
+  first_mlid = TRUE;
  while ( block_num <= p_tbl->max_block_in_use )
  {
    mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE);
@@ -1413,8 +1417,8 @@ mcast_mgr_dump_sw_routes(
    {
      mlid_ho = mlid_start_ho + i;
      position = 0;
-      print_lid = FALSE;
-      fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
+      first_port = TRUE;
+      sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
      while ( position <= p_tbl->max_position )
      {
        mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]);
@@ -1423,17 +1427,27 @@ mcast_mgr_dump_sw_routes(
          position++;
          continue;
        }
-        print_lid = TRUE;
        for (j = 0 ; j < 16 ; j++)
        {
-          if ( (1 << j) & mask_entry )
-            fprintf( file, " 0x%03X ", j+(position*16) );
+              if ( (1 << j) & mask_entry ) {
+              if (first_mlid)
+              {
+                 fprintf( file,"%s", sw_hdr );
+                 first_mlid = FALSE;
+              }
+              if (first_port)
+              {
+                 fprintf( file,"%s", mlid_hdr );
+                 first_port = FALSE;
+              }
+                  fprintf( file, " 0x%03X ", j+(position*16) );
+              }
        }
        position++;
      }
-      if (print_lid)
+      if (first_port == FALSE)
      {
-        fprintf( file, "\n" );
+         fprintf( file, "\n" );
      }
    }
    block_num++;
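
The deferred-header idea in the patch above can be shown in a small
standalone sketch. It is a simplified model, not the OpenSM code:
dump_switch() and its parameters are hypothetical, and each MLID has
only a single 16-bit port mask here, whereas the real table has several
mask positions per MLID. Both headers are formatted up front but written
only when the first port bit is actually found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical, simplified dump routine.
 * Returns the number of MLID lines written for this switch.
 */
static int dump_switch(FILE *out, uint64_t guid, uint16_t first_mlid,
                       const uint16_t *masks, unsigned int n_mlids)
{
    char sw_hdr[64];
    char mlid_hdr[16];
    bool sw_hdr_pending = true;   /* switch header not written yet */
    int lines = 0;
    unsigned int i, j;

    assert(out != NULL && masks != NULL);
    snprintf(sw_hdr, sizeof(sw_hdr),
             "\nSwitch 0x%016llx\nLID    : Out Port(s)\n",
             (unsigned long long)guid);

    for (i = 0; i < n_mlids; i++) {
        bool mlid_hdr_pending = true;   /* MLID prefix not written yet */

        snprintf(mlid_hdr, sizeof(mlid_hdr), "0x%04X :",
                 (unsigned int)(first_mlid + i));
        for (j = 0; j < 16; j++) {
            if (!(masks[i] & (1u << j)))
                continue;
            if (sw_hdr_pending) {
                fputs(sw_hdr, out);
                sw_hdr_pending = false;
            }
            if (mlid_hdr_pending) {
                fputs(mlid_hdr, out);
                mlid_hdr_pending = false;
            }
            fprintf(out, " 0x%03X ", j);
        }
        if (!mlid_hdr_pending) {   /* something was printed for this MLID */
            fputc('\n', out);
            lines++;
        }
    }
    return lines;
}
```

With this structure an MLID with an all-zero mask produces no output,
and a switch whose masks are all zero produces no header either.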


From mst at mellanox.co.il  Wed Dec  6 02:17:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 12:17:05 +0200
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
	RDMA CM etc
In-Reply-To: <45769453.3030509@voltaire.com>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
Message-ID: <20061206101705.GP26787@mellanox.co.il>

> >>> BTW - where are those trees located?
> >> My trees are available from the staging.openfabrics.org/git site.  I called the 
> >> kernel tree rdma-dev.
> > 
> > Thanks, Sean!
> > I gather the ucma bits are in rdma_ucm?
> 
> We must make some plan here.
> 
> 1st decide on which kernel OFED 1.2 is based 2.6.19 or 2.6.20.rc1

1st is probably to fix the mcast bits so that they don't crash the machine.
OFED will be based on whatever is merged by Linus by that time + any number of patches
and out of kernel modules.

> 2nd see if the rdma_ucm bits make their way into 2.6.20

Until that's closed we can keep stuff in patches, assuming it's reasonably stable
(as in - does not interfere with other work).

> 3rd have Sean decide how he wants the multicast support to be integrated 
> into OFED 1.2, my guess would be as a patch set over the 
> ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide

Yes. The idea is to have in OFED Linus' tree + any number of additional files +
any number of patches.

The point of this is that merges from upstream must be seamless, and if they
break something I know which patch to blame.
Makefile conflicts I can handle so Makefile additions even in core can go in.

> 4th have a release of librdmacm that supports RC/UD/UD mcast "offload" -

Need to also think how whatever library OFED ships will work on current and
future upstream kernels.  I would like to see some plan that will ensure
backward compatibility for tools that do not use multicast.

Maybe the right thing is to split the multicast stuff in a separate library,
or have a separate ABI version for multicast, I don't really know.

-- 
MST


From eitan at mellanox.co.il  Wed Dec  6 02:21:22 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 12:21:22 +0200
Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query with
	zero LID
Message-ID: <457699A2.9070206@mellanox.co.il>

Hi Hal,

This is another catch from the nightly simulator based regression.
Simple: if OpenSM gets a PathRecord query that eventually maps into a port 
with zero LID (either SRC or DST),
it just asserts (in debug mode) on getting the LFT.

The following patch catches this error.

EZ

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

Index: opensm/osm_sa_path_record.c
===================================================================
--- opensm/osm_sa_path_record.c    (revision 10188)
+++ opensm/osm_sa_path_record.c    (working copy)
@@ -976,6 +976,22 @@ __osm_pr_rcv_get_port_pair_paths(
                                &src_lid_max_ho );
   }
 
+  if ( src_lid_min_ho == 0 )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_pr_rcv_get_port_pair_paths: ERR 1F20:"
+             "Obtained zero source LID. No such LID possible.\n");
+     goto Exit;
+  }
+
+  if ( dest_lid_min_ho == 0 )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_pr_rcv_get_port_pair_paths: ERR 1F21:"
+             "Obtained zero destination LID. No such LID possible.\n");
+     goto Exit;
+  }
+
   if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
   {
     osm_log( p_rcv->p_log, OSM_LOG_DEBUG,


From eitan at mellanox.co.il  Wed Dec  6 02:23:24 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 12:23:24 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down
 routing engines
In-Reply-To: <1165347459.25587.78224.camel@hal.voltaire.com>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802302048-git-send-email-sashak@voltaire.com>
	<1165347459.25587.78224.camel@hal.voltaire.com>
Message-ID: <45769A1C.2090406@mellanox.co.il>

Hal Rosenstock wrote:
> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
>   
>> This updates "file" and "updn" (up/down) routing engines which should
>> work properly now with changed LFT setup mechanism.
>>
>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>>     
>
> Thanks. Applied.
>   
Are these patches inserted into SVN or GIT ?

Eitan
> -- Hal
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From eitan at mellanox.co.il  Wed Dec  6 02:39:07 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 12:39:07 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down
 routing engines
In-Reply-To: <45769A1C.2090406@mellanox.co.il>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802302048-git-send-email-sashak@voltaire.com>
	<1165347459.25587.78224.camel@hal.voltaire.com>
	<45769A1C.2090406@mellanox.co.il>
Message-ID: <45769DCB.7010105@mellanox.co.il>

Eitan Zahavi wrote:
> Hal Rosenstock wrote:
>   
>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
>>   
>>     
>>> This updates "file" and "updn" (up/down) routing engines which should
>>> work properly now with changed LFT setup mechanism.
>>>
>>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>>>     
>>>       
>> Thanks. Applied.
>>   
>>     
> Are these patches inserted into SVN or GIT 
>   
Ignore this - just cloned GIT and its there
> Eitan
>   
>> -- Hal


From chevchenkovic at gmail.com  Wed Dec  6 02:52:23 2006
From: chevchenkovic at gmail.com (Chevchenkovic Chevchenkovic)
Date: Wed, 6 Dec 2006 16:22:23 +0530
Subject: [openib-general] Forwarding tables
Message-ID: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com>

Hi,
   I would like to write my own forwarding table to be used by OpenSM.
I hope some expert here can help me out with this.
1.  How do I write the new table to a file? What is the format used,
and what are the commands to be used while loading OpenSM?
2.  Which part of the code should I modify so as to incorporate this
changing of linear forwarding tables in the code itself?
Help would be very much appreciated.
Best Wishes,
-Chev


From halr at voltaire.com  Wed Dec  6 03:11:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 06:11:47 -0500
Subject: [openib-general] [PATCH 5/5] opensm: updates file and up/down
 routing engines
In-Reply-To: <45769DCB.7010105@mellanox.co.il>
References: <11645802043173-git-send-email-sashak@voltaire.com>
	<11645802302048-git-send-email-sashak@voltaire.com>
	<1165347459.25587.78224.camel@hal.voltaire.com>
	<45769A1C.2090406@mellanox.co.il> <45769DCB.7010105@mellanox.co.il>
Message-ID: <1165403496.25587.119503.camel@hal.voltaire.com>

On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
> Eitan Zahavi wrote:
> > Hal Rosenstock wrote:
> >   
> >> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> >>   
> >>     
> >>> This updates "file" and "updn" (up/down) routing engines which should
> >>> work properly now with changed LFT setup mechanism.
> >>>
> >>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> >>>     
> >>>       
> >> Thanks. Applied.
> >>   
> >>     
> > Are these patches inserted into SVN or GIT 
> >   
> Ignore this - just cloned GIT and its there

Were your latest regressions run against svn or git clone ?

-- Hal

> > Eitan
> >   
> >> -- Hal


From erezz at voltaire.com  Wed Dec  6 03:29:27 2006
From: erezz at voltaire.com (Erez Zilber)
Date: Wed, 06 Dec 2006 13:29:27 +0200
Subject: [openib-general] [PATCH] IB/iser: Remove unused "write-only"
	variables
In-Reply-To: <ada64cqf9bk.fsf@cisco.com>
References: <ada64cqf9bk.fsf@cisco.com>
Message-ID: <4576A997.9030602@voltaire.com>

Roland Dreier wrote:
> Remove variables that are set but then never looked at in the iSER
> initiator.  These cleanups came from David Binderman's list of "set
> but never used" warnings from icc.
>
> Signed-off-by: Roland Dreier <rolandd at cisco.com>
> ---
> Erez, does this look OK to merge?
>
>  drivers/infiniband/ulp/iser/iser_initiator.c |    4 ----
>  drivers/infiniband/ulp/iser/iser_memory.c    |    3 +--
>  2 files changed, 1 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 9b3d79c..e73c87b 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -487,10 +487,8 @@ int iser_send_control(struct iscsi_conn
>  	struct iscsi_iser_conn *iser_conn = conn->dd_data;
>  	struct iser_desc *mdesc = mtask->dd_data;
>  	struct iser_dto *send_dto = NULL;
> -	unsigned int itt;
>  	unsigned long data_seg_len;
>  	int err = 0;
> -	unsigned char opcode;
>  	struct iser_regd_buf *regd_buf;
>  	struct iser_device *device;
>  
> @@ -512,8 +510,6 @@ int iser_send_control(struct iscsi_conn
>  
>  	iser_reg_single(device, send_dto->regd[0], DMA_TO_DEVICE);
>  
> -	itt = ntohl(mtask->hdr->itt);
> -	opcode = mtask->hdr->opcode & ISCSI_OPCODE_MASK;
>  	data_seg_len = ntoh24(mtask->hdr->dlength);
>  
>  	if (data_seg_len > 0) {
> diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
> index 0606744..e5a1091 100644
> --- a/drivers/infiniband/ulp/iser/iser_memory.c
> +++ b/drivers/infiniband/ulp/iser/iser_memory.c
> @@ -234,7 +234,7 @@ static int iser_sg_to_page_vec(struct is
>  {
>  	struct scatterlist *sg = (struct scatterlist *)data->buf;
>  	dma_addr_t first_addr, last_addr, page;
> -	int start_aligned, end_aligned;
> +	int end_aligned;
>  	unsigned int cur_page = 0;
>  	unsigned long total_sz = 0;
>  	int i;
> @@ -248,7 +248,6 @@ static int iser_sg_to_page_vec(struct is
>  		first_addr = sg_dma_address(&sg[i]);
>  		last_addr  = first_addr + sg_dma_len(&sg[i]);
>  
> -		start_aligned = !(first_addr & ~MASK_4K);
>  		end_aligned   = !(last_addr  & ~MASK_4K);
>  
>  		/* continue to collect page fragments till aligned or SG ends */
>   
I'm ok with that. Thanks.

-- 

____________________________________________________________

Erez Zilber | 972-9-971-7689

Software Engineer, Storage Team

Voltaire – _The Grid Backbone_

__

www.voltaire.com <http://www.voltaire.com/>


From ogerlitz at voltaire.com  Wed Dec  6 03:33:07 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 13:33:07 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1165359560.14800.210.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	<1165359560.14800.210.camel@brick.pathscale.com>
Message-ID: <4576AA73.105@voltaire.com>

Ralph Campbell wrote:
> On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote:
>> On 12/5/06, Roland Dreier <rdreier at cisco.com> wrote:
> I am not following what you two are saying.

> The ib_dma_mapping_ops functions as implemented by ib_ipath,
> are redefining dma_addr_t as a kernel virtual address.
> When ib_dma_map_single() is called, this is a NOP.
> When ib_dma_map_sg() is called, the dma_map_sg() replacement needs
> to convert a struct page pointer into a kernel virtual address.
> When CONFIG_HIGHMEM is defined, some pages may not be mapped
> into the kernel virtual address space so the driver needs to
> call kmap().  Since the driver can't use the struct scatterlist
> to store the kmap() result, a separate table needs to be used
> so the value can be returned by ib_sg_dma_address().

Indeed.

> Doing kmap_atomic() at the point where the kernel virtual
> address is used is not practical since the driver is not
> mapping dma_addr_t to struct page * although it is
> possible to write it that way.  It would mean that
> ib_map_single() would then be more complex in that a
> kernel virtual address would need to be converted to a
> struct page *.

Basically, what Roland suggests is that you implement a SW IOTLB 
mapping from dma_addr_t (possibly with an offset) to a kv addr, and do the 
actual kmap/unmap calls before/after you must touch the data.

Is this impossible?

Or.


From halr at voltaire.com  Wed Dec  6 03:33:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 06:33:57 -0500
Subject: [openib-general] Forwarding tables
In-Reply-To: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com>
References: <1c16cdf90612060252t38f5ab5cn995c2c5140498005@mail.gmail.com>
Message-ID: <1165404763.25587.120284.camel@hal.voltaire.com>

Hi Chev,

On Wed, 2006-12-06 at 05:52, Chevchenkovic Chevchenkovic wrote:
> Hi,
>    I would like to write my own forwarding table to be used by OpenSM.
> I hope some expert here can help me out with this.
> 1.  How do I write the new table to a file? What is the format used,
> and what are the commands to be used while loading OpenSM?

Run dump_lfts.sh on a subnet to see the file format used for loading.

> 2.  Which part of the code should I modify so as to incorporate this
> changing of linear forwarding tables in the code itself?

None if you just want to load it from a file. See opensm man page. The
options are:

       -R, --routing_engine
              This option chooses routing engine instead of Min Hop  algorithm
              (default). Supported engines: updn, file

       -M, --lid_matrix_file
              This  option specifies the name of the lid matrix dump file from
              where switch lid matrices (min hops tables) will be loaded.

Also, see osm/doc/modular-routing.doc in the svn or git repository for
userspace management.

If you do want to write an algorithm, then there is some "intrusive"
work to do. Is file based sufficient for now ? Will you be adding an
additional routing algorithm ? Or do you just want to experiment for now
?

-- Hal

> Help would be very much appreciated.


> Best Wishes,
> -Chev
-------------- next part --------------
Modular Routing Engine

Modular routing engine structure has been added to allow
for ease of "plugging" new routing modules.

Currently, only unicast callbacks are supported. Multicast
can be added later.
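
As an illustration of what "plugging" a routing module through callbacks
can look like, here is a minimal sketch. It is not the OpenSM structure:
struct routing_engine, its field names, and find_engine() are hypothetical,
and the callbacks are empty placeholders. Only the module names "updn" and
"file" come from this document.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback table; not the real OpenSM structure. */
struct routing_engine {
    const char *name;
    int (*build_lid_matrices)(void *ctx);  /* NULL: use the default */
    int (*build_ucast_tables)(void *ctx);  /* unicast LFT setup */
};

/* Placeholder callbacks; a real module would do the routing work. */
static int updn_lid_matrices(void *ctx) { (void)ctx; return 0; }
static int updn_ucast_tables(void *ctx) { (void)ctx; return 0; }
static int file_ucast_tables(void *ctx) { (void)ctx; return 0; }

static const struct routing_engine engines[] = {
    { "updn", updn_lid_matrices, updn_ucast_tables },
    { "file", NULL,              file_ucast_tables },
};

/* Look up a module by the name given on the command line. */
static const struct routing_engine *find_engine(const char *name)
{
    size_t i;

    if (name == NULL)
        return NULL;
    for (i = 0; i < sizeof(engines) / sizeof(engines[0]); i++)
        if (strcmp(engines[i].name, name) == 0)
            return &engines[i];
    return NULL;  /* unknown name: caller keeps the default algorithm */
}
```

With a table like this, adding a module means adding one entry, and a NULL
callback lets a module reuse the default step it does not override.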

One existing routing module is up-down "updn", which may be
activated with the '-R updn' option (instead of the old '-u').

General usage is:
$ opensm -R 'module-name'

There is also a trivial routing module which is able
to load LFT tables from a dump file.

Main features:

- support for unicast LFTs only; support for multicast can be added later
- this will run after min hop matrix calculation
- this will load switch LFTs according to the path entries introduced in
  the dump file
- no additional checks will be performed (such as "is port connected", etc.)
- if fabric LIDs have changed, this will try to reconstruct the LFTs
  correctly, provided endport GUIDs are present in the dump file (to
  disable this, the GUIDs may be removed from the dump file or zeroed)

The dump file format is compatible with the output of the 'ibroute' utility,
and for the whole fabric it may be generated with a script like this:

  for sw_lid in `ibswitches | awk '{print $NF}'` ; do
	ibroute $sw_lid
  done > /path/to/dump_file

or, using DR paths:

  for sw_dr in `ibnetdiscover -v \
		| sed -ne '/^DR path .* switch /s/^DR path \[\(.*\)\].*$/\1/p' \
		| sed -e 's/\]\[/,/g' \
		| sort -u` ; do
	ibroute -D ${sw_dr}
  done > /path/to/dump_file

This script is dump_lfts.sh

To activate the new module, use:

  opensm -R file -U /path/to/dump_file

If the dump_file is not found or is in error, the default routing 
algorithm is utilized.
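
The fallback rule above can be sketched as follows. This is an assumption
about the shape of the logic, not the OpenSM implementation: the enum and
choose_routing() are hypothetical names, and only the "file cannot be
opened" case is modelled. A real loader would also fall back when the file
contents fail to parse.

```c
#include <stdio.h>

enum routing_choice { ROUTING_DEFAULT, ROUTING_FROM_FILE };

/*
 * Decide which routing source to use. Any failure to open the dump
 * file selects the default algorithm instead of aborting.
 */
static enum routing_choice choose_routing(const char *dump_path)
{
    FILE *f;

    if (dump_path == NULL)
        return ROUTING_DEFAULT;

    f = fopen(dump_path, "r");
    if (f == NULL)
        return ROUTING_DEFAULT;

    /* Parsing would happen here; a parse error would also fall back. */
    fclose(f);
    return ROUTING_FROM_FILE;
}
```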

The ability to dump switch lid matrices (aka min hops tables) to file and
later to load these is also supported.

The usage is similar to unicast forwarding tables loading from dump
file (introduced by 'file' routing engine), but new lid matrix file
name should be specified by -M or --lid_matrix_file option. For example:

  opensm -R file -M ./opensm-lid-matrix.dump

The dump file is named 'opensm-lid-matrix.dump' and will be generated in
standard opensm dump directory (/var/log by default) when
OSM_LOG_ROUTING logging flag is set.

When routing engine 'file' is activated but the dump file is not specified
or cannot be opened, the default lid matrix algorithm will be used.

There is also a switch forwarding tables dumper which generates
a file compatible with dump_lfts.sh output. This file can be used
as input for forwarding tables loading by 'file' routing engine.
Either or both of the options -U and -M can be specified together with '-R file'.

NOTE: ibroute has been updated (for switch management ports) to support this.
Also, lmc was added to switch management ports. ibroute needs to be r7855 or
later from the trunk.


From ogerlitz at voltaire.com  Wed Dec  6 03:35:29 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 06 Dec 2006 13:35:29 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <adairgqc9nb.fsf@cisco.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	<adairgqc9nb.fsf@cisco.com>
Message-ID: <4576AB01.9070206@voltaire.com>

Roland Dreier wrote:
>  > A problem I see with the dma_addr_t being a cookie into a table of kv
>  > addresses is that its legal for a consumer to use dma_addr_t with an
>  > **offset** . So she gets addr y from ib_dma_map_xxx and then uses y +
>  > offset in the SGE provided to ibv_post_send/recv or to the fmr map
>  > function.
> 
> Yes, that is a little bit of an issue.  But I think it just means the
> ipath driver needs to keep page tables exactly the way an IOTLB would
> -- ugly but not impossible to handle.

OK

> 
>  > I see now that I have managed to confuse myself, because as Roland wrote
>  > below and I have agreed, we don't actually have the kv addr for an
>  > unmapped page before the ipath driver maps it, i.e. when it attempts to
>  > use the page... It is getting late here... am I inventing a nonexistent
>  > problem with the offset?
> 
> The dma address doesn't have to be a kvaddr -- it is purely an address
> space defined by the low-level driver.

OK, you are right, the dma_addr_t returned by the ipath ib_map_xxx calls 
would live in a virtual space defined by the ipath implementation, but 
has to be presented in the form of a dma_addr_t.

Or.
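
To make the "cookie plus offset" point in this thread concrete, here is a
small user-space sketch of an IOTLB-like table. It is not ipath code and
not a kernel API: all names (sk_map_page, sk_translate, sk_unmap_page), the
4 KB page size, and the fixed slot count are assumptions for illustration.
The cookie keeps the page offset bits free, so a consumer may add an offset
to the returned address and the driver can still translate it.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  ((uint64_t)1 << SK_PAGE_SHIFT)
#define SK_MAX_SLOTS  64

/* Slot i backs cookie page number i + 1; NULL marks a free slot. */
static void *sk_slot_kva[SK_MAX_SLOTS];

/* Map one page; returns a page-aligned cookie, or 0 on failure. */
static uint64_t sk_map_page(void *kva)
{
    size_t i;

    if (kva == NULL)
        return 0;
    for (i = 0; i < SK_MAX_SLOTS; i++) {
        if (sk_slot_kva[i] == NULL) {
            sk_slot_kva[i] = kva;
            return (uint64_t)(i + 1) << SK_PAGE_SHIFT;
        }
    }
    return 0;  /* table full */
}

/* Translate cookie + offset back to an address; NULL if not mapped. */
static void *sk_translate(uint64_t addr)
{
    uint64_t page = addr >> SK_PAGE_SHIFT;
    uint64_t off = addr & (SK_PAGE_SIZE - 1);

    if (page == 0 || page > SK_MAX_SLOTS || sk_slot_kva[page - 1] == NULL)
        return NULL;
    return (char *)sk_slot_kva[page - 1] + off;
}

/* Release the slot behind a cookie (offset bits are ignored). */
static void sk_unmap_page(uint64_t cookie)
{
    uint64_t page = cookie >> SK_PAGE_SHIFT;

    if (page >= 1 && page <= SK_MAX_SLOTS)
        sk_slot_kva[page - 1] = NULL;
}
```

In a driver, the translate step is where a kmap()/kunmap() pair around the
actual data access would go; the sketch stores a ready pointer only to stay
runnable in user space.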


From eitan at mellanox.co.il  Wed Dec  6 03:35:21 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 6 Dec 2006 13:35:21 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
 up/downrouting engines
Message-ID: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>

Run against SVN.
Will move to GIT today (hopefully  - if I am able to git clone without
password ...)

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, December 06, 2006 1:12 PM
> To: Eitan Zahavi
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and
> up/downrouting engines
> 
> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
> > Eitan Zahavi wrote:
> > > Hal Rosenstock wrote:
> > >
> > >> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> > >>
> > >>
> > >>> This updates "file" and "updn" (up/down) routing engines which
> > >>> should work properly now with changed LFT setup mechanism.
> > >>>
> > >>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > >>>
> > >>>
> > >> Thanks. Applied.
> > >>
> > >>
> > > Are these patches inserted into SVN or GIT
> > >
> > Ignore this - just cloned GIT and its there
> 
> Were your latest regressions run against svn or git clone ?
> 
> -- Hal
> 
> > > Eitan
> > >
> > >> -- Hal


From ramachandra.kuchimanchi at qlogic.com  Tue Dec  5 23:06:32 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi)
Date: Wed, 6 Dec 2006 01:06:32 -0600
Subject: [openib-general] ib_send_cm_dreq() and cm_id doubt
In-Reply-To: <4575D061.3010808@ichips.intel.com>
References: <C07C40DB2364324799506DE8FF12F8D817C642@EPEXCH1.qlogic.org>
	<4575D061.3010808@ichips.intel.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D81A10FD@EPEXCH1.qlogic.org>

> > After sending a CM DREQ with ib_send_cm_dreq(), is it OK to destroy
> > the cm_id  without waiting for a DREP ? This is of course assuming
> > that we are not really concerned if the DREQ reached the other end
> > or not.
> 
> Yes - you can even destroy the cm_id before calling ib_send_cm_dreq(),
> which will result in sending a DREQ if the cm_id is still connected.
> 
> - Sean

Thanks for the info.

Regards,
Ram


From halr at voltaire.com  Wed Dec  6 03:57:53 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 06:57:53 -0500
Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query
	with zero LID
In-Reply-To: <457699A2.9070206@mellanox.co.il>
References: <457699A2.9070206@mellanox.co.il>
Message-ID: <1165406233.25587.121329.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 05:21, Eitan Zahavi wrote:
> Hi Hal,
> 
> This is another catch from the nightly simulator based regression.
> Simple: if OpenSM gets a PathRecord that eventually maps into a port 
> with zero LID (either SRC or DST)
> it just asserts (in debug mode) on getting the LFT.
> 
> The following patch catches this error.

Thanks. Applied (only to the management git repository).

A couple of related questions:
1. Is this needed as an OFED 1.1 patch ?
2. Is the same thing needed for SA MultiPathRecord ?

-- Hal


From eitan at mellanox.co.il  Wed Dec  6 04:01:13 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 6 Dec 2006 14:01:13 +0200
Subject: [openib-general] [PATCH] osm: OpenSM exits on PathRecord query
	with zero LID
Message-ID: <6C2C79E72C305246B504CBA17B5500C96DF394@mtlexch01.mtl.com>

Hi Hal,

> 1. Is this needed as an OFED 1.1 patch ?
I would leave OFED 1.1 as it is for now. A wrong query can still crash the SM,
but I have not heard of such a case so far.
> 2. Is the same thing needed for SA MultiPathRecord ?
Probably yes.


Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL


> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com]
> Sent: Wednesday, December 06, 2006 1:58 PM
> To: Eitan Zahavi
> Cc: Sasha Khapyorsky; Yevgeny Kliteynik; OPENIB GENERAL
> Subject: Re: [PATCH] osm: OpenSM exits on PathRecord query with zero
> LID
> 
> Hi Eitan,
> 
> On Wed, 2006-12-06 at 05:21, Eitan Zahavi wrote:
> > Hi Hal,
> >
> > This is another catch from the nightly simulator based regression.
> > Simple: if OpenSM gets a PathRecord that eventually maps into a port
> > with zero LID (either SRC or DST) it just asserts (in debug mode) on
> > getting the LFT.
> >
> > The following patch catches this error.
> 
> Thanks. Applied (only to the management git repository).
> 
> A couple of related questions:
> 1. Is this needed as an OFED 1.1 patch ?
> 2. Is the same thing needed for SA MultiPathRecord ?
> 
> -- Hal


From eitan at mellanox.co.il  Wed Dec  6 05:18:52 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 15:18:52 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or
 switch]
In-Reply-To: <457698BE.10907@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il>
Message-ID: <4576C33C.7050204@mellanox.co.il>

Hi Hal,

Here is the same patch against GIT for your convenience.

Thanks

EZ

The simulated regression caught this:
The osm.mcfdbs file now has the format:
Switch 0x0002c90000000006
LID    : Out Port(s)
0xC000 : 0x003  0x004  0x005  0x006
0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 
:0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 
:0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B 
:0xC01C :0xC01D :0xC01E :0xC01F :

Which should probably just be:
Switch 0x0002c90000000006
LID    : Out Port(s)
0xC000 : 0x003  0x004  0x005  0x006

With this patch, switches that do not have any MCG entry will also not be
included in the dump file.

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

--- osm/opensm/osm_mcast_mgr.c    2006-12-06 12:39:13.018015000 +0200
+++ osm/opensm/osm_mcast_mgr.c    2006-12-06 12:06:29.602097000 +0200
@@ -1388,10 +1389,13 @@ mcast_mgr_dump_sw_routes(
   int16_t               mlid_start_ho;
   uint8_t               position = 0;
   int16_t               block_num = 0;
-  boolean_t             print_lid;
+  boolean_t             first_mlid;
+  boolean_t             first_port;
   const osm_node_t*     p_node;
   uint16_t              i, j;
   uint16_t              mask_entry;
+  char                  sw_hdr[256];
+  char                  mlid_hdr[32];
 
   OSM_LOG_ENTER( p_mgr->p_log, mcast_mgr_dump_sw_routes );
  
@@ -1402,9 +1406,10 @@ mcast_mgr_dump_sw_routes(
 
   p_tbl = osm_switch_get_mcast_tbl_ptr( p_sw );
 
-  fprintf( file, "\nSwitch 0x%016" PRIx64 "\n"
+  sprintf( sw_hdr, "\nSwitch 0x%016" PRIx64 "\n"
            "LID    : Out Port(s)\n",
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ) ); 
+           cl_ntoh64( osm_node_get_node_guid( p_node ) ) );
+  first_mlid = TRUE;
   while ( block_num <= p_tbl->max_block_in_use )
   {
     mlid_start_ho = (uint16_t)(block_num * IB_MCAST_BLOCK_SIZE);
@@ -1412,8 +1417,8 @@ mcast_mgr_dump_sw_routes(
     {
       mlid_ho = mlid_start_ho + i;
       position = 0;
-      print_lid = FALSE;
-      fprintf( file, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
+      first_port = TRUE;
+      sprintf( mlid_hdr, "0x%04X :", mlid_ho + IB_LID_MCAST_START_HO );
       while ( position <= p_tbl->max_position )
       {
         mask_entry = cl_ntoh16((*p_tbl->p_mask_tbl)[mlid_ho][position]);
@@ -1422,17 +1427,27 @@ mcast_mgr_dump_sw_routes(
           position++;
           continue;
         }
-        print_lid = TRUE;
         for (j = 0 ; j < 16 ; j++)
         {
-          if ( (1 << j) & mask_entry )
-            fprintf( file, " 0x%03X ", j+(position*16) );
+              if ( (1 << j) & mask_entry ) {
+              if (first_mlid)
+              {
+                 fprintf( file,"%s", sw_hdr );
+                 first_mlid = FALSE;
+              }
+              if (first_port)
+              {
+                 fprintf( file,"%s", mlid_hdr );
+                 first_port = FALSE;
+              }
+                  fprintf( file, " 0x%03X ", j+(position*16) );
+              }
         }
         position++;
       }
-      if (print_lid)
+      if (first_port == FALSE)
       {
-        fprintf( file, "\n" );
+         fprintf( file, "\n" );
       }
     }
     block_num++;
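
The technique used in the patch above is to format the headers up front but
write them only when the first real entry turns up. A minimal standalone
sketch of that idea follows. It is not the OpenSM function: dump_switch(),
the flat masks[] array, and NUM_MLIDS are hypothetical simplifications (one
16-bit port mask per MLID, no blocks or positions).

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_MLIDS 4

/*
 * Dump one switch's multicast port masks. masks[m] is the port mask for
 * MLID 0xC000 + m. The switch header and each MLID prefix are written
 * lazily, so empty MLIDs and empty switches produce no output at all.
 * Returns the number of MLID lines written.
 */
static int dump_switch(FILE *out, uint64_t guid,
                       const uint16_t masks[NUM_MLIDS])
{
    char sw_hdr[64];
    char mlid_hdr[32];
    int first_mlid = 1;
    int lines = 0;
    int m, j;

    snprintf(sw_hdr, sizeof(sw_hdr), "\nSwitch 0x%016llx\n",
             (unsigned long long)guid);

    for (m = 0; m < NUM_MLIDS; m++) {
        int first_port = 1;

        snprintf(mlid_hdr, sizeof(mlid_hdr), "0x%04X :",
                 (unsigned)(0xC000 + m));
        for (j = 0; j < 16; j++) {
            if (!(masks[m] & (1u << j)))
                continue;
            if (first_mlid) {          /* first entry for this switch */
                fputs(sw_hdr, out);
                first_mlid = 0;
            }
            if (first_port) {          /* first port for this MLID */
                fputs(mlid_hdr, out);
                first_port = 0;
            }
            fprintf(out, " 0x%03X ", (unsigned)j);
        }
        if (!first_port) {             /* close the line only if opened */
            fputc('\n', out);
            lines++;
        }
    }
    return lines;
}
```

Calling it with a mask of 0x0078 for the first MLID and zeros elsewhere
writes one switch header and a single 0xC000 line listing ports 3 to 6; an
all-zero array writes nothing.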


From eitan at mellanox.co.il  Wed Dec  6 05:25:08 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 15:25:08 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
 up/downrouting engines
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
Message-ID: <4576C4B4.9080608@mellanox.co.il>

Hi Hal,

I just ran one iteration of the simulation regression against the git tree:
The Multicast fails on the change of format of osm.mcfdbs
The Stability flows failed on the change of subnet.lst to osm-subnet.lst ...

doing another loop: run=0 cron=0 hour=14
OsmStress IS1-16.topo ... PASS
LidMgr IS1-16.topo ... PASS
LidMgr IS1-16.topo ... PASS
LidMgr IS1-16.topo ... PASS
LidMgr IS3-128.topo ... PASS
Multicast IS1-16.topo ... FAIL (sleeping 10)
Multicast IS1-16.topo ... FAIL (sleeping 10)
Multicast IS1-16.topo ... FAIL (sleeping 10)
Multicast IS3-128.topo ... FAIL (sleeping 10)
Multicast IS3-loop.topo ... FAIL (sleeping 10)
Stability IS1-16.topo ... FAIL (sleeping 10)
Stability IS1-16.topo ... FAIL (sleeping 10)
Stability IS1-16.topo ... FAIL (sleeping 10)
Stability IS3-128.topo ... FAIL (sleeping 10)
Stability IS3-loop.topo ... FAIL (sleeping 10)
OsmTest IS1-16.topo ... PASS
OsmTest IS1-16.topo ... PASS
OsmTest IS1-16.topo ... PASS
OsmTest IS3-128.topo ... PASS
OsmTest IS3-loop.topo ... PASS
Pkey IS1-16.topo ... PASS
Pkey IS1-16.topo ... PASS
Pkey IS1-16.topo ... PASS
Pkey IS3-128.topo ... PASS
OsmStress IS1-16.topo ... PASS
OsmStress IS1-16.topo ... PASS
OsmStress IS3-128.topo ... PASS


Eitan Zahavi wrote:
> Run against SVN.
> Will move to GIT today (hopefully  - if I am able to git clone without
> password ...)
>
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
>
>
>   
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:halr at voltaire.com]
>> Sent: Wednesday, December 06, 2006 1:12 PM
>> To: Eitan Zahavi
>> Cc: openib-general at openib.org
>> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and
>> up/downrouting engines
>>
>> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
>>     
>>> Eitan Zahavi wrote:
>>>       
>>>> Hal Rosenstock wrote:
>>>>
>>>>         
>>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> This updates "file" and "updn" (up/down) routing engines which
>>>>>> should work properly now with changed LFT setup mechanism.
>>>>>>
>>>>>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>>>>>>
>>>>>>
>>>>>>             
>>>>> Thanks. Applied.
>>>>>
>>>>>
>>>>>           
>>>> Are these patches inserted into SVN or GIT
>>>>
>>>>         
>>> Ignore this - just cloned GIT and its there
>>>       
>> Were your latest regressions run against svn or git clone ?
>>
>> -- Hal
>>
>>     
>>>> Eitan
>>>>
>>>>         
>>>>> -- Hal


From mst at mellanox.co.il  Wed Dec  6 05:26:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 15:26:43 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or
	switch]
In-Reply-To: <4576C33C.7050204@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il>
Message-ID: <20061206132643.GR26787@mellanox.co.il>

> 
> Actually switches that do not have any MCG entry will not be included
> in the dump file.
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> 
> --- osm/opensm/osm_mcast_mgr.c    2006-12-06 12:39:13.018015000 +0200
> +++ osm/opensm/osm_mcast_mgr.c    2006-12-06 12:06:29.602097000 +0200

All, to make integrating patches easier,
please try to actually use git diff to generate patches,
and put patches in following format:

Subject: [PATCH anytext] short log

From: <> <-------- optional author line if not same as person posting
Short explanation for commit log.

Signed-off-by: <>

---

arbitrarily long explanation

patch


-- 
MST


From mst at mellanox.co.il  Wed Dec  6 05:33:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 6 Dec 2006 15:33:42 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
	up/downrouting engines
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
Message-ID: <20061206133342.GS26787@mellanox.co.il>

> Run against SVN.
> Will move to GIT today (hopefully  - if I am able to git clone without
> password ...)

Note that you do *not* need an ssh account just to clone a git tree.
That's why we are running git-daemon on staging.

-- 
MST


From halr at voltaire.com  Wed Dec  6 05:40:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 08:40:44 -0500
Subject: [openib-general] [PATCH] opensm: switch lookups consolidation
 with osm_get_switch_by_guid()
In-Reply-To: <20061126233200.GA25110@sashak.voltaire.com>
References: <20061126233200.GA25110@sashak.voltaire.com>
Message-ID: <1165412405.25587.125366.camel@hal.voltaire.com>

On Sun, 2006-11-26 at 18:32, Sasha Khapyorsky wrote:
> For switch object lookups, instead of code fragments repeated in many
> places, like:
> 
>   p_sw_guid_tbl = &p_mgr->p_subn->sw_guid_tbl;
> 
>   p_sw = (osm_switch_t*)cl_qmap_get( p_sw_guid_tbl, node_guid );
>   if (p_sw == (osm_switch_t*)cl_qmap_end( p_sw_guid_tbl ) ) { ... }
> 
> use already existing "centralized" osm_get_switch_by_guid() function.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal
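
The consolidation described in the quoted patch can be illustrated with a minimal, self-contained sketch. It does not use the real complib cl_qmap API or the real osm_get_switch_by_guid(); every demo_* name below is a hypothetical stand-in, and the table is a plain array rather than a map. The point is only that callers go through one lookup helper and test for NULL instead of repeating the get/compare-with-end idiom.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for osm_switch_t and the subnet's switch table. */
typedef struct demo_switch {
    uint64_t node_guid;
    int      num_ports;
} demo_switch_t;

typedef struct demo_subnet {
    const demo_switch_t *switches;
    size_t               num_switches;
} demo_subnet_t;

/* Single lookup point: returns the switch with this GUID, or NULL. */
static const demo_switch_t *
demo_get_switch_by_guid(const demo_subnet_t *p_subn, uint64_t guid)
{
    size_t i;

    assert(p_subn != NULL);
    for (i = 0; i < p_subn->num_switches; i++)
        if (p_subn->switches[i].node_guid == guid)
            return &p_subn->switches[i];
    return NULL; /* callers check for NULL instead of comparing with an end marker */
}
```

A caller then writes a single NULL check after demo_get_switch_by_guid() rather than repeating the table lookup and end-of-map comparison at every site.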


From halr at voltaire.com  Wed Dec  6 05:59:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 08:59:46 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_inform.c: Removed
 unneeded memory clearing in osm_infr_construct
Message-ID: <1165413570.25587.126121.camel@hal.voltaire.com>

OpenSM/osm_inform.c: Removed unneeded memory clearing in
osm_infr_construct

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 92647ef..8d1a13a 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -70,7 +70,7 @@ void
 osm_infr_construct(
   IN osm_infr_t* const p_infr )
 {
-  memset( p_infr, 0, sizeof(osm_infr_t) );
+
 }
 
 /**********************************************************************


From halr at voltaire.com  Wed Dec  6 06:07:53 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 09:07:53 -0500
Subject: [openib-general] [PATCHv2][MINOR] OpenSM/osm_inform.c: In
 osm_infr_new, remove unneeded call to osm_infr_construct
Message-ID: <1165414065.25587.126426.camel@hal.voltaire.com>

OpenSM/osm_inform.c: In osm_infr_new, remove unneeded call to
osm_infr_construct

This is safer in the long term than the previous patch to remove the
memset from osm_infr_construct.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 92647ef..cd40e5d 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -110,7 +110,6 @@ osm_infr_new(
   p_infr = (osm_infr_t*)malloc( sizeof(osm_infr_t) );
   if( p_infr )
   {
-    osm_infr_construct( p_infr );
     osm_infr_init( p_infr, p_infr_rec );
   }
 
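
As a rough illustration of why the construct call in the allocator can be dropped, here is a minimal sketch. It assumes that the init routine itself zeroes the object via construct before filling it in; that assumption is not confirmed by the patch text above, and all demo_* names and fields are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for osm_infr_t. */
typedef struct demo_infr {
    int lid;
    int subscribed;
} demo_infr_t;

/* Construct: put the object into a known, zeroed state. */
static void demo_infr_construct(demo_infr_t *p_infr)
{
    assert(p_infr != NULL);
    memset(p_infr, 0, sizeof(*p_infr));
}

/* Init: assumed here to construct first, then fill in the record. */
static void demo_infr_init(demo_infr_t *p_infr, int lid)
{
    demo_infr_construct(p_infr);
    p_infr->lid = lid;
}

/* New: allocate and init; a separate construct call would be redundant. */
static demo_infr_t *demo_infr_new(int lid)
{
    demo_infr_t *p_infr = malloc(sizeof(*p_infr));

    if (p_infr)
        demo_infr_init(p_infr, lid);
    return p_infr;
}
```

Keeping the memset inside construct means any other caller that constructs without initializing still gets a zeroed object, which matches the "safer in the long term" remark.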

From halr at voltaire.com  Wed Dec  6 07:08:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 10:08:19 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sa_mcmember_record.c:
 Move some osm_log messages outside of holding lock
Message-ID: <1165417683.25587.128924.camel@hal.voltaire.com>

OpenSM/osm_sa_mcmember_record.c: Move some osm_log messages outside of
holding lock

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 4b06bab..31d1fb5 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1447,12 +1447,6 @@ __osm_mcmr_rcv_leave_mgrp(
         port_join_state & ~(p_recvd_mcmember_rec->scope_state & 0x0F);
       if (new_join_state)
       {
-        osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
-                 "__osm_mcmr_rcv_leave_mgrp: "
-                 "After update JoinState != 0. Updating from 0x%X to 0x%X\n",
-                 port_join_state,
-                 new_join_state
-                 );
         /* Just update the result JoinState */
         p_mcm_port->scope_state =
           new_join_state | (p_mcm_port->scope_state & 0xf0);
@@ -1460,6 +1454,13 @@ __osm_mcmr_rcv_leave_mgrp(
         mcmember_rec.scope_state = p_mcm_port->scope_state;
 
         CL_PLOCK_RELEASE( p_rcv->p_lock );
+
+        osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+                 "__osm_mcmr_rcv_leave_mgrp: "
+                 "After update JoinState != 0. Updating from 0x%X to 0x%X\n",
+                 port_join_state,
+                 new_join_state
+                 );
       }
       else
       {
@@ -1649,6 +1650,8 @@ __osm_mcmr_rcv_join_mgrp(
     }
     else
     {
+      CL_PLOCK_RELEASE( p_rcv->p_lock );
+
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_mcmr_rcv_join_mgrp: ERR 1B11: "
                "method = %s, "
@@ -1665,7 +1668,6 @@ __osm_mcmr_rcv_join_mgrp(
                cl_ntoh64( p_recvd_mcmember_rec->mgid.unicast.interface_id ),
                cl_ntoh64( portguid ) );
 
-      CL_PLOCK_RELEASE( p_rcv->p_lock );
       sa_status = IB_SA_MAD_STATUS_INSUF_COMPS;
       osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
       goto Exit;
@@ -1713,6 +1715,11 @@ __osm_mcmr_rcv_join_mgrp(
  
   if (!valid)
   {
+    /* since we might have created the new group we need to cleanup */
+    __cleanup_mgrp(p_rcv, mlid);
+
+    CL_PLOCK_RELEASE( p_rcv->p_lock );
+
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
              "__osm_mcmr_rcv_join_mgrp: ERR 1B12: "
              "__validate_more_comp_fields, __validate_port_caps, "
@@ -1720,11 +1727,6 @@ __osm_mcmr_rcv_join_mgrp(
              "sending IB_SA_MAD_STATUS_REQ_INVALID\n",
              cl_ntoh64( portguid ) );
 
-    /* since we might have created the new group we need to cleanup */
-    __cleanup_mgrp(p_rcv, mlid);
-
-    CL_PLOCK_RELEASE( p_rcv->p_lock );
-
     sa_status = IB_SA_MAD_STATUS_REQ_INVALID;
     osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
     goto Exit;
@@ -1746,13 +1748,13 @@ __osm_mcmr_rcv_join_mgrp(
                               &p_mcmr_port);
     if (!valid)
     {
+      CL_PLOCK_RELEASE( p_rcv->p_lock );
+
       osm_log( p_rcv->p_log, OSM_LOG_ERROR,
                "__osm_mcmr_rcv_join_mgrp: ERR 1B13: "
                "__validate_modify failed, "
                "sending IB_SA_MAD_STATUS_REQ_INVALID\n" );
 
-      CL_PLOCK_RELEASE( p_rcv->p_lock );
-
       sa_status = IB_SA_MAD_STATUS_REQ_INVALID;
       osm_sa_send_error( p_rcv->p_resp, p_madw, sa_status );
       goto Exit;
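
The pattern this patch applies, releasing the lock before logging, can be sketched in isolation as follows. This is an illustrative example only: it uses a POSIX mutex in place of the CL_PLOCK passive lock, fprintf in place of osm_log, and hypothetical demo_* names. The values to be logged are copied into locals while the lock is held, so the log call needs no shared state.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical stand-in for a multicast member port record. */
typedef struct demo_port {
    pthread_mutex_t lock;
    unsigned        scope_state; /* high nibble: scope, low nibble: JoinState */
} demo_port_t;

/* Clears the requested JoinState bits and returns the new JoinState. */
static unsigned demo_leave(demo_port_t *p_port, unsigned leave_bits)
{
    unsigned old_state, new_state;

    assert(p_port != NULL);

    pthread_mutex_lock(&p_port->lock);
    old_state = p_port->scope_state & 0x0F;
    new_state = old_state & ~(leave_bits & 0x0F);
    p_port->scope_state = new_state | (p_port->scope_state & 0xF0);
    pthread_mutex_unlock(&p_port->lock);

    /* Log from the local copies only, after the lock has been released. */
    fprintf(stderr, "JoinState updated from 0x%X to 0x%X\n",
            old_state, new_state);
    return new_state;
}
```

The trade-off is that the log line may appear slightly after another thread has already changed the record again, which is usually acceptable for debug and error messages.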


From halr at voltaire.com  Wed Dec  6 07:46:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 10:46:40 -0500
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
 up/downrouting engines
In-Reply-To: <4576C4B4.9080608@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
	<4576C4B4.9080608@mellanox.co.il>
Message-ID: <1165419916.25587.130371.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote:
> Hi Hal,
> 
> I just run one iteration of the simulation regression against the git tree:
> The Multicast fails on the change of format of osm.mcfdbs

Is this change in OFED 1.1 too ? If so, can the validation be enhanced
to handle the empty MLID case ? 

> The Stability flows failed on the change of subnet.lst to osm-subnet.lst ...

Yes, this patch went out on the list on 11/29 and committed on 11/30.
We had agreed this would be done after SC. Can the verification be
changed to look for this file so this doesn't fail ?

It also indicated that a similar change is needed to ibutils
Has that been done ?

-- Hal

> doing another loop: run=0 cron=0 hour=14
> OsmStress IS1-16.topo ... PASS
> LidMgr IS1-16.topo ... PASS
> LidMgr IS1-16.topo ... PASS
> LidMgr IS1-16.topo ... PASS
> LidMgr IS3-128.topo ... PASS
> Multicast IS1-16.topo ... FAIL (sleeping 10)
> Multicast IS1-16.topo ... FAIL (sleeping 10)
> Multicast IS1-16.topo ... FAIL (sleeping 10)
> Multicast IS3-128.topo ... FAIL (sleeping 10)
> Multicast IS3-loop.topo ... FAIL (sleeping 10)
> Stability IS1-16.topo ... FAIL (sleeping 10)
> Stability IS1-16.topo ... FAIL (sleeping 10)
> Stability IS1-16.topo ... FAIL (sleeping 10)
> Stability IS3-128.topo ... FAIL (sleeping 10)
> Stability IS3-loop.topo ... FAIL (sleeping 10)
> OsmTest IS1-16.topo ... PASS
> OsmTest IS1-16.topo ... PASS
> OsmTest IS1-16.topo ... PASS
> OsmTest IS3-128.topo ... PASS
> OsmTest IS3-loop.topo ... PASS
> Pkey IS1-16.topo ... PASS
> Pkey IS1-16.topo ... PASS
> Pkey IS1-16.topo ... PASS
> Pkey IS3-128.topo ... PASS
> OsmStress IS1-16.topo ... PASS
> OsmStress IS1-16.topo ... PASS
> OsmStress IS3-128.topo ... PASS
> 
> 
> Eitan Zahavi wrote:
> > Run against SVN.
> > Will move to GIT today (hopefully  - if I am able to git clone without
> > password ...)
> >
> > Eitan Zahavi
> > Senior Engineering Director, Software Architect
> > Mellanox Technologies LTD
> > Tel:+972-4-9097208
> > Fax:+972-4-9593245
> > P.O. Box 586 Yokneam 20692 ISRAEL
> >
> >
> >   
> >> -----Original Message-----
> >> From: Hal Rosenstock [mailto:halr at voltaire.com]
> >> Sent: Wednesday, December 06, 2006 1:12 PM
> >> To: Eitan Zahavi
> >> Cc: openib-general at openib.org
> >> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and
> >> up/downrouting engines
> >>
> >> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
> >>     
> >>> Eitan Zahavi wrote:
> >>>       
> >>>> Hal Rosenstock wrote:
> >>>>
> >>>>         
> >>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> >>>>>
> >>>>>
> >>>>>           
> >>>>>> This updates "file" and "updn" (up/down) routing engines which
> >>>>>> should work properly now with changed LFT setup mechanism.
> >>>>>>
> >>>>>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> >>>>>>
> >>>>>>
> >>>>>>             
> >>>>> Thanks. Applied.
> >>>>>
> >>>>>
> >>>>>           
> >>>> Are these patches inserted into SVN or GIT
> >>>>
> >>>>         
> >>> Ignore this - just cloned GIT and its there
> >>>       
> >> Were your latest regressions run against svn or git clone ?
> >>
> >> -- Hal
> >>
> >>     
> >>>> Eitan
> >>>>
> >>>>         
> >>>>> -- Hal
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> openib-general mailing list
> >>>>> openib-general at openib.org
> >>>>> http://openib.org/mailman/listinfo/openib-general
> >>>>>
> >>>>> To unsubscribe, please visit
> >>>>> http://openib.org/mailman/listinfo/openib-general
> >>>>>
> >>>>>
> >>>>>           
> 


From sweitzen at cisco.com  Wed Dec  6 08:43:47 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 6 Dec 2006 08:43:47 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9486@xmb-sjc-216.amer.cisco.com>

> d. Limitations
> UDP multicast and UDP connections to IPoIB UD mode
> currently don't work since we get packets that are too large to
> send over a UD QP.
> As a work around, one can now create separate interfaces
> for use with CM and UD mode.

You can't send UDP/multicast traffic at all between IPoIB CM and IPoIB
UD?  What about UDP/multicast between IPoIB CM hosts?

Scott


From eitan at mellanox.co.il  Wed Dec  6 09:42:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 19:42:47 +0200
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
 up/downrouting engines
In-Reply-To: <1165419916.25587.130371.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
	<4576C4B4.9080608@mellanox.co.il>
	<1165419916.25587.130371.camel@hal.voltaire.com>
Message-ID: <45770117.2060306@mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote:
>   
>> Hi Hal,
>>
>> I just run one iteration of the simulation regression against the git tree:
>> The Multicast fails on the change of format of osm.mcfdbs
>>     
>
> Is this change in OFED 1.1 too ? If so, can the validation be enhanced
> to handle the empty MLID case ? 
>   
The current (broken) format, where multiple MLIDs appear on one line, is
harder to manage.
I would also need to change ibutils to generate the new format. Whenever
such a format changes, I have to chase it through whatever utility out
there breaks.
I do not see any reason why it had to change. I understand it was broken
by the fix that eliminated the need for opening the file and appending
to it.
Instead of modifying ibutils and the simulator tests, I propose to restore
the previous format using the patch I provided.

>   
>> The Stability flows failed on the change of subnet.lst to osm-subnet.lst ...
>>     
>
> Yes, this patch went out on the list on 11/29 and committed on 11/30.
> We had agreed this would be done after SC. Can the verification be
> changed to look for this file so this doesn't fail ?
>   
Yes this is a simple fix and it was already pushed into ibutils. I 
missed the simulator tests and pushed the change today.
> It also indicated that a similar change is needed to ibutils
> Has that been done ?
>   
Yes, ibutils was modified to accommodate this change.
> -- Hal
>
>   
>> doing another loop: run=0 cron=0 hour=14
>> OsmStress IS1-16.topo ... PASS
>> LidMgr IS1-16.topo ... PASS
>> LidMgr IS1-16.topo ... PASS
>> LidMgr IS1-16.topo ... PASS
>> LidMgr IS3-128.topo ... PASS
>> Multicast IS1-16.topo ... FAIL (sleeping 10)
>> Multicast IS1-16.topo ... FAIL (sleeping 10)
>> Multicast IS1-16.topo ... FAIL (sleeping 10)
>> Multicast IS3-128.topo ... FAIL (sleeping 10)
>> Multicast IS3-loop.topo ... FAIL (sleeping 10)
>> Stability IS1-16.topo ... FAIL (sleeping 10)
>> Stability IS1-16.topo ... FAIL (sleeping 10)
>> Stability IS1-16.topo ... FAIL (sleeping 10)
>> Stability IS3-128.topo ... FAIL (sleeping 10)
>> Stability IS3-loop.topo ... FAIL (sleeping 10)
>> OsmTest IS1-16.topo ... PASS
>> OsmTest IS1-16.topo ... PASS
>> OsmTest IS1-16.topo ... PASS
>> OsmTest IS3-128.topo ... PASS
>> OsmTest IS3-loop.topo ... PASS
>> Pkey IS1-16.topo ... PASS
>> Pkey IS1-16.topo ... PASS
>> Pkey IS1-16.topo ... PASS
>> Pkey IS3-128.topo ... PASS
>> OsmStress IS1-16.topo ... PASS
>> OsmStress IS1-16.topo ... PASS
>> OsmStress IS3-128.topo ... PASS
>>
>>
>> Eitan Zahavi wrote:
>>     
>>> Run against SVN.
>>> Will move to GIT today (hopefully  - if I am able to git clone without
>>> password ...)
>>>
>>> Eitan Zahavi
>>> Senior Engineering Director, Software Architect
>>> Mellanox Technologies LTD
>>> Tel:+972-4-9097208
>>> Fax:+972-4-9593245
>>> P.O. Box 586 Yokneam 20692 ISRAEL
>>>
>>>
>>>   
>>>       
>>>> -----Original Message-----
>>>> From: Hal Rosenstock [mailto:halr at voltaire.com]
>>>> Sent: Wednesday, December 06, 2006 1:12 PM
>>>> To: Eitan Zahavi
>>>> Cc: openib-general at openib.org
>>>> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and
>>>> up/downrouting engines
>>>>
>>>> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
>>>>     
>>>>         
>>>>> Eitan Zahavi wrote:
>>>>>       
>>>>>           
>>>>>> Hal Rosenstock wrote:
>>>>>>
>>>>>>         
>>>>>>             
>>>>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
>>>>>>>
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>>>> This updates "file" and "updn" (up/down) routing engines which
>>>>>>>> should work properly now with changed LFT setup mechanism.
>>>>>>>>
>>>>>>>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
>>>>>>>>
>>>>>>>>
>>>>>>>>             
>>>>>>>>                 
>>>>>>> Thanks. Applied.
>>>>>>>
>>>>>>>
>>>>>>>           
>>>>>>>               
>>>>>> Are these patches inserted into SVN or GIT
>>>>>>
>>>>>>         
>>>>>>             
>>>>> Ignore this - just cloned GIT and its there
>>>>>       
>>>>>           
>>>> Were your latest regressions run against svn or git clone ?
>>>>
>>>> -- Hal
>>>>
>>>>     
>>>>         
>>>>>> Eitan
>>>>>>
>>>>>>         
>>>>>>             
>>>>>>> -- Hal
>>>>>>>
>>>>>>>


From mshefty at ichips.intel.com  Wed Dec  6 09:45:00 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 06 Dec 2006 09:45:00 -0800
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <20061206080317.GG26787@mellanox.co.il>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<000601c714e0$d955fef0$92cc180a@amr.corp.intel.com>
	<20061206080317.GG26787@mellanox.co.il>
Message-ID: <4577019C.7050900@ichips.intel.com>

> Just to clarify this point - what connection messages can be lost?
> E.g. if the passive side does not get an RTU for a while, it will
> retry the REP, won't it?  Diagram 12.9.6 seems to indicate so:
> from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
> Is this implemented?

REP retries are already implemented in the ib_cm.  This handles the case where 
the RTU is repeatedly lost, but data is still received on the connection.

- Sean


From eitan at mellanox.co.il  Wed Dec  6 10:02:01 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 20:02:01 +0200
Subject: [openib-general] osm: More simulation failures on trunk
Message-ID: <45770599.7080005@mellanox.co.il>

Hi Hal,

Looks like the osm.fdbs file is now created with an "UNREACHABLE" mark when
opensm is invoked with the updn routing engine. I will be working on finding
what changed between OFED 1.1 and the trunk.
This is another cause of the failures of all osmMulticastRoutingTest and
osmStability test runs.

Another one would be the change to osm.mcfdbs, which is parsed by IBDM too.

Eitan


From elsen_david at yahoo.com  Wed Dec  6 10:03:49 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 10:03:49 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <4570CAA8.5080806@cse.ohio-state.edu>
Message-ID: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>

Shaun / Steve,

To get past the "librdmacm.so: cannot open shared object file: No such file or
directory" error message, LD_RUN_PATH also needs to be set.

Anyway, now that I am able to run the mvapich2 0.9.8-Release, I am trying to figure out how to run the various benchmark tests using this MPI tool.

Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also want to run the OSC benchmark tests. Are any guidelines available for these, please?
Thanks,
David


Shaun Rowland <rowland at cse.ohio-state.edu> wrote:

Steve Wise wrote:
> I haven't tested mvapich2 with ammasso.  But OSU has. I'm CCing their
> dev team so maybe they can help.
> 
> Steve.
> 
> 
> 
> On Fri, 2006-12-01 at 14:58 -0800, david elsen wrote:
>> Steve,
>>
>> I can run rping, rdma_lat etc on the Ammasso card but when I try to
>> run the mvapich2 (0.9.8-Release), I get librdmacm.so missing error. 
>>
>> ./mpdboot -n 1
>> debug: starting
>> /root/0.9.8-RELEASE/bin/mpdroot: error while loading shared libraries:
>> librdmacm.so: cannot open shared object file: No such file or
>> directory
>> running mpdallexit on ammasso1
>> LAUNCHED mpd on ammasso1 via  
>> debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
>> debug: mpd on ammasso1 on port 35352
>> RUNNING: mpd on ammasso1
>> debug: info for running mpd: {'ncpus': 1, 'list_port': 35352,
>> 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}

Hello David and Steve. We discussed this problem in detail on the
mvapich-discuss list recently. David, you indicated the following in
your last email about this to mvapich-discuss on 11/26/2006:

"For some reason, it is working in SuSE, and not working in Fedora."

Is this still the case? Were the libraries built specifically on the
Fedora Core 6 system, or are you using libraries that were built on
SuSE? I assume they were built on Fedora Core 6. Were you trying to run
this as root or as a regular user? I am not sure exactly how this might
affect shared library loading, but it is possible there is a difference.

In our previous discussion, your library path did indeed have a
librdmacm.so file, though it could not be loaded for an unknown reason.
It is unclear to me if this email thread indicates that you have tried
to rebuild that and are experiencing the same issue. Were you able to
try running that test shared library example I gave and did it work? Did
it work as the same user you are trying to run MVAPICH as?

It seems clear this is a runtime loader problem on Fedora Core 6, or on
your particular configuration there. That is what cannot find the
library. It is similar to the libtest code I provided as an example:

[rowland at e14-oib libtest]$ ls
Makefile  test.c  test.h  test-program.c

[rowland at e14-oib libtest]$ make normal
gcc -c -fPIC test.c
gcc -shared -Wl,-soname,libtest.so.1 -o libtest.so.1.0 test.o
ln -s libtest.so.1.0 libtest.so.1
ln -s libtest.so.1 libtest.so
gcc    -c -o test-program.o test-program.c
gcc -o test-program test-program.o -L/home/7/rowland/libtest -ltest

[rowland at e14-oib libtest]$ ldd test-program
         libtest.so.1 => not found
         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
./test-program: error while loading shared libraries: libtest.so.1: cannot open shared object file: No such file or directory

[rowland at e14-oib libtest]$ export LD_LIBRARY_PATH=$PWD

[rowland at e14-oib libtest]$ ldd test-program
         libtest.so.1 => /home/7/rowland/libtest/libtest.so.1 (0x00002abbf9aee000)
         libc.so.6 => /lib64/tls/libc.so.6 (0x0000003bf1900000)
         /lib64/ld-linux-x86-64.so.2 (0x0000003bf1700000)

[rowland at e14-oib libtest]$ ./test-program
In shared library function...

In previous email your ldd output showed the library was being found:

Please see the output of ldd /usr/local/mvapich2/bin/mpdroot :
[root at ammasso1 ~]# ldd /usr/local/mvapich2/bin/mpdroot
         linux-gate.so.1 =>  (0xffffe000)
         librdmacm.so => /usr/local/lib/librdmacm.so (0xb7fec000)
         libibverbs.so.2 => /usr/local/lib/libibverbs.so.2 (0xb7fe5000)
         libibumad.so.1 => /usr/local/lib/libibumad.so.1 (0xb7fdc000)
         libpthread.so.0 => /lib/libpthread.so.0 (0x0012a000)
         libc.so.6 => /lib/libc.so.6 (0x00ca7000)
         libsysfs.so.2 => /usr/lib/libsysfs.so.2 (0x00369000)
         libdl.so.2 => /lib/libdl.so.2 (0x00de6000)
         libibcommon.so.1 => /usr/local/lib/libibcommon.so.1 (0xb7fcb000)
         /lib/ld-linux.so.2 (0x002d8000)

But that path is different than the one you are quoting above. Does an
ldd on /root/0.9.8-RELEASE/bin/mpdroot find librdmacm.so too, as the
same user you are trying to run it as?

I have one more idea for you to try here. You can do the following:

export LD_DEBUG=all
/root/0.9.8-RELEASE/bin/mpdroot >&output
unset LD_DEBUG

Then take a look at the output file to see if there are any relevant
error messages. Don't forget to unset LD_DEBUG before doing anything else.
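
As a complementary check, a small probe can ask the runtime loader directly whether it can resolve a library name and print the loader's own error text. This is only a sketch: demo_probe_library is a hypothetical helper name, it uses the standard dlopen/dlerror/dlclose calls, and on older glibc versions it needs to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>

/* Returns 0 if the runtime loader can resolve and load `name`, -1 otherwise. */
static int demo_probe_library(const char *name)
{
    void *handle;

    assert(name != NULL);
    handle = dlopen(name, RTLD_NOW);
    if (handle == NULL) {
        /* dlerror() reports why the loader could not open the library */
        fprintf(stderr, "dlopen(%s) failed: %s\n", name, dlerror());
        return -1;
    }
    dlclose(handle);
    return 0;
}
```

Calling demo_probe_library("librdmacm.so") as the same user and with the same environment as the failing mpdroot run should show whether the loader's search path is the problem.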

Also, just to be sure, if you run "file " what does it say? It should
indicate that it is a shared library, similar to:

[rowland at e14-oib libtest]$ file /usr/local/ofed/lib64/librdmacm.so*
/usr/local/ofed/lib64/librdmacm.so:       symbolic link to `librdmacm.so.0.9.0'
/usr/local/ofed/lib64/librdmacm.so.0.9.0: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped

Unfortunately, we do not have any Fedora Core 6 systems to investigate
this problem on at this time, and I don't know anything about what might
be there that would cause a problem. As far as I know, there shouldn't
be. However, it seems there is some runtime issue on your Fedora Core 6
machine or with how this is being run there. If it is in fact working on
another distribution as you indicated in your previous response to us,
then that also strongly points in this direction.
-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/



From mshefty at ichips.intel.com  Wed Dec  6 09:51:03 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 06 Dec 2006 09:51:03 -0800
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <457692E5.2050800@voltaire.com>
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>
	<adaejrec9lw.fsf@cisco.com> <457692E5.2050800@voltaire.com>
Message-ID: <45770307.5060101@ichips.intel.com>

> Sean - as of the stability issues reported by Mellanox I understand you 
> have decided not to push the multicast code for 2.6.20 and I see that 
> the focus now is on finding the bug. Once this is solved I would like to 
> provide more feedback before you publish v3 - does it makes sense?

The multicast code that I had has been added as a branch to my rdma-dev git tree 
that's available from the openfabrics server.  A corresponding branch is in the 
librdmacm tree.  I have not had time yet to update the multicast code based on 
the latest feedback.

- Sean


From shubbell at dbresearch.net  Wed Dec  6 09:52:50 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Wed, 06 Dec 2006 11:52:50 -0600
Subject: [openib-general] Multicast Group Routing Question
Message-ID: <45770372.8010700@dbresearch.net>

Hello,

  I was testing our code and noticed that when I send data using
multicast over our ib0 interface, all of the InfiniBand switches route
the data to every switch and every node, instead of only to nodes that have
an application listening on the interface, as with Ethernet. Is this by design?

Thanks in advance,

Sean


From rdreier at cisco.com  Wed Dec  6 10:11:31 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 06 Dec 2006 10:11:31 -0800
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <457692E5.2050800@voltaire.com> (Or Gerlitz's message of
	"Wed, 06 Dec 2006 11:52:37 +0200")
References: <000301c714df$8ce57920$92cc180a@amr.corp.intel.com>
	<15ddcffd0612051324l58969f4wb9dee25256f14f8f@mail.gmail.com>
	<adaejrec9lw.fsf@cisco.com> <457692E5.2050800@voltaire.com>
Message-ID: <ada1wncc2ss.fsf@cisco.com>

 > + 5/5 is the CMA user space support. I only did a light review of it
 > but my understanding is that Sean used the in kernel ib_ucm
 > design/code as the base line for this driver so there should be no
 > special issues here.

OK, I'll have to take a close look at this.  ucm has known-broken
object lifetime handling (probably oopsable from userspace)


From halr at voltaire.com  Wed Dec  6 10:14:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 13:14:10 -0500
Subject: [openib-general] [PATCH 5/5] opensm: updates file and
 up/downrouting engines
In-Reply-To: <45770117.2060306@mellanox.co.il>
References: <6C2C79E72C305246B504CBA17B5500C96DF377@mtlexch01.mtl.com>
	<4576C4B4.9080608@mellanox.co.il>
	<1165419916.25587.130371.camel@hal.voltaire.com>
	<45770117.2060306@mellanox.co.il>
Message-ID: <1165428839.25587.136467.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 12:42, Eitan Zahavi wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Wed, 2006-12-06 at 08:25, Eitan Zahavi wrote:
> >   
> >> Hi Hal,
> >>
> >> I just run one iteration of the simulation regression against the git tree:
> >> The Multicast fails on the change of format of osm.mcfdbs
> >>     
> >
> > Is this change in OFED 1.1 too ? If so, can the validation be enhanced
> > to handle the empty MLID case ? 
> >   
> The current format (broken) where multiple MLIDs appear on one line is 
> harder to manage.

Is it true for OFED 1.1 as well ?

Also, I'm in the process of incorporating your change.

> I will also need to change ibutils to generate the new format. Whenever 
> such a format changes
> I have to chase it through whatever utility is out there that breaks.
> I do not see any reason why it had to change. I understand it was broken 
> by the fix that eliminate the need for
> opening the file and appending to it.

I'm not sure it was done by design or accident. Anyhow, the patches were
out on the list for quite some time without comment.

> Instead of modifying ibutils and the simulator tests I propose to fix it 
> back to what it was
> using the patch I provided.

I'm in the process of incorporating your patch.

-- Hal

> >> The Stability flows failed on the change of subnet.lst to osm-subnet.lst ...
> >>     
> >
> > Yes, this patch went out on the list on 11/29 and committed on 11/30.
> > We had agreed this would be done after SC. Can the verification be
> > changed to look for this file so this doesn't fail ?
> >   
> Yes this is a simple fix and it was already pushed into ibutils. I 
> missed the simulator tests and pushed the change today.
> > It also indicated that a similar change is needed to ibutils
> > Has that been done ?
> >   
> Yes ibutils modified to accommodate for this change.
> > -- Hal
> >
> >   
> >> doing another loop: run=0 cron=0 hour=14
> >> OsmStress IS1-16.topo ... PASS
> >> LidMgr IS1-16.topo ... PASS
> >> LidMgr IS1-16.topo ... PASS
> >> LidMgr IS1-16.topo ... PASS
> >> LidMgr IS3-128.topo ... PASS
> >> Multicast IS1-16.topo ... FAIL (sleeping 10)
> >> Multicast IS1-16.topo ... FAIL (sleeping 10)
> >> Multicast IS1-16.topo ... FAIL (sleeping 10)
> >> Multicast IS3-128.topo ... FAIL (sleeping 10)
> >> Multicast IS3-loop.topo ... FAIL (sleeping 10)
> >> Stability IS1-16.topo ... FAIL (sleeping 10)
> >> Stability IS1-16.topo ... FAIL (sleeping 10)
> >> Stability IS1-16.topo ... FAIL (sleeping 10)
> >> Stability IS3-128.topo ... FAIL (sleeping 10)
> >> Stability IS3-loop.topo ... FAIL (sleeping 10)
> >> OsmTest IS1-16.topo ... PASS
> >> OsmTest IS1-16.topo ... PASS
> >> OsmTest IS1-16.topo ... PASS
> >> OsmTest IS3-128.topo ... PASS
> >> OsmTest IS3-loop.topo ... PASS
> >> Pkey IS1-16.topo ... PASS
> >> Pkey IS1-16.topo ... PASS
> >> Pkey IS1-16.topo ... PASS
> >> Pkey IS3-128.topo ... PASS
> >> OsmStress IS1-16.topo ... PASS
> >> OsmStress IS1-16.topo ... PASS
> >> OsmStress IS3-128.topo ... PASS
> >>
> >>
> >> Eitan Zahavi wrote:
> >>     
> >>> Run against SVN.
> >>> Will move to GIT today (hopefully  - if I am able to git clone without
> >>> password ...)
> >>>
> >>> Eitan Zahavi
> >>> Senior Engineering Director, Software Architect
> >>> Mellanox Technologies LTD
> >>> Tel:+972-4-9097208
> >>> Fax:+972-4-9593245
> >>> P.O. Box 586 Yokneam 20692 ISRAEL
> >>>
> >>>
> >>>   
> >>>       
> >>>> -----Original Message-----
> >>>> From: Hal Rosenstock [mailto:halr at voltaire.com]
> >>>> Sent: Wednesday, December 06, 2006 1:12 PM
> >>>> To: Eitan Zahavi
> >>>> Cc: openib-general at openib.org
> >>>> Subject: Re: [openib-general] [PATCH 5/5] opensm: updates file and
> >>>> up/downrouting engines
> >>>>
> >>>> On Wed, 2006-12-06 at 05:39, Eitan Zahavi wrote:
> >>>>     
> >>>>         
> >>>>> Eitan Zahavi wrote:
> >>>>>       
> >>>>>           
> >>>>>> Hal Rosenstock wrote:
> >>>>>>
> >>>>>>         
> >>>>>>             
> >>>>>>> On Sun, 2006-11-26 at 17:30, Sasha Khapyorsky wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>           
> >>>>>>>               
> >>>>>>>> This updates "file" and "updn" (up/down) routing engines which
> >>>>>>>> should work properly now with changed LFT setup mechanism.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>             
> >>>>>>>>                 
> >>>>>>> Thanks. Applied.
> >>>>>>>
> >>>>>>>
> >>>>>>>           
> >>>>>>>               
> >>>>>> Are these patches inserted into SVN or GIT?
> >>>>>>
> >>>>>>         
> >>>>>>             
> >>>>> Ignore this - just cloned GIT and it's there
> >>>>>       
> >>>>>           
> >>>> Were your latest regressions run against svn or git clone ?
> >>>>
> >>>> -- Hal
> >>>>
> >>>>     
> >>>>         
> >>>>>> Eitan
> >>>>>>
> >>>>>>         
> >>>>>>             
> >>>>>>> -- Hal
> >>>>>>>
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> openib-general mailing list
> >>>>>>> openib-general at openib.org
> >>>>>>> http://openib.org/mailman/listinfo/openib-general
> >>>>>>>
> >>>>>>> To unsubscribe, please visit
> >>>>>>> http://openib.org/mailman/listinfo/openib-general
> >>>>>>>
> >>>>>>>
> >>>>>>>           
> >>>>>>>               
> >>>>>>
> >>>>>>         
> >>>>>>             
> >>>   
> >>>       
> >
> >
> >   
> 


From ralph.campbell at qlogic.com  Wed Dec  6 10:16:34 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 06 Dec 2006 10:16:34 -0800
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <4576AA73.105@voltaire.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	<1165359560.14800.210.camel@brick.pathscale.com>
	<4576AA73.105@voltaire.com>
Message-ID: <1165428994.14800.229.camel@brick.pathscale.com>

On Wed, 2006-12-06 at 13:33 +0200, Or Gerlitz wrote:
> Ralph Campbell wrote:
> > On Tue, 2006-12-05 at 23:21 +0200, Or Gerlitz wrote:
> >> On 12/5/06, Roland Dreier <rdreier at cisco.com> wrote:
> > I am not following what you two are saying.
> 
> > The ib_dma_mapping_ops functions as implemented by ib_ipath,
> > are redefining dma_addr_t as a kernel virtual address.
> > When ib_dma_map_single() is called, this is a NOP.
> > When ib_dma_map_sg() is called, the dma_map_sg() replacement needs
> > to convert a struct page pointer into a kernel virtual address.
> > When CONFIG_HIGHMEM is defined, some pages may not be mapped
> > into the kernel virtual address space so the driver needs to
> > call kmap().  Since the driver can't use the struct scatterlist
> > to store the kmap() result, a separate table needs to be used
> > so the value can be returned by ib_sg_dma_address().
> 
> Indeed.
> 
> > Doing kmap_atomic() at the point where the kernel virtual
> > address is used is not practical since the driver is not
> > mapping dma_addr_t to struct page * although it is
> > possible to write it that way.  It would mean that
> > ib_dma_map_single() would then be more complex in that a
> > kernel virtual address would need to be converted to a
> > struct page *.
> 
> Basically, what Roland suggests is that you implement a SW IOTLB
> mapping from dma_addr_t (possibly offset) to kernel virtual address, and
> do the actual kmap/unmap calls before/after you must touch the data.
> 
> Is this impossible?
> 
> Or.

It is not impossible, just inefficient.  Why add a mapping
table when it isn't needed?  If I needed to implement HIGHMEM
support, I would probably make "dma_addr_t" be a physical
memory address, convert to PFN, find the struct page pointer,
and call kmap_atomic() or page_address().  Why go through all
that in the worst case CPU path when doing the conversion
to kernel virtual address outside the critical path is
feasible?
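
For illustration only, the address arithmetic behind the "physical
address, convert to PFN, find the page" alternative can be sketched in
user space.  This is a minimal sketch under stated assumptions: every
demo_* name is hypothetical, a small static array stands in for
physical memory, and the array lookup stands in for the struct page
lookup plus kmap_atomic()/page_address().  It is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical page geometry for the sketch (4 KiB pages). */
#define DEMO_PAGE_SHIFT	12
#define DEMO_PAGE_SIZE	(1u << DEMO_PAGE_SHIFT)
#define DEMO_NPAGES	4

/* Stand-in for physical memory: DEMO_NPAGES contiguous pages. */
static unsigned char demo_memory[DEMO_NPAGES][DEMO_PAGE_SIZE];

/*
 * Split a "physical" address into a page frame number and an offset,
 * then return a usable pointer.  In a kernel the array lookup would be
 * a struct page lookup followed by kmap_atomic() or page_address().
 */
static void *demo_phys_to_kvaddr(uint64_t phys)
{
	uint64_t pfn = phys >> DEMO_PAGE_SHIFT;
	uint64_t off = phys & (DEMO_PAGE_SIZE - 1);

	assert(pfn < DEMO_NPAGES);	/* address must fall inside the demo memory */
	return &demo_memory[pfn][off];
}
```

With these definitions, demo_phys_to_kvaddr(0x2010) refers to byte 0x10
of page 2.  The point made above is that this per-access conversion
sits in the hot path, whereas converting once at mapping time does not.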


From halr at voltaire.com  Wed Dec  6 10:27:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 13:27:08 -0500
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <45770372.8010700@dbresearch.net>
References: <45770372.8010700@dbresearch.net>
Message-ID: <1165429589.25587.136986.camel@hal.voltaire.com>

Hi Sean,

On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
> Hello,
> 
>   I was testing our code and noticed that when I send data using 
> multicast over our ib0 interface, all of the infiniband switches route 
> the data to each switch and each node instead of a node that has an 
> application listening to the interface like Ethernet. Is this by design?

It depends on what multicast group is being used and which end nodes
have registered for that group as to where the data is routed.

-- Hal

> Thanks in advance,
> 
> Sean
> 
> 


From swise at opengridcomputing.com  Wed Dec  6 10:27:32 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 06 Dec 2006 12:27:32 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
References: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
Message-ID: <1165429652.25183.16.camel@stevo-desktop>

On Wed, 2006-12-06 at 10:03 -0800, david elsen wrote:
> Shaun / Steve,
> 
> To get past the "librdmacm.so: cannot open shared object file: No such
> file or directory" error message, LD_RUN_PATH also needs to be set.
> 
> Anyway, after I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI
> tool.
> 
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also
> want to run the OSC benchmark tests. Are any guidelines available for
> these, please?
> Thanks,
> David

I've run IMB benchmarks (aka Pallas) on mvapich2 0.9.8 over iWARP.  The
mvapich2 user guide explains how to start up mpd daemons and use
mpiexec.  It's fairly straightforward.  You need ssh or rsh access and
you need to set up a few files.

Then pull down IMB and build it.

To run 2 node IMB-MPI1 tests, you do something like this:

$ mpdboot -n 2
$ mpiexec -n 2 <path-to-imb>/IMB-MPI1 

This will run the entire MPI1 suite.


Steve.


From mshefty at ichips.intel.com  Wed Dec  6 10:23:31 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 06 Dec 2006 10:23:31 -0800
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
 RDMA CM etc
In-Reply-To: <20061206101705.GP26787@mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
	<20061206101705.GP26787@mellanox.co.il>
Message-ID: <45770AA3.2040505@ichips.intel.com>

>>>I gather the ucma bits are in rdma_ucm?

Yes.

Basically, I reworked changes that were in svn into separate branches based off 
of 2.6.19.

> 1st is probably to fix the mcast bits so that they don't crash the machine.
> OFED will be based on whatever is merged by Linus by that time + any number of patches
> and out of kernel modules.

Even if the kernel multicast support could make it into 2.6.20, I won't have the 
multicast changes to the rdma_cm done by then.

>>3rd have Sean decide how he wants the multicast support to be integrated 
>>into OFED 1.2, my guess would be as a patch set over the 
>>ib_sa/ipoib/rdma_cm and rdma_ucm but it's left for him to decide

Does OFED want the multicast support in 1.2?

> Maybe the right thing is to split the multicast stuff in a separate library,
> or have a separate ABI version for multicast, I don't really know.

My anticipation is that the multicast support will bump the ABI, but will allow 
backwards compatibility.  The break from librdmacm ABI 2 to ABI 3 is a result of 
changing the event reporting.

- Sean


From ralphc at pathscale.com  Wed Dec  6 10:31:59 2006
From: ralphc at pathscale.com (Ralph Campbell)
Date: Wed, 06 Dec 2006 10:31:59 -0800
Subject: [openib-general] [PATCH v3 1/7] IB/core - Add DMA mapping functions
 to allow device drivers to interpose
Message-ID: <1165429919.14800.238.camel@brick.pathscale.com>

This version of the patch adds ib_dma_alloc_coherent()
and ib_dma_free_coherent() to the list of wrapped DMA
functions.  The earlier V2 patches are the same since
this addition doesn't affect them.


The QLogic InfiniPath HCAs use programmed I/O instead of HW DMA.
This patch allows a verbs device driver to interpose on DMA mapping
function calls in order to avoid relying on bus_to_virt() and
phys_to_virt() to undo the mappings created by dma_map_single(),
dma_map_sg(), etc.
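
The interposing pattern used by the ib_dma_*() wrappers in the patch
below can be sketched in user space as follows.  This is only a
minimal sketch under stated assumptions: the demo_* types and functions
are hypothetical stand-ins for struct ib_device, struct
ib_dma_mapping_ops and dma_map_single(), and the "bus offset" is an
invented way to make the default path visibly different from the
override.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_device;

/* Hypothetical, cut-down analogue of struct ib_dma_mapping_ops. */
struct demo_dma_ops {
	uint64_t (*map_single)(struct demo_device *dev, void *ptr, size_t size);
};

/* Hypothetical, cut-down analogue of struct ib_device. */
struct demo_device {
	const struct demo_dma_ops *dma_ops;	/* NULL means "use the default" */
	uint64_t bus_offset;			/* invented: default adds this */
};

/* Default path, standing in for dma_map_single(). */
static uint64_t demo_default_map_single(struct demo_device *dev,
					void *ptr, size_t size)
{
	(void)size;
	return (uint64_t)(uintptr_t)ptr + dev->bus_offset;
}

/* Override for a programmed I/O device: hand back the CPU address. */
static uint64_t demo_pio_map_single(struct demo_device *dev,
				    void *ptr, size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)ptr;
}

static const struct demo_dma_ops demo_pio_ops = {
	.map_single = demo_pio_map_single,
};

/* Wrapper: use the driver's op if one is installed, else the default. */
static uint64_t demo_map_single(struct demo_device *dev, void *ptr, size_t size)
{
	assert(dev != NULL);
	return dev->dma_ops ?
		dev->dma_ops->map_single(dev, ptr, size) :
		demo_default_map_single(dev, ptr, size);
}
```

A device that leaves dma_ops NULL keeps the default behaviour, so
callers can be converted to the wrapper without changing results for
drivers that do not install an ops table.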

From: Ralph Campbell <ralph.campbell at qlogic.com>

diff -r c76ed2f1387b include/rdma/ib_verbs.h
--- a/include/rdma/ib_verbs.h	Wed Nov 29 13:28:14 2006 +0800
+++ b/include/rdma/ib_verbs.h	Tue Dec 05 15:50:07 2006 -0800
@@ -43,6 +43,8 @@
 
 #include <linux/types.h>
 #include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
 
 #include <asm/atomic.h>
 #include <asm/scatterlist.h>
@@ -846,6 +848,49 @@ struct ib_cache {
 	struct ib_pkey_cache  **pkey_cache;
 	struct ib_gid_cache   **gid_cache;
 	u8                     *lmc_cache;
+};
+
+struct ib_dma_mapping_ops {
+	int		(*mapping_error)(struct ib_device *dev,
+					 u64 dma_addr);
+	u64		(*map_single)(struct ib_device *dev,
+				      void *ptr, size_t size,
+				      enum dma_data_direction direction);
+	void		(*unmap_single)(struct ib_device *dev,
+					u64 addr, size_t size,
+					enum dma_data_direction direction);
+	u64		(*map_page)(struct ib_device *dev,
+				    struct page *page, unsigned long offset,
+				    size_t size,
+				    enum dma_data_direction direction);
+	void		(*unmap_page)(struct ib_device *dev,
+				      u64 addr, size_t size,
+				      enum dma_data_direction direction);
+	int		(*map_sg)(struct ib_device *dev,
+				  struct scatterlist *sg, int nents,
+				  enum dma_data_direction direction);
+	void		(*unmap_sg)(struct ib_device *dev,
+				    struct scatterlist *sg, int nents,
+				    enum dma_data_direction direction);
+	u64		(*dma_address)(struct ib_device *dev,
+				       struct scatterlist *sg);
+	unsigned int	(*dma_len)(struct ib_device *dev,
+				   struct scatterlist *sg);
+	void		(*sync_single_for_cpu)(struct ib_device *dev,
+					       u64 dma_handle,
+					       size_t size,
+					       enum dma_data_direction dir);
+	void		(*sync_single_for_device)(struct ib_device *dev,
+						  u64 dma_handle,
+						  size_t size,
+						  enum dma_data_direction dir);
+	void		*(*alloc_coherent)(struct ib_device *dev,
+					   size_t size,
+					   u64 *dma_handle,
+					   gfp_t flag);
+	void		(*free_coherent)(struct ib_device *dev,
+					 size_t size, void *cpu_addr,
+					 u64 dma_handle);
 };
 
 struct iw_cm_verbs;
@@ -992,6 +1037,8 @@ struct ib_device {
 						  struct ib_mad *in_mad,
 						  struct ib_mad *out_mad);
 
+	struct ib_dma_mapping_ops   *dma_ops;
+
 	struct module               *owner;
 	struct class_device          class_dev;
 	struct kobject               ports_parent;
@@ -1395,8 +1442,214 @@ static inline int ib_req_ncomp_notif(str
  *   usable for DMA.
  * @pd: The protection domain associated with the memory region.
  * @mr_access_flags: Specifies the memory access rights.
+ *
+ * Note that the ib_dma_*() functions defined below must be used
+ * to create/destroy addresses used with the Lkey or Rkey returned
+ * by ib_get_dma_mr().
  */
 struct ib_mr *ib_get_dma_mr(struct ib_pd *pd, int mr_access_flags);
+
+/**
+ * ib_dma_mapping_error - check a DMA addr for error
+ * @dev: The device for which the dma_addr was created
+ * @dma_addr: The DMA address to check
+ */
+static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->mapping_error(dev, dma_addr) :
+		dma_mapping_error(dma_addr);
+}
+
+/**
+ * ib_dma_map_single - Map a kernel virtual address to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @cpu_addr: The kernel virtual address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static inline u64 ib_dma_map_single(struct ib_device *dev,
+				    void *cpu_addr, size_t size,
+				    enum dma_data_direction direction)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->map_single(dev, cpu_addr, size, direction) :
+		dma_map_single(dev->dma_device, cpu_addr, size, direction);
+}
+
+/**
+ * ib_dma_unmap_single - Destroy a mapping created by ib_dma_map_single()
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static inline void ib_dma_unmap_single(struct ib_device *dev,
+				       u64 addr, size_t size,
+				       enum dma_data_direction direction)
+{
+	dev->dma_ops ?
+		dev->dma_ops->unmap_single(dev, addr, size, direction) :
+		dma_unmap_single(dev->dma_device, addr, size, direction);
+}
+
+/**
+ * ib_dma_map_page - Map a physical page to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static inline u64 ib_dma_map_page(struct ib_device *dev,
+				  struct page *page,
+				  unsigned long offset,
+				  size_t size,
+				  enum dma_data_direction direction)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->map_page(dev, page, offset, size, direction) :
+		dma_map_page(dev->dma_device, page, offset, size, direction);
+}
+
+/**
+ * ib_dma_unmap_page - Destroy a mapping created by ib_dma_map_page()
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static inline void ib_dma_unmap_page(struct ib_device *dev,
+				     u64 addr, size_t size,
+				     enum dma_data_direction direction)
+{
+	dev->dma_ops ?
+		dev->dma_ops->unmap_page(dev, addr, size, direction) :
+		dma_unmap_page(dev->dma_device, addr, size, direction);
+}
+
+/**
+ * ib_dma_map_sg - Map a scatter/gather list to DMA addresses
+ * @dev: The device for which the DMA addresses are to be created
+ * @sg: The array of scatter/gather entries
+ * @nents: The number of scatter/gather entries
+ * @direction: The direction of the DMA
+ */
+static inline int ib_dma_map_sg(struct ib_device *dev,
+				struct scatterlist *sg, int nents,
+				enum dma_data_direction direction)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->map_sg(dev, sg, nents, direction) :
+		dma_map_sg(dev->dma_device, sg, nents, direction);
+}
+
+/**
+ * ib_dma_unmap_sg - Unmap a scatter/gather list of DMA addresses
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The array of scatter/gather entries
+ * @nents: The number of scatter/gather entries
+ * @direction: The direction of the DMA
+ */
+static inline void ib_dma_unmap_sg(struct ib_device *dev,
+				   struct scatterlist *sg, int nents,
+				   enum dma_data_direction direction)
+{
+	dev->dma_ops ?
+		dev->dma_ops->unmap_sg(dev, sg, nents, direction) :
+		dma_unmap_sg(dev->dma_device, sg, nents, direction);
+}
+
+/**
+ * ib_sg_dma_address - Return the DMA address from a scatter/gather entry
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The scatter/gather entry
+ */
+static inline u64 ib_sg_dma_address(struct ib_device *dev,
+				    struct scatterlist *sg)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->dma_address(dev, sg) : sg_dma_address(sg);
+}
+
+/**
+ * ib_sg_dma_len - Return the DMA length from a scatter/gather entry
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The scatter/gather entry
+ */
+static inline unsigned int ib_sg_dma_len(struct ib_device *dev,
+					 struct scatterlist *sg)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->dma_len(dev, sg) : sg_dma_len(sg);
+}
+
+/**
+ * ib_dma_sync_single_for_cpu - Prepare DMA region to be accessed by CPU
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @dir: The direction of the DMA
+ */
+static inline void ib_dma_sync_single_for_cpu(struct ib_device *dev,
+					      u64 addr,
+					      size_t size,
+					      enum dma_data_direction dir)
+{
+	dev->dma_ops ?
+		dev->dma_ops->sync_single_for_cpu(dev, addr, size, dir) :
+		dma_sync_single_for_cpu(dev->dma_device, addr, size, dir);
+}
+
+/**
+ * ib_dma_sync_single_for_device - Prepare DMA region to be accessed by device
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @dir: The direction of the DMA
+ */
+static inline void ib_dma_sync_single_for_device(struct ib_device *dev,
+						 u64 addr,
+						 size_t size,
+						 enum dma_data_direction dir)
+{
+	dev->dma_ops ?
+		dev->dma_ops->sync_single_for_device(dev, addr, size, dir) :
+		dma_sync_single_for_device(dev->dma_device, addr, size, dir);
+}
+
+/**
+ * ib_dma_alloc_coherent - Allocate memory and map it for DMA
+ * @dev: The device for which the DMA address is requested
+ * @size: The size of the region to allocate in bytes
+ * @dma_handle: A pointer for returning the DMA address of the region
+ * @flag: memory allocator flags
+ */
+static inline void *ib_dma_alloc_coherent(struct ib_device *dev,
+					   size_t size,
+					   u64 *dma_handle,
+					   gfp_t flag)
+{
+	return dev->dma_ops ?
+		dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag) :
+		dma_alloc_coherent(dev->dma_device, size, dma_handle, flag);
+}
+
+/**
+ * ib_dma_free_coherent - Free memory allocated by ib_dma_alloc_coherent()
+ * @dev: The device for which the DMA addresses were allocated
+ * @size: The size of the region
+ * @cpu_addr: the address returned by ib_dma_alloc_coherent()
+ * @dma_handle: the DMA address returned by ib_dma_alloc_coherent()
+ */
+static inline void ib_dma_free_coherent(struct ib_device *dev,
+					size_t size, void *cpu_addr,
+					u64 dma_handle)
+{
+	dev->dma_ops ?
+		dev->dma_ops->free_coherent(dev, size, cpu_addr, dma_handle) :
+		dma_free_coherent(dev->dma_device, size, cpu_addr, dma_handle);
+}
 
 /**
  * ib_reg_phys_mr - Prepares a virtually addressed memory region for use
diff -r c76ed2f1387b drivers/infiniband/core/mad.c
--- a/drivers/infiniband/core/mad.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/core/mad.c	Wed Nov 29 13:54:36 2006 -0800
@@ -999,16 +999,16 @@ int ib_send_mad(struct ib_mad_send_wr_pr
 
 	mad_agent = mad_send_wr->send_buf.mad_agent;
 	sge = mad_send_wr->sg_list;
-	sge[0].addr = dma_map_single(mad_agent->device->dma_device,
-				     mad_send_wr->send_buf.mad,
-				     sge[0].length,
-				     DMA_TO_DEVICE);
+	sge[0].addr = ib_dma_map_single(mad_agent->device,
+					mad_send_wr->send_buf.mad,
+					sge[0].length,
+					DMA_TO_DEVICE);
 	pci_unmap_addr_set(mad_send_wr, header_mapping, sge[0].addr);
 
-	sge[1].addr = dma_map_single(mad_agent->device->dma_device,
-				     ib_get_payload(mad_send_wr),
-				     sge[1].length,
-				     DMA_TO_DEVICE);
+	sge[1].addr = ib_dma_map_single(mad_agent->device,
+					ib_get_payload(mad_send_wr),
+					sge[1].length,
+					DMA_TO_DEVICE);
 	pci_unmap_addr_set(mad_send_wr, payload_mapping, sge[1].addr);
 
 	spin_lock_irqsave(&qp_info->send_queue.lock, flags);
@@ -1027,12 +1027,14 @@ int ib_send_mad(struct ib_mad_send_wr_pr
 	}
 	spin_unlock_irqrestore(&qp_info->send_queue.lock, flags);
 	if (ret) {
-		dma_unmap_single(mad_agent->device->dma_device,
-				 pci_unmap_addr(mad_send_wr, header_mapping),
-				 sge[0].length, DMA_TO_DEVICE);
-		dma_unmap_single(mad_agent->device->dma_device,
-				 pci_unmap_addr(mad_send_wr, payload_mapping),
-				 sge[1].length, DMA_TO_DEVICE);
+		ib_dma_unmap_single(mad_agent->device,
+				    pci_unmap_addr(mad_send_wr,
+						   header_mapping),
+				    sge[0].length, DMA_TO_DEVICE);
+		ib_dma_unmap_single(mad_agent->device,
+				    pci_unmap_addr(mad_send_wr,
+						   payload_mapping),
+				    sge[1].length, DMA_TO_DEVICE);
 	}
 	return ret;
 }
@@ -1851,11 +1853,11 @@ static void ib_mad_recv_done_handler(str
 	mad_priv_hdr = container_of(mad_list, struct ib_mad_private_header,
 				    mad_list);
 	recv = container_of(mad_priv_hdr, struct ib_mad_private, header);
-	dma_unmap_single(port_priv->device->dma_device,
-			 pci_unmap_addr(&recv->header, mapping),
-			 sizeof(struct ib_mad_private) -
-			 sizeof(struct ib_mad_private_header),
-			 DMA_FROM_DEVICE);
+	ib_dma_unmap_single(port_priv->device,
+			    pci_unmap_addr(&recv->header, mapping),
+			    sizeof(struct ib_mad_private) -
+			      sizeof(struct ib_mad_private_header),
+			    DMA_FROM_DEVICE);
 
 	/* Setup MAD receive work completion from "normal" work completion */
 	recv->header.wc = *wc;
@@ -2081,12 +2083,12 @@ static void ib_mad_send_done_handler(str
 	qp_info = send_queue->qp_info;
 
 retry:
-	dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device,
-			 pci_unmap_addr(mad_send_wr, header_mapping),
-			 mad_send_wr->sg_list[0].length, DMA_TO_DEVICE);
-	dma_unmap_single(mad_send_wr->send_buf.mad_agent->device->dma_device,
-			 pci_unmap_addr(mad_send_wr, payload_mapping),
-			 mad_send_wr->sg_list[1].length, DMA_TO_DEVICE);
+	ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device,
+			    pci_unmap_addr(mad_send_wr, header_mapping),
+			    mad_send_wr->sg_list[0].length, DMA_TO_DEVICE);
+	ib_dma_unmap_single(mad_send_wr->send_buf.mad_agent->device,
+			    pci_unmap_addr(mad_send_wr, payload_mapping),
+			    mad_send_wr->sg_list[1].length, DMA_TO_DEVICE);
 	queued_send_wr = NULL;
 	spin_lock_irqsave(&send_queue->lock, flags);
 	list_del(&mad_list->list);
@@ -2527,12 +2529,11 @@ static int ib_mad_post_receive_mads(stru
 				break;
 			}
 		}
-		sg_list.addr = dma_map_single(qp_info->port_priv->
-					        device->dma_device,
-					      &mad_priv->grh,
-					      sizeof *mad_priv -
-					        sizeof mad_priv->header,
-					      DMA_FROM_DEVICE);
+		sg_list.addr = ib_dma_map_single(qp_info->port_priv->device,
+						 &mad_priv->grh,
+						 sizeof *mad_priv -
+						   sizeof mad_priv->header,
+						 DMA_FROM_DEVICE);
 		pci_unmap_addr_set(&mad_priv->header, mapping, sg_list.addr);
 		recv_wr.wr_id = (unsigned long)&mad_priv->header.mad_list;
 		mad_priv->header.mad_list.mad_queue = recv_queue;
@@ -2548,12 +2549,12 @@ static int ib_mad_post_receive_mads(stru
 			list_del(&mad_priv->header.mad_list.list);
 			recv_queue->count--;
 			spin_unlock_irqrestore(&recv_queue->lock, flags);
-			dma_unmap_single(qp_info->port_priv->device->dma_device,
-					 pci_unmap_addr(&mad_priv->header,
-							mapping),
-					 sizeof *mad_priv -
-					   sizeof mad_priv->header,
-					 DMA_FROM_DEVICE);
+			ib_dma_unmap_single(qp_info->port_priv->device,
+					    pci_unmap_addr(&mad_priv->header,
+							   mapping),
+					    sizeof *mad_priv -
+					      sizeof mad_priv->header,
+					    DMA_FROM_DEVICE);
 			kmem_cache_free(ib_mad_cache, mad_priv);
 			printk(KERN_ERR PFX "ib_post_recv failed: %d\n", ret);
 			break;
@@ -2585,11 +2586,11 @@ static void cleanup_recv_queue(struct ib
 		/* Remove from posted receive MAD list */
 		list_del(&mad_list->list);
 
-		dma_unmap_single(qp_info->port_priv->device->dma_device,
-				 pci_unmap_addr(&recv->header, mapping),
-				 sizeof(struct ib_mad_private) -
-				 sizeof(struct ib_mad_private_header),
-				 DMA_FROM_DEVICE);
+		ib_dma_unmap_single(qp_info->port_priv->device,
+				    pci_unmap_addr(&recv->header, mapping),
+				    sizeof(struct ib_mad_private) -
+				      sizeof(struct ib_mad_private_header),
+				    DMA_FROM_DEVICE);
 		kmem_cache_free(ib_mad_cache, recv);
 	}
 
diff -r c76ed2f1387b drivers/infiniband/core/uverbs_mem.c
--- a/drivers/infiniband/core/uverbs_mem.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/core/uverbs_mem.c	Wed Nov 29 13:54:36 2006 -0800
@@ -52,8 +52,8 @@ static void __ib_umem_release(struct ib_
 	int i;
 
 	list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
-		dma_unmap_sg(dev->dma_device, chunk->page_list,
-			     chunk->nents, DMA_BIDIRECTIONAL);
+		ib_dma_unmap_sg(dev, chunk->page_list,
+				chunk->nents, DMA_BIDIRECTIONAL);
 		for (i = 0; i < chunk->nents; ++i) {
 			if (umem->writable && dirty)
 				set_page_dirty_lock(chunk->page_list[i].page);
@@ -136,10 +136,10 @@ int ib_umem_get(struct ib_device *dev, s
 				chunk->page_list[i].length = PAGE_SIZE;
 			}
 
-			chunk->nmap = dma_map_sg(dev->dma_device,
-						 &chunk->page_list[0],
-						 chunk->nents,
-						 DMA_BIDIRECTIONAL);
+			chunk->nmap = ib_dma_map_sg(dev,
+						    &chunk->page_list[0],
+						    chunk->nents,
+						    DMA_BIDIRECTIONAL);
 			if (chunk->nmap <= 0) {
 				for (i = 0; i < chunk->nents; ++i)
 					put_page(chunk->page_list[i].page);


From ralph.campbell at qlogic.com  Wed Dec  6 10:35:56 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Wed, 06 Dec 2006 10:35:56 -0800
Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs DMA
 mapping functions
Message-ID: <1165430156.14800.243.camel@brick.pathscale.com>

This version of the patch adds support for ib_dma_alloc_coherent()
and ib_dma_free_coherent().  It also fixes the bug Or found in
ipath_sync_single_for_cpu() and ipath_sync_single_for_device().


This patch implements the interposing DMA mapping functions to allow
support for IOMMUs and remove the dependence on phys_to_virt().
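
As a rough user-space illustration of the round trip this patch relies
on: the mapping step returns the kernel virtual address unchanged as a
64-bit value, zero is treated as the error value, and the LKEY == 0
path later casts the value straight back to a pointer.  The demo_*
names below are hypothetical and only mirror the shape of
ipath_dma_map_single(), ipath_mapping_error() and the cast in
ipath_lkey_ok(); this is a sketch, not the driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors BAD_DMA_ADDRESS in the patch: zero marks a failed mapping. */
#define DEMO_BAD_DMA_ADDRESS ((uint64_t)0)

/* "Map" a buffer: the DMA address is simply the CPU address. */
static uint64_t demo_ident_map_single(void *cpu_addr)
{
	return (uint64_t)(uintptr_t)cpu_addr;
}

/* Report whether a value returned by the mapping step is an error. */
static int demo_ident_mapping_error(uint64_t dma_addr)
{
	return dma_addr == DEMO_BAD_DMA_ADDRESS;
}

/* What the LKEY == 0 path does with the address: cast it back. */
static void *demo_sge_vaddr(uint64_t sge_addr)
{
	return (void *)(uintptr_t)sge_addr;
}
```

Because no translation table is involved, the value can be used
directly for programmed I/O, which is why bus_to_virt() and
phys_to_virt() are no longer needed on this path.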

From: Ralph Campbell <ralph.campbell at qlogic.com>

diff -r c76ed2f1387b drivers/infiniband/hw/ipath/Makefile
--- a/drivers/infiniband/hw/ipath/Makefile	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/Makefile	Wed Nov 29 13:54:36 2006 -0800
@@ -6,6 +6,7 @@ ib_ipath-y := \
 ib_ipath-y := \
 	ipath_cq.o \
 	ipath_diag.o \
+	ipath_dma.o \
 	ipath_driver.o \
 	ipath_eeprom.o \
 	ipath_file_ops.o \
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_keys.c
--- a/drivers/infiniband/hw/ipath/ipath_keys.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_keys.c	Wed Nov 29 13:54:36 2006 -0800
@@ -134,7 +134,7 @@ int ipath_lkey_ok(struct ipath_qp *qp, s
 	 */
 	if (sge->lkey == 0) {
 		isge->mr = NULL;
-		isge->vaddr = bus_to_virt(sge->addr);
+		isge->vaddr = (void *) sge->addr;
 		isge->length = sge->length;
 		isge->sge_length = sge->length;
 		ret = 1;
@@ -202,12 +202,12 @@ int ipath_rkey_ok(struct ipath_qp *qp, s
 	int ret;
 
 	/*
-	 * We use RKEY == zero for physical addresses
-	 * (see ipath_get_dma_mr).
+	 * We use RKEY == zero for kernel virtual addresses
+	 * (see ipath_get_dma_mr and ipath_dma.c).
 	 */
 	if (rkey == 0) {
 		sge->mr = NULL;
-		sge->vaddr = phys_to_virt(vaddr);
+		sge->vaddr = (void *) vaddr;
 		sge->length = len;
 		sge->sge_length = len;
 		ss->sg_list = NULL;
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Wed Nov 29 13:54:37 2006 -0800
@@ -54,6 +54,8 @@ static inline struct ipath_fmr *to_ifmr(
  * @acc: access flags
  *
  * Returns the memory region on success, otherwise returns an errno.
+ * Note that all DMA addresses should be created via the
+ * struct ib_dma_mapping_ops functions (see ipath_dma.c).
  */
 struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc)
 {
@@ -149,8 +151,7 @@ struct ib_mr *ipath_reg_phys_mr(struct i
 	m = 0;
 	n = 0;
 	for (i = 0; i < num_phys_buf; i++) {
-		mr->mr.map[m]->segs[n].vaddr =
-			phys_to_virt(buffer_list[i].addr);
+		mr->mr.map[m]->segs[n].vaddr = (void *) buffer_list[i].addr;
 		mr->mr.map[m]->segs[n].length = buffer_list[i].size;
 		mr->mr.length += buffer_list[i].size;
 		n++;
@@ -347,7 +348,7 @@ int ipath_map_phys_fmr(struct ib_fmr *ib
 	n = 0;
 	ps = 1 << fmr->page_shift;
 	for (i = 0; i < list_len; i++) {
-		fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]);
+		fmr->mr.map[m]->segs[n].vaddr = (void *) page_list[i];
 		fmr->mr.map[m]->segs[n].length = ps;
 		if (++n == IPATH_SEGSZ) {
 			m++;
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Nov 29 13:54:37 2006 -0800
@@ -1599,6 +1599,7 @@ int ipath_register_ib_device(struct ipat
 	dev->detach_mcast = ipath_multicast_detach;
 	dev->process_mad = ipath_process_mad;
 	dev->mmap = ipath_mmap;
+	dev->dma_ops = &ipath_dma_mapping_ops;
 
 	snprintf(dev->node_desc, sizeof(dev->node_desc),
 		 IPATH_IDSTR " %s", init_utsname()->nodename);
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Wed Nov 29 13:54:37 2006 -0800
@@ -812,4 +812,6 @@ extern unsigned int ib_ipath_max_srq_wrs
 
 extern const u32 ib_ipath_rnr_table[];
 
+extern struct ib_dma_mapping_ops ipath_dma_mapping_ops;
+
 #endif				/* IPATH_VERBS_H */
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Tue Dec 05 16:04:53 2006 -0800
@@ -0,0 +1,262 @@
+/*
+ * Copyright (c) 2006 QLogic, Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_verbs.h>
+
+#include "ipath_verbs.h"
+
+#define BAD_DMA_ADDRESS ((u64) 0)
+
+/**
+ * ipath_mapping_error - check a DMA address for error
+ * @dev: The device for which the dma_addr was created
+ * @dma_addr: The DMA address to check
+ */
+static int ipath_mapping_error(struct ib_device *dev, u64 dma_addr)
+{
+	return dma_addr == BAD_DMA_ADDRESS;
+}
+
+/**
+ * ipath_dma_map_single - Map a kernel virtual address to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @cpu_addr: The kernel virtual address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static u64 ipath_dma_map_single(struct ib_device *dev,
+			        void *cpu_addr, size_t size,
+			        enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+	return (u64) cpu_addr;
+}
+
+/**
+ * ipath_dma_unmap_single - Destroy a mapping created by ipath_dma_map_single()
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static void ipath_dma_unmap_single(struct ib_device *dev,
+				   u64 addr, size_t size,
+				   enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+/**
+ * ipath_dma_map_page - Map a physical page to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static u64 ipath_dma_map_page(struct ib_device *dev,
+			      struct page *page,
+			      unsigned long offset,
+			      size_t size,
+			      enum dma_data_direction direction)
+{
+	u64 addr;
+
+	BUG_ON(!valid_dma_direction(direction));
+
+	if (offset + size > PAGE_SIZE) {
+		addr = BAD_DMA_ADDRESS;
+		goto done;
+	}
+
+	addr = (u64) page_address(page);
+	if (addr)
+		addr += offset;
+	/* TODO: handle highmem pages */
+
+done:
+	return addr;
+}
+
+/**
+ * ipath_dma_unmap_page - Destroy a mapping created by ipath_dma_map_page()
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @direction: The direction of the DMA
+ */
+static void ipath_dma_unmap_page(struct ib_device *dev,
+				 u64 addr, size_t size,
+				 enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+/**
+ * ipath_map_sg - Map a scatter/gather list to DMA addresses
+ * @dev: The device for which the DMA addresses are to be created
+ * @sg: The array of scatter/gather entries
+ * @nents: The number of scatter/gather entries
+ * @direction: The direction of the DMA
+ */
+static int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg,
+			int nents, enum dma_data_direction direction)
+{
+	u64 addr;
+	int i;
+	int ret = nents;
+
+	BUG_ON(!valid_dma_direction(direction));
+
+	for (i = 0; i < nents; i++) {
+		addr = (u64) page_address(sg[i].page);
+		/* TODO: handle highmem pages */
+		if (!addr) {
+			ret = 0;
+			break;
+		}
+	}
+	return ret;
+}
+
+/**
+ * ipath_unmap_sg - Unmap a scatter/gather list of DMA addresses
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The array of scatter/gather entries
+ * @nents: The number of scatter/gather entries
+ * @direction: The direction of the DMA
+ */
+static void ipath_unmap_sg(struct ib_device *dev,
+			   struct scatterlist *sg, int nents,
+			   enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+/**
+ * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The scatter/gather entry
+ */
+static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
+{
+	return (u64) page_address(sg->page);
+}
+
+/**
+ * ipath_sg_dma_len - Return the DMA length from a scatter/gather entry
+ * @dev: The device for which the DMA addresses were created
+ * @sg: The scatter/gather entry
+ */
+static unsigned int ipath_sg_dma_len(struct ib_device *dev,
+				     struct scatterlist *sg)
+{
+	return sg->length;
+}
+
+/**
+ * ipath_sync_single_for_cpu - Prepare DMA region to be accessed by CPU
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @dir: The direction of the DMA
+ */
+static void ipath_sync_single_for_cpu(struct ib_device *dev,
+				      u64 addr,
+				      size_t size,
+				      enum dma_data_direction dir)
+{
+}
+
+/**
+ * ipath_sync_single_for_device - Prepare DMA region to be accessed by device
+ * @dev: The device for which the DMA address was created
+ * @addr: The DMA address
+ * @size: The size of the region in bytes
+ * @dir: The direction of the DMA
+ */
+static void ipath_sync_single_for_device(struct ib_device *dev,
+					 u64 addr,
+					 size_t size,
+					 enum dma_data_direction dir)
+{
+}
+
+/**
+ * ipath_dma_alloc_coherent - Allocate memory and map it for DMA
+ * @dev: The device for which the DMA address is requested
+ * @size: The size of the region to allocate in bytes
+ * @dma_handle: A pointer for returning the DMA address of the region
+ * @flag: memory allocator flags
+ */
+static void *ipath_dma_alloc_coherent(struct ib_device *dev, size_t size,
+				      u64 *dma_handle, gfp_t flag)
+{
+	struct page *p;
+	void *addr = NULL;
+
+	p = alloc_pages(flag, get_order(size));
+	if (p)
+		addr = page_address(p);
+	if (dma_handle)
+		*dma_handle = (u64) addr;
+	return addr;
+}
+
+/**
+ * ipath_dma_free_coherent - Free memory allocated by ib_dma_alloc_coherent()
+ * @dev: The device for which the DMA addresses were allocated
+ * @size: The size of the region
+ * @cpu_addr: the address returned by ib_dma_alloc_coherent()
+ * @dma_handle: the DMA address returned by ib_dma_alloc_coherent()
+ */
+static void ipath_dma_free_coherent(struct ib_device *dev, size_t size,
+				    void *cpu_addr, dma_addr_t dma_handle)
+{
+	free_pages((unsigned long) cpu_addr, get_order(size));
+}
+
+struct ib_dma_mapping_ops ipath_dma_mapping_ops = {
+	ipath_mapping_error,
+	ipath_dma_map_single,
+	ipath_dma_unmap_single,
+	ipath_dma_map_page,
+	ipath_dma_unmap_page,
+	ipath_map_sg,
+	ipath_unmap_sg,
+	ipath_sg_dma_address,
+	ipath_sg_dma_len,
+	ipath_sync_single_for_cpu,
+	ipath_sync_single_for_device,
+	ipath_dma_alloc_coherent,
+	ipath_dma_free_coherent
+};


From shubbell at dbresearch.net  Wed Dec  6 10:48:47 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Wed, 06 Dec 2006 12:48:47 -0600
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <1165429589.25587.136986.camel@hal.voltaire.com>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
Message-ID: <4577108F.9080308@dbresearch.net>

Hal Rosenstock wrote:
> Hi Sean,
>
> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
>   
>> Hello,
>>
>>   I was testing our code and noticed that when I send data using 
>> multicast over our ib0 interface, all of the infiniband switches route 
>> the data to each switch and each node instead of a node that has an 
>> application listening to the interface like Ethernet. Is this by design?
>>     
>
> It depends on what multicast group is being used and which end nodes
> have registered for that group as to where the data is routed.
>
> -- Hal
>   
Hey Hal,

  The multicast group I am sending data to is 224.10.10.x (not 
224.0.0.x) and I have no clients / nodes listening but the data is still 
being sent. I am using wwtop from warewulf to view the network load for 
each node. Does this make sense?

Sean


From xma at us.ibm.com  Wed Dec  6 11:06:31 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Wed, 6 Dec 2006 11:06:31 -0800
Subject: [openib-general] [PATCH/RFC 1/2] IB: Return
 "maybe_missed_event" hint from ib_req_notify_cq()
In-Reply-To: <ada3b84xiob.fsf@cisco.com>
Message-ID: <OFDE9B734C.B32B1C33-ON8725723C.006810AA-8825723C.0068F5B7@us.ibm.com>


Hi, Roland,

We have found missing interrupts in the ehca driver's non-scaling code. We
are testing the patch now. I will let you know as soon as it passes the test.

Does your patch use netif_receive_skb() or netif_rx_ni() in the IPoIB
receive path? I haven't looked at your most recent git tree yet. If it's
netif_rx_ni(), that's wrong. NAPI should avoid the IP backlog queue.

As we discussed before, I suggested using return (unlikely(missed_event) &
netif_reschedule_rx()) instead of going back to poll the CQ again and again.
ehca delivers packets too fast: according to my debug output, I could get
up to 58 missed_events between notify_cq and netif_reschedule_rx() to exit
from NAPI poll.

Sorry to block your NAPI patch for so long. Are you still planning to use
NAPI as the default, or as a configuration option? As Michael pointed out,
in some situations (such as heavy load) NAPI might not be a good choice.

Thanks
Shirley Ma

From elsen_david at yahoo.com  Wed Dec  6 11:17:53 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:17:53 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165429652.25183.16.camel@stevo-desktop>
Message-ID: <742095.47380.qm@web58014.mail.re3.yahoo.com>

Steve,
   
  Thanks a lot for the reply. 
   
  I could run the cpi from the example directory. 
   
  But I see some error message when trying to run the IMB-MPI1. I am using 219297_IMB_2.3. Which version are you using?
   
  David

Steve Wise <swise at opengridcomputing.com> wrote:
  On Wed, 2006-12-06 at 10:03 -0800, david elsen wrote:
> Shaun / Steve,
> 
> To get past the "librdmacm.so: cannot open shared object file: No such
> file or directory" error message, LD_RUN_PATH also needs to be set.
> 
> Anyway, now that I am able to run the mvapich2 0.9.8-Release, I am trying
> to figure out how to run the various benchmark tests using this MPI
> tool.
> 
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also
> want to run the OSC benchmark tests. Are any guidelines available for
> these, please?
> Thanks,
> David

I've run IMB benchmarks (aka pallas) on mvapich2 0.9.8 over iwarp. The
mvapich2 user guide explains how to start up mpd daemons and use
mpiexec. It's fairly straightforward. You need ssh or rsh access and
you need to set up a few files.

Then pull down IMB and build it.

To run 2 node IMB-MPI1 tests, you do something like this:

$ mpdboot -n 2
$ mpiexec -n 2 
/IMB-MPI1 

This will run the entire MPI1 suite.


Steve.



From swise at opengridcomputing.com  Wed Dec  6 11:23:19 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 06 Dec 2006 13:23:19 -0600
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <742095.47380.qm@web58014.mail.re3.yahoo.com>
References: <742095.47380.qm@web58014.mail.re3.yahoo.com>
Message-ID: <1165432999.25183.26.camel@stevo-desktop>

On Wed, 2006-12-06 at 11:17 -0800, david elsen wrote:
> Steve,
>  
> Thanks a lot for the reply. 
>  
> I could run the cpi from the example directory. 
>  
> But I see some error message when trying to run the IMB-MPI1. I am
> using 219297_IMB_2.3. Which version are you using?

I'm running the same release.

Steve.


From rowland at cse.ohio-state.edu  Wed Dec  6 11:16:46 2006
From: rowland at cse.ohio-state.edu (Shaun Rowland)
Date: Wed, 06 Dec 2006 14:16:46 -0500
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
References: <20061206180350.2306.qmail@web58004.mail.re3.yahoo.com>
Message-ID: <4577171E.6060205@cse.ohio-state.edu>

david elsen wrote:
> Shaun / Steve,
> 
> To get past the "librdmacm.so: cannot open shared object file: No such
> file or directory" error message, LD_RUN_PATH also needs to be set.
> 
> Anyway, now that I am able to run the mvapich2 0.9.8-Release, I am trying 
> to figure out how to run the various benchmark tests using this MPI tool.
> 
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also want 
> to run the OSC benchmark tests. Are any guidelines available for these, please?
> Thanks,

We run these tests. For IMB (Pallas), you can look at the
doc/ReadMe_IMB.txt in the source to see more details. For the OSU
benchmarks, you can simply build them with mpicc and run them on 2 nodes
or 1 node.
-- 
Shaun Rowland	rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/


From elsen_david at yahoo.com  Wed Dec  6 11:30:52 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:30:52 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <1165432999.25183.26.camel@stevo-desktop>
Message-ID: <957376.73440.qm@web58010.mail.re3.yahoo.com>

Steve,
Somehow I get the following error message:

[0] Abort: [] Got completion with error 5, vendor code=a, dest rank=1
  at line 479 in file ibv_channel_manager.c
 [1] Abort: ibv_post_recv err with 22 at line 1420 in file rdma_iba_priv.c
 rank 1 in job 1  ammasso1_50414   caused collective abort of all ranks
   exit status of rank 1: killed by signal 9 


For details, please see the following:
[root at ammasso1 0.9.8-RELEASE]# vi /etc/hosts
[root at ammasso1 0.9.8-RELEASE]# cd bin
[root at ammasso1 bin]# ./mpdboot -n 2
debug: starting
mpdroot: perror msg: Connection refused
running mpdallexit on ammasso1
LAUNCHED mpd on ammasso1  via  
debug: launch cmd= /root/0.9.8-RELEASE/bin/mpd.py   --ncpus=1 -e -d
debug: mpd on ammasso1  on port 50414
RUNNING: mpd on ammasso1
debug: info for running mpd: {'ncpus': 1, 'list_port': 50414, 'entry_port': '', 'host': 'ammasso1', 'entry_host': '', 'ifhn': ''}
LAUNCHED mpd on ammasso2  via  ammasso1
debug: launch cmd= ssh -x -n ammasso2.
 '/root/0.9.8-RELEASE/bin/mpd.py  -h ammasso1 -p 50414  --ncpus=1 -e -d' 
root at ammasso2.'s password: 
debug: mpd on ammasso2  on port 59327
RUNNING: mpd on ammasso2
debug: info for running mpd: {'entry_port': 50414, 'ncpus': 1, 'list_port': 59327, 'pid': 2997, 'host': 'ammasso2., 'entry_host': 'ammasso1', 'ifhn': ''}


[root at ammasso1 bin]# ./mpiexec -n 2 /root/IMB_2.3/src/IMB-MPI1
secretword=
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V2.3, MPI-1 part    
#---------------------------------------------------
# Date       : Wed Dec  6 13:25:59 2006
# Machine    : i686
# System     : Linux
# Release    : 2.6.17.13
# Version    : #1 SMP Wed Nov 8 17:34:14 PST 2006

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE 
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM  
#
#

# List of Benchmarks to run:

# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier
recv desc error, 128
[0] Abort: [] Got completion with error 5, vendor code=a, dest rank=1
 at line 479 in file ibv_channel_manager.c
[1] Abort: ibv_post_recv err with 22 at line 1420 in file rdma_iba_priv.c
rank 1 in job 1  ammasso1_50414   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9 

David


Steve Wise <swise at opengridcomputing.com> wrote: On Wed, 2006-12-06 at 11:17 -0800, david elsen wrote:
> Steve,
>  
> Thanks a lot for the reply. 
>  
> I could run the cpi from the example directory. 
>  
> But I see some error message when trying to run the IMB-MPI1. I am
> using 219297_IMB_2.3. Which version are you using?

I'm running the same release.

Steve.



From elsen_david at yahoo.com  Wed Dec  6 11:40:55 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 11:40:55 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <4577171E.6060205@cse.ohio-state.edu>
Message-ID: <340888.34199.qm@web58004.mail.re3.yahoo.com>

Shaun,

I tried this and am getting some error messages:

Please see following:
[root at ammasso2 osu_benchmarks]# mpicc osu_latency.c
[root at ammasso2 osu_benchmarks]# ls
a.out              osu_bw.c           osu_latency.c     osu_put_bw.c
osu_acc_latency.c  osu_get_bw.c       osu_latency_mt.c  osu_put_latency.c
osu_bibw.c         osu_get_latency.c  osu_put_bibw.c
[root at ammasso2 osu_benchmarks]# ./a.out
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 
:
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 
:
[0] Abort: PMI Lookup name failed
 at line 519 in file rdma_cm.c

I get a similar error message for all the tests:
[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/pingping 
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 
:
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 
:
[0] Abort: PMI Lookup name failed
 at line 519 in file rdma_cm.c
[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend
bash: /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend: No such file or directory
[root at ammasso2 pt2pt]# /root/0.9.8-RELEASE/test/mpi/pt2pt/bsend1
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_0 key=0 
:
system msg for write_line failure : Bad file descriptor
[unset]: got unexpected response to get :cmd=get kvsname=singinit_kvs_0 key=0 
:
[0] Abort: PMI Lookup name failed
 at line 519 in file rdma_cm.c
[root at ammasso2 pt2pt]# 

David

Shaun Rowland <rowland at cse.ohio-state.edu> wrote: david elsen wrote:
> Shaun / Steve,
> 
> To get past the "librdmacm.so: cannot open shared object file: No such
> file or directory" error message, LD_RUN_PATH also needs to be set.
> 
> Anyway, now that I am able to run the mvapich2 0.9.8-Release, I am trying 
> to figure out how to run the various benchmark tests using this MPI tool.
> 
> Has anyone run the Pallas tool with the OSC MPI or OpenMPI? I also want 
> to run the OSC benchmark tests. Are any guidelines available for these, please?
> Thanks,

We run these tests. For IMB (Pallas), you can look at the
doc/ReadMe_IMB.txt in the source to see more details. For the OSU
benchmarks, you can simply build them with mpicc and run them on 2 nodes
or 1 node.
-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/



From steve.apo at googlemail.com  Wed Dec  6 11:45:54 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Wed, 6 Dec 2006 19:45:54 +0000
Subject: [openib-general] [CM] ib_cm_sens_req() returns -1. What could
 be wrong?
In-Reply-To: <2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com>
References: <2cfcf21e0612050711y274ea297r1f599affcff0468e@mail.gmail.com>
	<4575B9CB.5070507@ichips.intel.com>
	<2cfcf21e0612051128k59f32e99u42cd7e761063786f@mail.gmail.com>
Message-ID: <2cfcf21e0612061145i346b99e8n9074218547947aec@mail.gmail.com>

Hi Sean,

Thanks for the tip. I wasn't setting the QP type properly. Fixed now.

Cheers,

Steve.

On 05/12/06, Steven Wooding <steve.apo at googlemail.com> wrote:
>
> Hi Sean,
>
> Yeah, in my second post I said that errno was EINVAL just after the
> ib_cm_send_req() call, which I assume was from the write() call. Or did you
> mean something else?
>
> Steve.
>
> On 05/12/06, Sean Hefty <mshefty at ichips.intel.com> wrote:
> >
> > > In my application I keep getting -1 returned by a call to
> > > ib_cm_send_req() function. The cmpost example application works fine,
> > so
> > > I can rule out system set-up issues.
> >
> > This is probably an error being returned from the kernel.  Does errno
> > give any
> > more insight?
> >
> > - Sean
> >
>
>

From halr at voltaire.com  Wed Dec  6 11:52:07 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 14:52:07 -0500
Subject: [openib-general] osm: More simulation faiures on trunk
In-Reply-To: <45770599.7080005@mellanox.co.il>
References: <45770599.7080005@mellanox.co.il>
Message-ID: <1165434717.25587.140668.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote:
> Hi Hal,
> 
> Looks like the osm.fdbs file is now created with "UNREACHABLE" mark when 
> opensm
> is invoked with updn routing engine.

Are you referring to certain LIDs being UNREACHABLE like this:
LID    : Port : Hops : Optimal
0x0001 : UNREACHABLE
0x0002 : UNREACHABLE
0x0003 : 000  : 00   : yes
0x0004 : 001  : 02   : yes
0x0005 : 003  : 02   : yes
0x0006 : 001  : 01   : yes
0x0007 : UNREACHABLE
0x0008 : UNREACHABLE
0x0009 : UNREACHABLE
0x000A : 001  : 02   : yes
0x000B : 003  : 02   : yes

So should UNREACHABLE LIDs just not be put into the file ? Or is it
something else ?

> I will be working on finding what changed between OFED 1.1 and the trunk.

It was likely introduced by the changes to the routing engines committed
yesterday and sent to the list in late November. git-bisect can help
isolate exactly which change.

> This is another cause for the failure of all osmMulticastRoutingTest and 
> osmStability tests runs.

> Another one would be the change of the osm.mcfdbs which is parsed by 
> IBDM too.

Are you asking about the other patch again ?

-- Hal

> Eitan
> 


From halr at voltaire.com  Wed Dec  6 12:04:07 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 15:04:07 -0500
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <4577108F.9080308@dbresearch.net>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
	<4577108F.9080308@dbresearch.net>
Message-ID: <1165435407.25587.141052.camel@hal.voltaire.com>

On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote:
> Hal Rosenstock wrote:
> > Hi Sean,
> >
> > On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
> >   
> >> Hello,
> >>
> >>   I was testing our code and noticed that when I send data using 
> >> multicast over our ib0 interface, all of the infiniband switches route 
> >> the data to each switch and each node instead of a node that has an 
> >> application listening to the interface like Ethernet. Is this by design?
> >>     
> >
> > It depends on what multicast group is being used and which end nodes
> > have registered for that group as to where the data is routed.
> >
> > -- Hal
> >   
> Hey Hal,
> 
>   The multicast group I am sending data to is 224.10.10.x (not 
> 224.0.0.x) and I have no clients / nodes listening but the data is still 
> being sent.

Yes, if there is only a sender, the data should not be routed anywhere.

>  I am using wwtop from warewulf to view the network load for 
> each node.

I'm not familiar with those tools.

>  Does this make sense?

Nope. To state the obvious, something is not as it seems...

Can you state which SM you are using ?

Also, can you do the following:
saquery -g
saquery -m
and send me the output.
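
If saquery is installed on an end node, the two queries above could be
captured to files along these lines. This is only a sketch: the function
name collect_mcast_info and the output file names are made up for
illustration, and it assumes saquery is on the PATH.

```shell
#!/bin/sh
# Sketch: save the output of "saquery -g" and "saquery -m" so it can be
# mailed. collect_mcast_info and the file names below are hypothetical.

collect_mcast_info() {
    outdir=${1:-.}

    # Bail out early if saquery is not installed on this node.
    if ! command -v saquery >/dev/null 2>&1; then
        echo "saquery not found in PATH" >&2
        return 1
    fi

    # -g: multicast group records, -m: multicast member records
    saquery -g > "$outdir/saquery-groups.txt" 2>&1
    saquery -m > "$outdir/saquery-members.txt" 2>&1
}
```

Calling collect_mcast_info /tmp would leave the two reports in /tmp; with
no argument the current directory is used.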

I may have some more experiments once I get that level of info.

-- Hal

> Sean


From rowland at cse.ohio-state.edu  Wed Dec  6 12:42:10 2006
From: rowland at cse.ohio-state.edu (Shaun Rowland)
Date: Wed, 06 Dec 2006 15:42:10 -0500
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <340888.34199.qm@web58004.mail.re3.yahoo.com>
References: <340888.34199.qm@web58004.mail.re3.yahoo.com>
Message-ID: <45772B22.2090702@cse.ohio-state.edu>

david elsen wrote:
> Shaun,
> 
> I tried this and am getting some error messages:
> 
> Please see following:
> [root at ammasso2 osu_benchmarks]# mpicc osu_latency.c
> [root at ammasso2 osu_benchmarks]# ls
> a.out              osu_bw.c           osu_latency.c     osu_put_bw.c
> osu_acc_latency.c  osu_get_bw.c       osu_latency_mt.c  osu_put_latency.c
> osu_bibw.c         osu_get_latency.c  osu_put_bibw.c
> [root at ammasso2 osu_benchmarks]# ./a.out

You need to execute these with mpiexec after starting mpdboot, so the
process would be something like:

mpicc -o osu_lat osu_latency.c
mpdboot -n 2 -f hosts
mpiexec -n 2 ./osu_lat
....
mpdallexit

As detailed in the User Guide:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-170005.2

You should also see this section of the User Guide if you have problems
with iWARP:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-400007.3

Also, this section describes using iWARP with MVAPICH2:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-230005.8

Have you set up everything (like /etc/mv2.conf)? Are you using
the environment variable MV2_USE_RDMA_CM as described above? With the
mpiexec command, it should be enough to export this variable to a value
of 1 in the same environment in which you execute mpiexec - this will
automatically propagate to the processes on remote machines.
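
The sequence above, with MV2_USE_RDMA_CM exported in the same shell,
might be wrapped roughly as follows. This is a sketch, not something
taken from the User Guide: the run and run_osu_latency helpers and the
DRY_RUN switch are hypothetical, and a file named hosts is assumed to
exist in the current directory.

```shell
#!/bin/sh
# Sketch: build and run osu_latency on 2 nodes under mpd.
# run, run_osu_latency and DRY_RUN are hypothetical helpers for illustration.

# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_osu_latency() {
    # Exported in the same environment that mpiexec is started from.
    export MV2_USE_RDMA_CM=1

    run mpicc -o osu_lat osu_latency.c || return 1
    run mpdboot -n 2 -f hosts || return 1

    run mpiexec -n 2 ./osu_lat
    status=$?

    # Shut down the mpd ring whether or not the benchmark succeeded.
    run mpdallexit
    return "$status"
}
```

Running DRY_RUN=1 run_osu_latency only prints the four commands, which
is a cheap way to check the sequence before touching the cluster.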
-- 
Shaun Rowland	rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/


From shubbell at dbresearch.net  Wed Dec  6 13:06:13 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Wed, 06 Dec 2006 15:06:13 -0600
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <1165435407.25587.141052.camel@hal.voltaire.com>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
	<4577108F.9080308@dbresearch.net>
	<1165435407.25587.141052.camel@hal.voltaire.com>
Message-ID: <457730C5.9000902@dbresearch.net>

Hal Rosenstock wrote:
> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote:
>   
>> Hal Rosenstock wrote:
>>     
>>> Hi Sean,
>>>
>>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
>>>   
>>>       
>>>> Hello,
>>>>
>>>>   I was testing our code and noticed that when I send data using 
>>>> multicast over our ib0 interface, all of the infiniband switches route 
>>>> the data to each switch and each node instead of a node that has an 
>>>> application listening to the interface like Ethernet. Is this by design?
>>>>     
>>>>         
>>> It depends on what multicast group is being used and which end nodes
>>> have registered for that group as to where the data is routed.
>>>
>>> -- Hal
>>>   
>>>       
>> Hey Hal,
>>
>>   The multicast group I am sending data to is 224.10.10.x (not 
>> 224.0.0.x) and I have no clients / nodes listening but the data is still 
>> being sent.
>>     
>
> Yes, if there is only a sender, the data should not be routed anywhere.
>
>   
>>  I am using wwtop from warewulf to view the network load for 
>> each node.
>>     
>
> I'm not familiar with those tools.
>
>   
>>  Does this make sense?
>>     
>
> Nope. To state the obvious, something is not as it seems...
>
> Can you state which SM you are using ?
>
> Also, can you do the following:
> saquery -g
> saquery -m
> and send me the output.
>
> I may have some more experiments once I get that level of info.
>
> -- Hal
>   
We have a Voltaire HW subnet manager. I do not have the saquery command. 
I'll have to find this and install it. Would the web interface help?

Sean


From vishal at endace.com  Wed Dec  6 13:23:17 2006
From: vishal at endace.com (vishal)
Date: Thu, 07 Dec 2006 10:23:17 +1300
Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem
Message-ID: <1165440197.2894.5.camel@julia.et.endace.com>

Hi,

      I was trying to install IBGOLD on Red Hat 4 (x86_64), and the
following is the 'error' part from a log file. I couldn't find the
-Xcompiler option in the gcc manual. Am I missing something ?

configure:2466: $? = 0
configure:2468: gcc -v </dev/null >&5
Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=x86_64-redhat-linux
Thread model: posix
gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
configure:2471: $? = 0
configure:2473: gcc -V </dev/null >&5
gcc: `-V' option must have argument
configure:2476: $? = 1
configure:2499: checking for C compiler default output file name
configure:2502: gcc -m32  -m32 -Xcompiler -m32 conftest.c  >&5
gcc: unrecognized option `-Xcompiler'
/usr/bin/ld: crt1.o: No such file: No such file or directory
collect2: ld returned 1 exit status


Thanks!

Vishal


From halr at voltaire.com  Wed Dec  6 13:38:54 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 16:38:54 -0500
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <457730C5.9000902@dbresearch.net>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
	<4577108F.9080308@dbresearch.net>
	<1165435407.25587.141052.camel@hal.voltaire.com>
	<457730C5.9000902@dbresearch.net>
Message-ID: <1165441086.25587.144751.camel@hal.voltaire.com>

On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote:
> Hal Rosenstock wrote:
> > On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote:
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> Hi Sean,
> >>>
> >>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
> >>>   
> >>>       
> >>>> Hello,
> >>>>
> >>>>   I was testing our code and noticed that when I send data using 
> >>>> multicast over our ib0 interface, all of the infiniband switches route 
> >>>> the data to each switch and each node instead of a node that has an 
> >>>> application listening to the interface like Ethernet. Is this by design?
> >>>>     
> >>>>         
> >>> It depends on what multicast group is being used and which end nodes
> >>> have registered for that group as to where the data is routed.
> >>>
> >>> -- Hal
> >>>   
> >>>       
> >> Hey Hal,
> >>
> >>   The multicast group I am sending data to is 224.10.10.x (not 
> >> 224.0.0.x) and I have no clients / nodes listening but the data is still 
> >> being sent.
> >>     
> >
> > Yes, if there is only a sender, the data should not be routed anywhere.
> >
> >   
> >>  I am using wwtop from warewulf to view the network load for 
> >> each node.
> >>     
> >
> > I'm not familiar with those tools.
> >
> >   
> >>  Does this make sense?
> >>     
> >
> > Nope. To state the obvious, something is not as it seems...
> >
> > Can you state which SM you are using ?
> >
> > Also, can you do the following:
> > saquery -g
> > saquery -m
> > and send me the output.
> >
> > I may have some more experiments once I get that level of info.
> >
> > -- Hal
> >   
> We have a Voltaire HW subnet manager. I do not have the saquery command. 
> I'll have to find this and install it.

What is running on your end nodes? Is it OpenIB/OFED or something else?
If it is OpenIB/OFED, saquery should be there. I think OFED 1.2
supports the options I mentioned.

>  Would the web interface help?

Not sure whether there is anything there for this.

-- Hal

> 
> Sean


From elsen_david at yahoo.com  Wed Dec  6 13:44:49 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 13:44:49 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <45772B22.2090702@cse.ohio-state.edu>
Message-ID: <336295.54543.qm@web58008.mail.re3.yahoo.com>

oops, sorry, my fault. I will try it again.

Shaun Rowland <rowland at cse.ohio-state.edu> wrote:
david elsen wrote:
> Shaun,
> 
> I tried this and am getting some error messages:
> 
> Please see following:
> [root at ammasso2 osu_benchmarks]# mpicc osu_latency.c
> [root at ammasso2 osu_benchmarks]# ls
> a.out              osu_bw.c           osu_latency.c     osu_put_bw.c
> osu_acc_latency.c  osu_get_bw.c       osu_latency_mt.c  osu_put_latency.c
> osu_bibw.c         osu_get_latency.c  osu_put_bibw.c
> [root at ammasso2 osu_benchmarks]# ./a.out

You need to execute these with mpiexec after starting mpdboot, so the
process would be something like:

mpicc -o osu_lat osu_latency.c
mpdboot -n 2 -f hosts
mpiexec -n 2 ./osu_lat
....
mpdallexit

As detailed in the User Guide:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-170005.2

You should also see this section of the User Guide if you have problems
with iWARP:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-400007.3

Also, this section describes using iWARP with MVAPICH2:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html#x1-230005.8

Have you set up everything (like /etc/mv2.conf)? Are you using
the environment variable MV2_USE_RDMA_CM as described above? With the
mpiexec command, it should be enough to export this variable to a value
of 1 in the same environment in which you execute mpiexec - this will
automatically propagate to the processes on remote machines.
-- 
Shaun Rowland rowland at cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/



From elsen_david at yahoo.com  Wed Dec  6 13:50:26 2006
From: elsen_david at yahoo.com (david elsen)
Date: Wed, 6 Dec 2006 13:50:26 -0800 (PST)
Subject: [openib-general] openMPI for 2.6.17.10 kernel
In-Reply-To: <336295.54543.qm@web58008.mail.re3.yahoo.com>
Message-ID: <24626.54501.qm@web58007.mail.re3.yahoo.com>

Shaun/Steve,
I was able to run osu_lat on my setup with two Ammasso cards.
Thanks a lot for the help,
David

From eitan at mellanox.co.il  Wed Dec  6 13:56:25 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 06 Dec 2006 23:56:25 +0200
Subject: [openib-general] osm: More simulation faiures on trunk
In-Reply-To: <1165434717.25587.140668.camel@hal.voltaire.com>
References: <45770599.7080005@mellanox.co.il>
	<1165434717.25587.140668.camel@hal.voltaire.com>
Message-ID: <45773C89.9060901@mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote:
>   
>> Hi Hal,
>>
>> Looks like the osm.fdbs file is now created with "UNREACHABLE" mark when 
>> opensm
>> is invoked with updn routing engine.
>>     
>
> Are you referring to certain LIDs being UNREACHABLE like this:
> LID    : Port : Hops : Optimal
> 0x0001 : UNREACHABLE
> 0x0002 : UNREACHABLE
> 0x0003 : 000  : 00   : yes
> 0x0004 : 001  : 02   : yes
> 0x0005 : 003  : 02   : yes
> 0x0006 : 001  : 01   : yes
> 0x0007 : UNREACHABLE
> 0x0008 : UNREACHABLE
> 0x0009 : UNREACHABLE
> 0x000A : 001  : 02   : yes
> 0x000B : 003  : 02   : yes
>
> So should UNREACHABLE LIDs just not be put into the file ? Or is it
> something else ?
>
>   
The UNREACHABLE is fine.
The problem is that ALL LFTs are full of UNREACHABLE. Actually there are 
no reachable nodes ...
>> I will be working on finding what changed between OFED 1.1 and the trunk.
>>     
>
> It was likely introduced by the changes to the routing engines committed
> yesterday and sent on the list in late November. git-bisect can help
> isolate exactly which change.
>   
Thanks I will follow that trail
>   
>> This is another cause for the failure of all osmMulticastRoutingTest and 
>> osmStability tests runs.
>>     
>
>   
>> Another one would be the change of the osm.mcfdbs which is parsed by 
>> IBDM too.
>>     
>
> Are you asking about the other patch again ?
>   
Do you have an estimate for when my patch will be merged ?

> -- Hal
>
>   
>> Eitan
>>
>>     
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From halr at voltaire.com  Wed Dec  6 14:15:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 17:15:44 -0500
Subject: [openib-general] [PATCH][MINOR] OpenSM/osm_sa_informinfo.c:
 Conformance changes for subscribe component
Message-ID: <1165443320.25587.146153.camel@hal.voltaire.com>

OpenSM/osm_sa_informinfo.c: Conformance changes for subscribe component

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c
index ad705b5..5d81b84 100644
--- a/osm/opensm/osm_sa_informinfo.c
+++ b/osm/opensm/osm_sa_informinfo.c
@@ -339,9 +339,6 @@ __osm_infr_rcv_respond(
 
   p_resp_infr = (ib_inform_info_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
 
-  /* confirm success */
-  p_resp_infr->subscribe = 1;
-
   status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw,  FALSE );
 
   if ( status != IB_SUCCESS )
@@ -754,6 +751,20 @@ osm_infr_rcv_process_set_method(
     goto Exit;
   }
 
+  /* Subscribe values above 1 are undefined */
+  if (p_recvd_inform_info->subscribe > 1)
+  {
+    cl_plock_release( p_rcv->p_lock );
+
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_infr_rcv_process_set_method: ERR 4308 "
+             "Invalid subscribe: %d\n",
+             p_recvd_inform_info->subscribe
+             );
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_REQ_INVALID );
+    goto Exit;
+  }
+
   /*
    * MODIFICATIONS DONE ON INCOMING REQUEST:
    *


From halr at voltaire.com  Wed Dec  6 14:46:14 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 17:46:14 -0500
Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_ucast_updn.c: In
 updn_init, add routine exit osm_log message for an error case
Message-ID: <1165445137.25587.147342.camel@hal.voltaire.com>

OpenSM/osm_ucast_updn.c: In updn_init, add routine exit osm_log message
for an error case

Also, some cosmetic changes

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_ucast_updn.c b/osm/opensm/osm_ucast_updn.c
index 7e6a6d5..b0ea721 100644
--- a/osm/opensm/osm_ucast_updn.c
+++ b/osm/opensm/osm_ucast_updn.c
@@ -33,7 +33,6 @@
  *
  */
 
-
 /*
  * Abstract:
  *      Implementation of Up Down Algorithm using ranking & Min Hop
@@ -272,7 +271,7 @@ __updn_bfs_by_node(
              "__updn_bfs_by_node:"
              "Update Min Hop Table of GUID 0x%" PRIx64 "\n",
              cl_ntoh64(p_port->guid) );
-    osm_switch_set_hops(p_self_node, root_lid , 0, 0);
+    osm_switch_set_hops(p_self_node, root_lid, 0, 0);
   }
   else
   {
@@ -598,7 +597,7 @@ updn_init(
   if (!p_list)
   {
     status = IB_ERROR;
-    goto Exit_Bad;
+    goto Exit;
   }
 
   cl_list_construct( p_list );
@@ -630,7 +629,7 @@ updn_init(
     {
       if (strcspn(line, " ,;.") == strlen(line))
       {
-        /* Skip empty Lines anywhere in the file - only one char means the Null termination */
+        /* Skip empty lines anywhere in the file - only one char means the Null termination */
         if (strlen(line) > 1)
         {
           p_tmp = malloc(sizeof(uint64_t));
@@ -670,12 +669,8 @@ updn_init(
   }
   /* If auto mode detection required - will be executed in main b4 the assignment of UI Ucast */
 
-  goto Exit;
-
-  Exit_Bad :
-    return 1;
-  Exit :
-    OSM_LOG_EXIT( &p_osm->log );
+Exit :
+  OSM_LOG_EXIT( &p_osm->log );
   return (status);
 }
 

From halr at voltaire.com  Wed Dec  6 15:20:41 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 06 Dec 2006 18:20:41 -0500
Subject: [openib-general] osm: More simulation faiures on trunk
In-Reply-To: <45773C89.9060901@mellanox.co.il>
References: <45770599.7080005@mellanox.co.il>
	<1165434717.25587.140668.camel@hal.voltaire.com>
	<45773C89.9060901@mellanox.co.il>
Message-ID: <1165447232.25587.148638.camel@hal.voltaire.com>

Hi Eitan,

On Wed, 2006-12-06 at 16:56, Eitan Zahavi wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Wed, 2006-12-06 at 13:02, Eitan Zahavi wrote:
> >   
> >> Hi Hal,
> >>
> >> Looks like the osm.fdbs file is now created with "UNREACHABLE" mark when 
> >> opensm
> >> is invoked with updn routing engine.
> >>     
> >
> > Are you referring to certain LIDs being UNREACHABLE like this:
> > LID    : Port : Hops : Optimal
> > 0x0001 : UNREACHABLE
> > 0x0002 : UNREACHABLE
> > 0x0003 : 000  : 00   : yes
> > 0x0004 : 001  : 02   : yes
> > 0x0005 : 003  : 02   : yes
> > 0x0006 : 001  : 01   : yes
> > 0x0007 : UNREACHABLE
> > 0x0008 : UNREACHABLE
> > 0x0009 : UNREACHABLE
> > 0x000A : 001  : 02   : yes
> > 0x000B : 003  : 02   : yes
> >
> > So should UNREACHABLE LIDs just not be put into the file ? Or is it
> > something else ?
> >
> >   
> The UNREACHABLE is fine.
> The problem is that ALL LFTs are full of UNREACHABLE. Actually there are 
> no reachable nodes ...

Weird. It works on my topology (with UPDN).

> >> I will be working on finding what changed between OFED 1.1 and the trunk.
> >>     
> >
> > It was likely introduced by the changes to the routing engines committed
> > yesterday and sent on the list in late November. git-bisect can help
> > isolate exactly which change.
> >   
> Thanks I will follow that trail
> >   
> >> This is another cause for the failure of all osmMulticastRoutingTest and 
> >> osmStability tests runs.
> >>     
> >
> >   
> >> Another one would be the change of the osm.mcfdbs which is parsed by 
> >> IBDM too.
> >>     
> >
> > Are you asking about the other patch again ?
> >   
> Do you have an estimate for when my patch will be merged ?

I already answered this in an earlier email.

-- Hal

> > -- Hal
> >
> >   
> >> Eitan
> >>
> >>     
> >
> >
> >   
> 


From boris at mellanox.com  Wed Dec  6 15:54:16 2006
From: boris at mellanox.com (Boris Shpolyansky)
Date: Wed, 6 Dec 2006 15:54:16 -0800
Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem
Message-ID: <1E3DCD1C63492545881FACB6063A57C16E4132@mtiexch01.mti.com>

Which IBGD version are you using?

Boris Shpolyansky
Application Engineer
Mellanox Technologies Inc.
2900 Stender Way
Santa Clara, CA 95054
Tel.: (408) 916 0014
Fax: (408) 970 3403
Cell: (408) 834 9365
www.mellanox.com
 

-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of vishal
Sent: Wednesday, December 06, 2006 1:23 PM
To: openib-general at openib.org
Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem

Hi,

      Was trying to install IBGOLD on Red Hat 4 (x86_64), and the
following is the 'error' part from a log file. I couldn't find the
-Xcompiler option in the gcc manual. Am I missing something ?

configure:2466: $? = 0
configure:2468: gcc -v </dev/null >&5
Reading specs from /usr/lib/gcc/x86_64-redhat-linux/3.4.6/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--disable-checking --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-java-awt=gtk
--host=x86_64-redhat-linux
Thread model: posix
gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)
configure:2471: $? = 0
configure:2473: gcc -V </dev/null >&5
gcc: `-V' option must have argument
configure:2476: $? = 1
configure:2499: checking for C compiler default output file name
configure:2502: gcc -m32  -m32 -Xcompiler -m32 conftest.c  >&5
gcc: unrecognized option `-Xcompiler'
/usr/bin/ld: crt1.o: No such file: No such file or directory
collect2: ld returned 1 exit status


Thanks!

Vishal




From gregkh at suse.de  Wed Dec  6 22:12:01 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:12:01 -0800
Subject: [openib-general] patch
 pci-only-check-the-ht-capability-bits-in-mpic.c.patch added to gregkh-2.6
 tree
In-Reply-To: <20061122072626.94B1A67C3C@ozlabs.org>
Message-ID: <20061207061210.04E90A609F3@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Only check the HT capability bits in mpic.c

to my gregkh-2.6 tree.  Its filename is

     pci-only-check-the-ht-capability-bits-in-mpic.c.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael at ozlabs.org Tue Nov 21 23:26:32 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
CC: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W. Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:22 +1100
Subject: PCI: Only check the HT capability bits in mpic.c
Message-Id: <20061122072626.94B1A67C3C at ozlabs.org>

Only compare the exact HT capability bits against HT_CAPTYPE_IRQ.
This is a little paranoid, but doesn't hurt.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 arch/powerpc/sysdev/mpic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- gregkh-2.6.orig/arch/powerpc/sysdev/mpic.c
+++ gregkh-2.6/arch/powerpc/sysdev/mpic.c
@@ -390,7 +390,7 @@ static void __init mpic_scan_ht_pic(stru
 		u8 id = readb(devbase + pos + PCI_CAP_LIST_ID);
 		if (id == PCI_CAP_ID_HT) {
 			id = readb(devbase + pos + 3);
-			if (id == HT_CAPTYPE_IRQ)
+			if ((id & HT_5BIT_CAP_MASK) == HT_CAPTYPE_IRQ)
 				break;
 		}
 	}


Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are


From gregkh at suse.de  Wed Dec  6 22:12:04 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:12:04 -0800
Subject: [openib-general] patch
 pci-use-pci_find_ht_capability-in-drivers-pci-htirq.c.patch added to
 gregkh-2.6 tree
In-Reply-To: <20061122072623.ECBFE67C38@ozlabs.org>
Message-ID: <20061207061213.4F333A60A6D@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Use pci_find_ht_capability() in drivers/pci/htirq.c

to my gregkh-2.6 tree.  Its filename is

     pci-use-pci_find_ht_capability-in-drivers-pci-htirq.c.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael at ozlabs.org Tue Nov 21 23:26:32 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
CC: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W. Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:19 +1100
Subject: PCI: Use pci_find_ht_capability() in drivers/pci/htirq.c
Message-Id: <20061122072623.ECBFE67C38 at ozlabs.org>

Use pci_find_ht_capability() in drivers/pci/htirq.c

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 drivers/pci/htirq.c |    9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

--- gregkh-2.6.orig/drivers/pci/htirq.c
+++ gregkh-2.6/drivers/pci/htirq.c
@@ -99,14 +99,7 @@ int __ht_create_irq(struct pci_dev *dev,
 	int pos;
 	int irq;
 
-	pos = pci_find_capability(dev, PCI_CAP_ID_HT);
-	while (pos) {
-		u8 subtype;
-		pci_read_config_byte(dev, pos + 3, &subtype);
-		if (subtype == HT_CAPTYPE_IRQ)
-			break;
-		pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT);
-	}
+	pos = pci_find_ht_capability(dev, HT_CAPTYPE_IRQ);
 	if (!pos)
 		return -EINVAL;
 

Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are


From gregkh at suse.de  Wed Dec  6 22:11:57 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:11:57 -0800
Subject: [openib-general] patch
 pci-create-__pci_bus_find_cap_start-from-__pci_bus_find_cap.patch added to
 gregkh-2.6 tree
In-Reply-To: <20061122072621.BC4B967C35@ozlabs.org>
Message-ID: <20061207061206.7DDE6A606D8@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Create __pci_bus_find_cap_start() from __pci_bus_find_cap()

to my gregkh-2.6 tree.  Its filename is

     pci-create-__pci_bus_find_cap_start-from-__pci_bus_find_cap.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:32 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W.Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:16 +1100
Subject: PCI: Create __pci_bus_find_cap_start() from __pci_bus_find_cap()
Message-Id: <20061122072621.BC4B967C35 at ozlabs.org>

The current implementation of __pci_bus_find_cap() does two things:
first it determines the start of the capability chain for the device,
and then it tries to find the requested capability.

Split these out, so that we can use the two parts independently in
a subsequent patch. Externally visible behaviour should be unchanged.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 drivers/pci/pci.c |   28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

--- gregkh-2.6.orig/drivers/pci/pci.c
+++ gregkh-2.6/drivers/pci/pci.c
@@ -96,10 +96,10 @@ int pci_find_next_capability(struct pci_
 }
 EXPORT_SYMBOL_GPL(pci_find_next_capability);
 
-static int __pci_bus_find_cap(struct pci_bus *bus, unsigned int devfn, u8 hdr_type, int cap)
+static int __pci_bus_find_cap_start(struct pci_bus *bus,
+				    unsigned int devfn, u8 hdr_type)
 {
 	u16 status;
-	u8 pos;
 
 	pci_bus_read_config_word(bus, devfn, PCI_STATUS, &status);
 	if (!(status & PCI_STATUS_CAP_LIST))
@@ -108,15 +108,14 @@ static int __pci_bus_find_cap(struct pci
 	switch (hdr_type) {
 	case PCI_HEADER_TYPE_NORMAL:
 	case PCI_HEADER_TYPE_BRIDGE:
-		pos = PCI_CAPABILITY_LIST;
-		break;
+		return PCI_CAPABILITY_LIST;
 	case PCI_HEADER_TYPE_CARDBUS:
-		pos = PCI_CB_CAPABILITY_LIST;
-		break;
+		return PCI_CB_CAPABILITY_LIST;
 	default:
 		return 0;
 	}
-	return __pci_find_next_cap(bus, devfn, pos, cap);
+
+	return 0;
 }
 
 /**
@@ -140,7 +139,13 @@ static int __pci_bus_find_cap(struct pci
  */
 int pci_find_capability(struct pci_dev *dev, int cap)
 {
-	return __pci_bus_find_cap(dev->bus, dev->devfn, dev->hdr_type, cap);
+	int pos;
+
+	pos = __pci_bus_find_cap_start(dev->bus, dev->devfn, dev->hdr_type);
+	if (pos)
+		pos = __pci_find_next_cap(dev->bus, dev->devfn, pos, cap);
+
+	return pos;
 }
 
 /**
@@ -158,11 +163,16 @@ int pci_find_capability(struct pci_dev *
  */
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap)
 {
+	int pos;
 	u8 hdr_type;
 
 	pci_bus_read_config_byte(bus, devfn, PCI_HEADER_TYPE, &hdr_type);
 
-	return __pci_bus_find_cap(bus, devfn, hdr_type & 0x7f, cap);
+	pos = __pci_bus_find_cap_start(bus, devfn, hdr_type & 0x7f);
+	if (pos)
+		pos = __pci_find_next_cap(bus, devfn, pos, cap);
+
+	return pos;
 }
 
 /**


Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are
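
A user-space sketch of the first half of the split, the "where does the
capability chain start" step, taken out of the kernel context. The
function name cap_list_start is hypothetical; the register values are the
usual ones from pci_regs.h, restated here so the fragment compiles on its
own. The status word and header type are passed in directly instead of
being read from the bus.

```c
#include <stdint.h>

#define PCI_STATUS_CAP_LIST      0x10  /* Status bit: capability list present */
#define PCI_CAPABILITY_LIST      0x34  /* Pointer offset, normal/bridge header */
#define PCI_CB_CAPABILITY_LIST   0x14  /* Pointer offset, CardBus header */
#define PCI_HEADER_TYPE_NORMAL   0
#define PCI_HEADER_TYPE_BRIDGE   1
#define PCI_HEADER_TYPE_CARDBUS  2

/* Returns the config-space offset of the first capability pointer,
 * or 0 if the device advertises no capability list or the header
 * type is unknown. */
int cap_list_start(uint16_t status, uint8_t hdr_type)
{
    if (!(status & PCI_STATUS_CAP_LIST))
        return 0;

    switch (hdr_type) {
    case PCI_HEADER_TYPE_NORMAL:
    case PCI_HEADER_TYPE_BRIDGE:
        return PCI_CAPABILITY_LIST;
    case PCI_HEADER_TYPE_CARDBUS:
        return PCI_CB_CAPABILITY_LIST;
    default:
        return 0;
    }
}
```

A caller then walks the chain only when the result is non-zero, which is
the "if (pos) pos = ..." pattern in the hunks above. As in
pci_bus_find_capability(), a raw header type byte would be masked with
0x7f before being passed in.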


From gregkh at suse.de  Wed Dec  6 22:12:08 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:12:08 -0800
Subject: [openib-general] patch
 pci-use-pci_find_ht_capability-in-drivers-pci-quirks.c.patch added to
 gregkh-2.6 tree
In-Reply-To: <20061122072625.8B07767C3B@ozlabs.org>
Message-ID: <20061207061216.C71179A88F5@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Use pci_find_ht_capability() in drivers/pci/quirks.c

to my gregkh-2.6 tree.  Its filename is

     pci-use-pci_find_ht_capability-in-drivers-pci-quirks.c.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael at ozlabs.org Tue Nov 21 23:26:32 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
CC: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W. Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:21 +1100
Subject: PCI: Use pci_find_ht_capability() in drivers/pci/quirks.c
Message-Id: <20061122072625.8B07767C3B at ozlabs.org>

Use pci_find_ht_capability() in drivers/pci/quirks.c.

I'm pretty sure the logic is unchanged here, but someone please eye-ball it
for me. I've changed the message to be a little shorter, it's now:

PCI: Found (enabled|disabled) HT MSI mapping on xxxx:xx:xx.x

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 drivers/pci/quirks.c |   28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

--- gregkh-2.6.orig/drivers/pci/quirks.c
+++ gregkh-2.6/drivers/pci/quirks.c
@@ -1644,19 +1644,23 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AM
  * return 1 if a HT MSI capability is found and enabled */
 static int __devinit msi_ht_cap_enabled(struct pci_dev *dev)
 {
-	u8 pos;
-	int ttl;
-	for (pos = pci_find_capability(dev, PCI_CAP_ID_HT), ttl = 48;
-	     pos && ttl;
-	     pos = pci_find_next_capability(dev, pos, PCI_CAP_ID_HT), ttl--) {
-		u32 cap_hdr;
-		/* MSI mapping section according to Hypertransport spec */
-		if (pci_read_config_dword(dev, pos, &cap_hdr) == 0
-		    && (cap_hdr & 0xf8000000) == 0xa8000000 /* MSI mapping */) {
-			printk(KERN_INFO "PCI: Found HT MSI mapping on %s with capability %s\n",
-			       pci_name(dev), cap_hdr & 0x10000 ? "enabled" : "disabled");
-			return (cap_hdr & 0x10000) != 0; /* MSI mapping cap enabled */
+	int pos, ttl = 48;
+
+	pos = pci_find_ht_capability(dev, HT_CAPTYPE_MSI_MAPPING);
+	while (pos && ttl--) {
+		u8 flags;
+
+		if (pci_read_config_byte(dev, pos + HT_MSI_FLAGS,
+					 &flags) == 0)
+		{
+			printk(KERN_INFO "PCI: Found %s HT MSI Mapping on %s\n",
+				flags & HT_MSI_FLAGS_ENABLE ?
+				"enabled" : "disabled", pci_name(dev));
+			return (flags & HT_MSI_FLAGS_ENABLE) != 0;
 		}
+
+		pos = pci_find_next_ht_capability(dev, pos,
+						  HT_CAPTYPE_MSI_MAPPING);
 	}
 	return 0;
 }


Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are
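
For anyone wanting to eye-ball the logic as requested, here is a
user-space sketch that puts the old dword test and the new byte tests side
by side. The values of HT_CAPTYPE_MSI_MAPPING, HT_MSI_FLAGS and
HT_MSI_FLAGS_ENABLE are not shown in this mail; the ones below are
inferred from the constants in the removed code (0xa8000000 and 0x10000)
and should be treated as assumptions. The function names are hypothetical,
and config space is modelled as a single little-endian 32-bit header.

```c
#include <stdint.h>

#define HT_5BIT_CAP_MASK        0xF8
#define HT_CAPTYPE_MSI_MAPPING  0xA8  /* assumed, from 0xa8000000 in the old code */
#define HT_MSI_FLAGS            0x02  /* assumed byte offset, from bit 16 (0x10000) */
#define HT_MSI_FLAGS_ENABLE     0x01  /* assumed, bit 0 of that byte */

/* Old logic: 1 = mapping enabled, 0 = mapping disabled,
 * -1 = this HT capability is not an MSI mapping capability. */
int msi_mapping_old(uint32_t cap_hdr)
{
    if ((cap_hdr & 0xf8000000u) != 0xa8000000u)
        return -1;
    return (cap_hdr & 0x10000u) != 0;
}

/* New logic expressed on the same header: the subtype is the byte at
 * pos + 3 and the flags are the byte at pos + HT_MSI_FLAGS. */
int msi_mapping_new(uint32_t cap_hdr)
{
    uint8_t subtype = (uint8_t)(cap_hdr >> 24);
    uint8_t flags = (uint8_t)(cap_hdr >> (8 * HT_MSI_FLAGS));

    if ((subtype & HT_5BIT_CAP_MASK) != HT_CAPTYPE_MSI_MAPPING)
        return -1;
    return (flags & HT_MSI_FLAGS_ENABLE) != 0;
}
```

Under those assumed values the two functions return the same result for
every combination of the upper 16 header bits, which is the part both
versions look at.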


From gregkh at suse.de  Wed Dec  6 22:11:54 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:11:54 -0800
Subject: [openib-general] patch
 pci-add-pci_find_ht_capability-for-finding-hypertransport-capabilities.patch
 added to gregkh-2.6 tree
In-Reply-To: <20061122072622.E2A8967C37@ozlabs.org>
Message-ID: <20061207061202.977658A4C6C@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Add pci_find_ht_capability() for finding Hypertransport capabilities

to my gregkh-2.6 tree.  Its filename is

     pci-add-pci_find_ht_capability-for-finding-hypertransport-capabilities.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:37 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W.Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:18 +1100
Subject: PCI: Add pci_find_ht_capability() for finding Hypertransport capabilities
Message-Id: <20061122072622.E2A8967C37 at ozlabs.org>

From: Michael Ellerman <michael at ellerman.id.au>

There are already several places in the kernel that want to search a PCI
device for a given Hypertransport capability. Although this is possible
using pci_find_capability() etc., it makes sense to encapsulate that
logic in a helper - pci_find_ht_capability().

To cater for searching exhaustively for a capability, we also provide
pci_find_next_ht_capability().

We also need to cater for the fact that the HT capability fields may be
either 3 or 5 bits wide. pci_find_ht_capability() deals with this for you,
but callers using the #defines directly must handle that themselves.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 drivers/pci/pci.c        |   84 +++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/pci.h      |    2 +
 include/linux/pci_regs.h |   12 ++++++
 3 files changed, 94 insertions(+), 4 deletions(-)

--- gregkh-2.6.orig/drivers/pci/pci.c
+++ gregkh-2.6/drivers/pci/pci.c
@@ -68,12 +68,14 @@ pci_max_busnr(void)
 
 #endif  /*  0  */
 
-static int __pci_find_next_cap(struct pci_bus *bus, unsigned int devfn, u8 pos, int cap)
+#define PCI_FIND_CAP_TTL	48
+
+static int __pci_find_next_cap_ttl(struct pci_bus *bus, unsigned int devfn,
+				   u8 pos, int cap, int *ttl)
 {
 	u8 id;
-	int ttl = 48;
 
-	while (ttl--) {
+	while ((*ttl)--) {
 		pci_bus_read_config_byte(bus, devfn, pos, &pos);
 		if (pos < 0x40)
 			break;
@@ -89,6 +91,14 @@ static int __pci_find_next_cap(struct pc
 	return 0;
 }
 
+static int __pci_find_next_cap(struct pci_bus *bus, unsigned int devfn,
+			       u8 pos, int cap)
+{
+	int ttl = PCI_FIND_CAP_TTL;
+
+	return __pci_find_next_cap_ttl(bus, devfn, pos, cap, &ttl);
+}
+
 int pci_find_next_capability(struct pci_dev *dev, u8 pos, int cap)
 {
 	return __pci_find_next_cap(dev->bus, dev->devfn,
@@ -224,6 +234,74 @@ int pci_find_ext_capability(struct pci_d
 }
 EXPORT_SYMBOL_GPL(pci_find_ext_capability);
 
+static int __pci_find_next_ht_cap(struct pci_dev *dev, int pos, int ht_cap)
+{
+	int rc, ttl = PCI_FIND_CAP_TTL;
+	u8 cap, mask;
+
+	if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST)
+		mask = HT_3BIT_CAP_MASK;
+	else
+		mask = HT_5BIT_CAP_MASK;
+
+	pos = __pci_find_next_cap_ttl(dev->bus, dev->devfn, pos,
+				      PCI_CAP_ID_HT, &ttl);
+	while (pos) {
+		rc = pci_read_config_byte(dev, pos + 3, &cap);
+		if (rc != PCIBIOS_SUCCESSFUL)
+			return 0;
+
+		if ((cap & mask) == ht_cap)
+			return pos;
+
+		pos = __pci_find_next_cap_ttl(dev->bus, dev->devfn, pos,
+					      PCI_CAP_ID_HT, &ttl);
+	}
+
+	return 0;
+}
+/**
+ * pci_find_next_ht_capability - query a device's Hypertransport capabilities
+ * @dev: PCI device to query
+ * @pos: Position from which to continue searching
+ * @ht_cap: Hypertransport capability code
+ *
+ * To be used in conjunction with pci_find_ht_capability() to search for
+ * all capabilities matching @ht_cap. @pos should always be a value returned
+ * from pci_find_ht_capability().
+ *
+ * NB. To be 100% safe against broken PCI devices, the caller should take
+ * steps to avoid an infinite loop.
+ */
+int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap)
+{
+	return __pci_find_next_ht_cap(dev, pos + PCI_CAP_LIST_NEXT, ht_cap);
+}
+EXPORT_SYMBOL_GPL(pci_find_next_ht_capability);
+
+/**
+ * pci_find_ht_capability - query a device's Hypertransport capabilities
+ * @dev: PCI device to query
+ * @ht_cap: Hypertransport capability code
+ *
+ * Tell if a device supports a given Hypertransport capability.
+ * Returns an address within the device's PCI configuration space
+ * or 0 in case the device does not support the request capability.
+ * The address points to the PCI capability, of type PCI_CAP_ID_HT,
+ * which has a Hypertransport capability matching @ht_cap.
+ */
+int pci_find_ht_capability(struct pci_dev *dev, int ht_cap)
+{
+	int pos;
+
+	pos = __pci_bus_find_cap_start(dev->bus, dev->devfn, dev->hdr_type);
+	if (pos)
+		pos = __pci_find_next_ht_cap(dev, pos, ht_cap);
+
+	return pos;
+}
+EXPORT_SYMBOL_GPL(pci_find_ht_capability);
+
 /**
  * pci_find_parent_resource - return resource region of parent bus of given region
  * @dev: PCI device structure contains resources to be searched
--- gregkh-2.6.orig/include/linux/pci.h
+++ gregkh-2.6/include/linux/pci.h
@@ -454,6 +454,8 @@ struct pci_dev *pci_find_slot (unsigned 
 int pci_find_capability (struct pci_dev *dev, int cap);
 int pci_find_next_capability (struct pci_dev *dev, u8 pos, int cap);
 int pci_find_ext_capability (struct pci_dev *dev, int cap);
+int pci_find_ht_capability (struct pci_dev *dev, int ht_cap);
+int pci_find_next_ht_capability (struct pci_dev *dev, int pos, int ht_cap);
 struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
 
 struct pci_dev *pci_get_device(unsigned int vendor, unsigned int device,
--- gregkh-2.6.orig/include/linux/pci_regs.h
+++ gregkh-2.6/include/linux/pci_regs.h
@@ -475,9 +475,19 @@
 #define PCI_PWR_CAP		12	/* Capability */
 #define  PCI_PWR_CAP_BUDGET(x)	((x) & 1)	/* Included in system budget */
 
-/* Hypertransport sub capability types */
+/*
+ * Hypertransport sub capability types
+ *
+ * Unfortunately there are both 3 bit and 5 bit capability types defined
+ * in the HT spec, catering for that is a little messy. You probably don't
+ * want to use these directly, just use pci_find_ht_capability() and it
+ * will do the right thing for you.
+ */
+#define HT_3BIT_CAP_MASK	0xE0
 #define HT_CAPTYPE_SLAVE	0x00	/* Slave/Primary link configuration */
 #define HT_CAPTYPE_HOST		0x20	/* Host/Secondary link configuration */
+
+#define HT_5BIT_CAP_MASK	0xF8
 #define HT_CAPTYPE_IRQ		0x80	/* IRQ Configuration */
 #define HT_CAPTYPE_REMAPPING_40	0xA0	/* 40 bit address remapping */
 #define HT_CAPTYPE_REMAPPING_64 0xA2	/* 64 bit address remapping */
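
For illustration only, the mask selection described in the comment above can be tried out in userspace. This is a minimal sketch, not part of the patch: ht_cap_matches() and ht_cap_examples() are hypothetical names, and the constants are simply copied from the hunk above. It mirrors the comparison done inside __pci_find_next_ht_cap(), where slave and host types are compared on the top 3 bits and everything else on the top 5 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the pci_regs.h hunk above. */
#define HT_3BIT_CAP_MASK	0xE0
#define HT_5BIT_CAP_MASK	0xF8
#define HT_CAPTYPE_SLAVE	0x00
#define HT_CAPTYPE_HOST		0x20
#define HT_CAPTYPE_IRQ		0x80

/*
 * Hypothetical helper (not part of the patch): returns non-zero when the
 * capability-type byte matches ht_cap under the mask chosen for ht_cap.
 */
int ht_cap_matches(uint8_t cap_byte, int ht_cap)
{
	uint8_t mask;

	if (ht_cap == HT_CAPTYPE_SLAVE || ht_cap == HT_CAPTYPE_HOST)
		mask = HT_3BIT_CAP_MASK;
	else
		mask = HT_5BIT_CAP_MASK;

	return (cap_byte & mask) == ht_cap;
}

/* Usage: bits outside the selected mask are ignored by the comparison. */
void ht_cap_examples(void)
{
	assert(ht_cap_matches(0x3C, HT_CAPTYPE_HOST));	/* 0x3C & 0xE0 == 0x20 */
	assert(ht_cap_matches(0x87, HT_CAPTYPE_IRQ));	/* 0x87 & 0xF8 == 0x80 */
	assert(!ht_cap_matches(0x88, HT_CAPTYPE_IRQ));	/* 0x88 & 0xF8 == 0x88 */
}
```

In-kernel callers would not need any of this; per the comment in the patch, they would just call pci_find_ht_capability().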


Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are


From gregkh at suse.de  Wed Dec  6 22:11:50 2006
From: gregkh at suse.de (gregkh at suse.de)
Date: Wed, 06 Dec 2006 22:11:50 -0800
Subject: [openib-general] patch
 pci-add-defines-for-hypertransport-msi-fields.patch added to gregkh-2.6
 tree
In-Reply-To: <20061122072624.B86A167C3A@ozlabs.org>
Message-ID: <20061207061158.F2300A60904@imap.suse.de>


This is a note to let you know that I've just added the patch titled

     Subject: PCI: Add #defines for Hypertransport MSI fields

to my gregkh-2.6 tree.  Its filename is

     pci-add-defines-for-hypertransport-msi-fields.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci at atrey.karlin.mff.cuni.cz Tue Nov 21 23:26:46 2006
From: Michael Ellerman <michael at ellerman.id.au>
To: linux-pci at atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg at kroah.com>, Benjamin Herrenschmidt <benh at kernel.crashing.org>, Eric W.Biederman <ebiederm at xmission.com>, Segher Boessenkool <segher at kernel.crashing.org>, <support at pathscale.com>, <openib-general at openib.org>, <brice at myri.com>
Date: Wed, 22 Nov 2006 18:26:20 +1100
Subject: PCI: Add #defines for Hypertransport MSI fields
Message-Id: <20061122072624.B86A167C3A at ozlabs.org>

Add a few #defines for grabbing and working with the address fields
in a HT_CAPTYPE_MSI_MAPPING capability. All from the HT spec v3.00.

Signed-off-by: Michael Ellerman <michael at ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh at suse.de>

---
 include/linux/pci_regs.h |    7 +++++++
 1 file changed, 7 insertions(+)

--- gregkh-2.6.orig/include/linux/pci_regs.h
+++ gregkh-2.6/include/linux/pci_regs.h
@@ -494,6 +494,13 @@
 #define HT_CAPTYPE_UNITID_CLUMP	0x90	/* Unit ID clumping */
 #define HT_CAPTYPE_EXTCONF	0x98	/* Extended Configuration Space Access */
 #define HT_CAPTYPE_MSI_MAPPING	0xA8	/* MSI Mapping Capability */
+#define  HT_MSI_FLAGS		0x02		/* Offset to flags */
+#define  HT_MSI_FLAGS_ENABLE	0x1		/* Mapping enable */
+#define  HT_MSI_FLAGS_FIXED	0x2		/* Fixed mapping only */
+#define  HT_MSI_FIXED_ADDR	0x00000000FEE00000ULL	/* Fixed addr */
+#define  HT_MSI_ADDR_LO		0x04		/* Offset to low addr bits */
+#define  HT_MSI_ADDR_LO_MASK	0xFFF00000	/* Low address bit mask */
+#define  HT_MSI_ADDR_HI		0x08		/* Offset to high addr bits */
 #define HT_CAPTYPE_DIRECT_ROUTE	0xB0	/* Direct routing configuration */
 #define HT_CAPTYPE_VCSET	0xB8	/* Virtual Channel configuration */
 #define HT_CAPTYPE_ERROR_RETRY	0xC0	/* Retry on error configuration */
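
As a rough userspace sketch of how these fields might be combined (an assumption drawn from the names and comments of the defines, not something the patch itself contains): if the fixed flag is set the mapping address is taken to be HT_MSI_FIXED_ADDR, otherwise it is assembled from the masked low word and the high word. ht_msi_address() and ht_msi_examples() are hypothetical names, and how the three register values are read from config space is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the pci_regs.h hunk above. */
#define HT_MSI_FLAGS_ENABLE	0x1
#define HT_MSI_FLAGS_FIXED	0x2
#define HT_MSI_FIXED_ADDR	0x00000000FEE00000ULL
#define HT_MSI_ADDR_LO_MASK	0xFFF00000

/*
 * Hypothetical helper: flags, addr_lo and addr_hi stand for the values found
 * at offsets HT_MSI_FLAGS, HT_MSI_ADDR_LO and HT_MSI_ADDR_HI of the capability.
 */
uint64_t ht_msi_address(unsigned int flags, uint32_t addr_lo, uint32_t addr_hi)
{
	/* Assumption: "fixed mapping only" means the fixed address applies. */
	if (flags & HT_MSI_FLAGS_FIXED)
		return HT_MSI_FIXED_ADDR;

	/* Keep only the address bits of the low word, then add the high word. */
	return ((uint64_t)addr_hi << 32) | (addr_lo & HT_MSI_ADDR_LO_MASK);
}

void ht_msi_examples(void)
{
	assert(ht_msi_address(HT_MSI_FLAGS_FIXED, 0x12345678, 0x1) ==
	       0xFEE00000ULL);
	assert(ht_msi_address(HT_MSI_FLAGS_ENABLE, 0xABC12345, 0x1) ==
	       0x1ABC00000ULL);
}
```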


Patches currently in gregkh-2.6 which might be from michael at ellerman.id.au are


From monil at voltaire.com  Wed Dec  6 22:17:39 2006
From: monil at voltaire.com (Moni Levy)
Date: Thu, 7 Dec 2006 08:17:39 +0200
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
 RDMA CM etc
In-Reply-To: <45770AA3.2040505@ichips.intel.com>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
	<20061206101705.GP26787@mellanox.co.il>
	<45770AA3.2040505@ichips.intel.com>
Message-ID: <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com>

Sean,
On 12/6/06, Sean Hefty <mshefty at ichips.intel.com> wrote:
> >>>I gather the ucma bits are in rdma_ucm?
>
> Yes.
>
> Basically, I reworked changes that were in svn into separate branches based off
> of 2.6.19.
>
> > 1st is probably to fix the mcast bits so that they don't crash the machine.
> > OFED will be based on whatever is merged by Linus by that time + any number of patches
> > and out of kernel modules.
>
> Even if the kernel multicast support could make it into 2.6.20, I won't have the
> multicast changes to the rdma_cm done by then.
>
> >>3rd have Sean decide how he wants the multicast support to be integrated
> >>into OFED 1.2, my guess would be as a patch set over the
> >>ib_sa/ipoib/rdma_cm and rdma_ucm but its left for him to decide
>
> Does OFED want the multicast support in 1.2?

We definitely want the multicast support in 1.2. It's on the wiki
(OFED 1.2 release plan and features) and I understood that this was
also agreed on at SC06.

-- Moni

>
> > Maybe the right thing is to split the multicast stuff in a separate library,
> > or have a separate ABI version for multicast, I don't really know.
>
> My anticipation is that the multicast support will bump the ABI, but will allow
> backwards compatibility.  The break from librdmacm ABI 2 to ABI 3 is a result of
> changing the event reporting.
>
> - Sean
>
> _______________________________________________
> openfabrics-ewg mailing list
> openfabrics-ewg at openib.org
> http://openib.org/mailman/listinfo/openfabrics-ewg
>
>


From ogerlitz at voltaire.com  Wed Dec  6 23:22:49 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 07 Dec 2006 09:22:49 +0200
Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <1165430156.14800.243.camel@brick.pathscale.com>
References: <1165430156.14800.243.camel@brick.pathscale.com>
Message-ID: <4577C149.3050900@voltaire.com>

Ralph Campbell wrote:
> This version of the patch adds support for ib_dma_alloc_coherent()
> and ib_dma_free_coherent().  It also fixes the bug Or found in
> ipath_sync_single_for_cpu() and ipath_sync_single_for_device().

> This patch implements the interposing DMA mapping functions to allow
> support for IOMMUs and remove the dependence on phys_to_virt().

Didn't you say that the ipath driver uses bus_to_virt()?

> diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c
> --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> +++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Tue Dec 05 16:04:53 2006 -0800
> +/**
> + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> + * @dev: The device for which the dma_addr is to be created
> + * @cpu_addr: The kernel virtual address
> + * @size: The size of the region in bytes
> + * @direction: The direction of the DMA
> + */
> +static u64 ipath_dma_map_single(struct ib_device *dev,
> +			        void *cpu_addr, size_t size,
> +			        enum dma_data_direction direction)
> +{
> +	BUG_ON(!valid_dma_direction(direction));
> +	return (u64) cpu_addr;
> +}

The documentation is both overkill in its volume and, worse, simply
tells a whole different story than what this code is doing. The code does
not generate a DMA address, and it does not care about the IB device, the
size or the DMA direction. The same goes for all the documentation below.

> +/**
> + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry
> + * @dev: The device for which the DMA addresses were created
> + * @sg: The scatter/gather entry
> + */
> +static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
> +{
> +	return (u64) page_address(sg->page);
> +}

This is a bug: you need to add sg->offset.
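
To illustrate the point in userspace (a sketch with made-up types, not the ipath code): a scatter/gather entry describes a buffer that starts some bytes into its page, so an address computed from the page alone points at the wrong byte. struct sketch_sg and sketch_sg_address() are hypothetical stand-ins for struct scatterlist and the quoted function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a scatter/gather entry. */
struct sketch_sg {
	char *page_base;	/* what page_address(sg->page) would return */
	unsigned int offset;	/* start of the buffer within the page */
	unsigned int length;	/* buffer length in bytes */
};

/* Address of the first byte of the buffer: page base plus offset. */
uint64_t sketch_sg_address(const struct sketch_sg *sg)
{
	assert(sg != NULL && sg->page_base != NULL);
	return (uint64_t)(uintptr_t)(sg->page_base + sg->offset);
}
```

With offset 0 the two computations agree, which is probably why such a bug can go unnoticed with page-aligned buffers.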

Or.


From mst at mellanox.co.il  Wed Dec  6 23:29:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Dec 2006 09:29:38 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9486@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9486@xmb-sjc-216.amer.cisco.com>
Message-ID: <20061207072938.GB26107@mellanox.co.il>

> Quoting r. Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>:
> Subject: RE: [openib-general] [PATCH] IPoIB CM Experimental support
> 
> > d. Limitations
> > UDP multicast and UDP connections to IPoIB UD mode
> > currently don't work since we get packets that are too large to
> > send over a UD QP.
> > As a work around, one can now create separate interfaces
> > for use with CM and UD mode.
> 
> You can't send UDP/multicast traffic at all between IPoIB CM and IPoIB
> UD?

With my experimental code, this currently works only if you manually limit the MTU
for multicast/UD addresses.
The simplest way to do this is to set up separate interfaces for CM and UD modes.

> What about UDP/multicast between IPoIB CM hosts?

As above.

-- 
MST


From mst at mellanox.co.il  Wed Dec  6 23:30:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Dec 2006 09:30:38 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <4577019C.7050900@ichips.intel.com>
References: <4577019C.7050900@ichips.intel.com>
Message-ID: <20061207073038.GC26107@mellanox.co.il>

> > Just to clarify this point - what connection messages can be lost?
> > E.g. if the passive side does not get an RTU for a while, it will
> > retry the REP, won't it?  Diagram 12.9.6 seems to indicate so:
> > from REP Sent we should go to RTU timeout, Send REP and back to REP Sent.
> > Is this implemented?
> 
> REP retries are already implemented in the ib_cm.  This handles the case where 
> the RTU is repeatedly lost, but data is still received on the connection.

Yes, I've even observed this with SDP, but I'm not sure why this
happens. It seems that MADs are sometimes lost even in back to back
configurations. Any idea why?

-- 
MST


From ogerlitz at voltaire.com  Wed Dec  6 23:35:39 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 07 Dec 2006 09:35:39 +0200
Subject: [openib-general] [PATCH v2 1/7] IB/core - Add DMA mapping
 functions to allow device drivers to interpose
In-Reply-To: <1165428994.14800.229.camel@brick.pathscale.com>
References: <1164910957.14800.71.camel@brick.pathscale.com>
	<adamz68ofup.fsf@cisco.com>
	<1164918691.14800.101.camel@brick.pathscale.com>
	<15ddcffd0612010536j61335775nc4322c16f7f658f0@mail.gmail.com>
	<56586.71.131.5.186.1165005556.squirrel@rocky.pathscale.com>
	<adau00fky3p.fsf@cisco.com>
	<43595.71.131.5.186.1165019279.squirrel@rocky.pathscale.com>
	<aday7prje7f.fsf@cisco.com>
	<49336.71.131.5.186.1165025322.squirrel@rocky.pathscale.com>
	<adad56ydqpz.fsf@cisco.com>
	<15ddcffd0612051321i252c2312m542f9e9121eac4a8@mail.gmail.com>
	<1165359560.14800.210.camel@brick.pathscale.com>
	<4576AA73.105@voltaire.com>
	<1165428994.14800.229.camel@brick.pathscale.com>
Message-ID: <4577C44B.20305@voltaire.com>

Ralph Campbell wrote:
> On Wed, 2006-12-06 at 13:33 +0200, Or Gerlitz wrote:

>> Basically what Roland suggests is that you need to implement SW IOTLB 
>> mapping from dma_addr_t (possibly offset-ed) to kv addr. And do the 
>> actual kmap/unmap calls before/after you must touch the data.
>> Is this impossible?

> It is not impossible, just inefficient.  Why add a mapping
> table when it isn't needed?  If I needed to implement HIGHMEM
> support, I would probably make "dma_addr_t" be a physical
> memory address, convert to PFN, find the struct page pointer,
> and call kmap_atomic() or page_address().  Why go through all
> that in the worst case CPU path when doing the conversion
> to kernel virtual address outside the critical path is
> feasible?

As I wrote earlier in this thread, calling kmap_atomic() **outside**
the critical path (i.e. not when the low-level ipath driver actually
reads from or writes to the page) is problematic, because it means you
hold a kmap atomic slot for a long time, which should not be done; see
e.g. LDD 3rd edition pp 418, "your code must not sleep while holding a
atomic kmap". On the other hand, you can't just call kmap(), since you
might be in a non-sleepable context (e.g. a SCSI LLD such as SRP/iSER
calling ib_dma_map_sg() etc.).

So you might be able to follow your approach of physical --> pfn -->
page --> kmap_atomic (I think you don't need to bother checking whether
page_address() is NULL, since kmap is a no-op when the page is already
mapped), but do it when you actually need the mapping.

Or.


From sean.hefty at intel.com  Thu Dec  7 00:45:49 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 7 Dec 2006 00:45:49 -0800
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <20061207073038.GC26107@mellanox.co.il>
Message-ID: <000001c719dc$185ac590$30cc180a@amr.corp.intel.com>

>Yes, I've even observed this with SDP, but I'm not sure why this
>happens. It seems that MADs are sometimes lost even in back to back
>configurations. Any idea why?

I have no idea why MADs would be lost.  In our scale up testing, we *never* saw
lost or dropped MADs to the SA node, even when hitting it with 500,000 queries.
The fact that you're seeing lost MADs is something that we should probably look
into more, someday, hopefully, when I have more time available...  We didn't
notice any issues with the CM messages in our testing, so we didn't examine that
traffic in more detail.

Are there counters for QP0/1 that can let us know whether drops are occurring on
the send or receive side?

- Sean


From boris at lfbs.RWTH-Aachen.DE  Thu Dec  7 01:26:28 2006
From: boris at lfbs.RWTH-Aachen.DE (Boris Bierbaum)
Date: Thu, 07 Dec 2006 10:26:28 +0100
Subject: [openib-general] Status of DAT conformance test
Message-ID: <4577DE44.5030308@lfbs.rwth-aachen.de>

Hi,

I'm looking for ways to test the standard conformance of a uDAPL
provider. I had a look at the DAT conformance test contained in the DAPL
reference implementation, release version gamma 3.2.

This test doesn't seem to be in a state in which it can be used to test
a uDAPL version 1.2 provider. Is anybody working to fix this?

Which test programs can be recommended for this purpose?

Thanks
Boris

-- 
|  _  RWTH | Boris Bierbaum
|_|_`_     | Lehrstuhl fuer Betriebssysteme
   | |_) _  | RWTH Aachen D-52056 Aachen
     |_)(_` | Tel: +49-241-80-27805
        ._) | Fax: +49-241-80-22339


From mst at mellanox.co.il  Thu Dec  7 01:54:18 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Dec 2006 11:54:18 +0200
Subject: [openib-general] [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early
 transition to RTS to handle lost CM messages
In-Reply-To: <000001c719dc$185ac590$30cc180a@amr.corp.intel.com>
References: <20061207073038.GC26107@mellanox.co.il>
	<000001c719dc$185ac590$30cc180a@amr.corp.intel.com>
Message-ID: <20061207095418.GA2614@mellanox.co.il>

> Quoting r. Sean Hefty <sean.hefty at intel.com>:
> Subject: Re: [PATCH 3/5 v3] 2.6.20 rdma/cma: allow early transition to RTS to handle lost CM messages
> 
> >Yes, I've even observed this with SDP, but I'm not sure why this
> >happens. It seems that MADs are sometimes lost even in back to back
> >configurations. Any idea why?
> 
> I have no idea why MADs would be lost.  In our scale up testing, we *never* saw
> lost or dropped MADs to the SA node, even when hitting it with 500,000 queries.
> The fact that you're seeing lost MADs is something that we should probably look
> into more, someday, hopefully, when I have more time available...  We didn't
> notice any issues with the CM messages in our testing, so we didn't examine that
> traffic in more detail.

Note I only see CM message drops.
I had to use rdma_establish and send an extra send after start in SDP
to trigger it, but path resolution was always working fine.

> Are there counters for QP0/1 that can let us know whether drops are occurring on
> the send or receive side?

Not sure what you mean. Let's just count the send/receive completions in the MAD layer.

-- 
MST


From poknam at gmail.com  Thu Dec  7 02:02:00 2006
From: poknam at gmail.com (Lai Dragonfly)
Date: Thu, 7 Dec 2006 18:02:00 +0800
Subject: [openib-general] Automatically connect to SRP target
Message-ID: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>

Hi all,

I'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 on the clients and
IBGD-1.8.2-srpt on the targets.
I found that even if I use "modprobe ib_srp" or set SRP_LOAD=yes in openib.conf,
I cannot find the SRP targets.
Only after I execute "srp_daemon -e -o" do all the targets appear as
/dev/sdX.

Since I want to export the targets to other nodes,
is there any way to connect to the targets automatically on each reboot,
without typing "srp_daemon -e -o" each time?

Thanks in advance.

PN

From eitan at mellanox.co.il  Thu Dec  7 02:28:38 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 07 Dec 2006 12:28:38 +0200
Subject: [openib-general] osm: osmtest new flow of informinfo fails
Message-ID: <4577ECD6.2050906@mellanox.co.il>

Hi Hal,

All osmtest flows fail for me with the following error:
I start the log from the first inform info related message to give you 
the context.

Dec 07 11:48:17 656752 [B7FD48E0] -> osmtest_get_node_rec_by_lid: 
Getting node record for LID 0xFFFF
Dec 07 11:48:17 663592 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
Remote error:0x0C00 .
Dec 07 11:48:17 663642 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
Error on query (IB_REMOTE_ERROR)
Dec 07 11:48:17 663694 [B7FD48E0] -> osmtest_informinfo_request: ERR 
008F: ib_query failed (IB_REMOTE_ERROR)
Dec 07 11:48:17 663729 [B7FD48E0] -> osmtest_informinfo_request: Remote 
error = IB_MAD_STATUS_UNSUP_METHOD_ATTR
Dec 07 11:48:17 663759 [B7FD48E0] -> osmtest_informinfo_request: 
InformInfoRecord IS EXPECTED ERROR ^^^^
Dec 07 11:48:17 667671 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
Remote error:0x0C00 .
Dec 07 11:48:17 667705 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
Error on query (IB_REMOTE_ERROR)
Dec 07 11:48:17 667756 [B7FD48E0] -> osmtest_informinfo_request: ERR 
008F: ib_query failed (IB_REMOTE_ERROR)
Dec 07 11:48:17 667789 [B7FD48E0] -> osmtest_informinfo_request: Remote 
error = IB_MAD_STATUS_UNSUP_METHOD_ATTR
Dec 07 11:48:17 667820 [B7FD48E0] -> osmtest_informinfo_request: 
InformInfo IS EXPECTED ERROR ^^^^
Dec 07 11:48:17 669403 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
Remote error:0x0002 .
Dec 07 11:48:17 669436 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
Error on query (IB_REMOTE_ERROR)
Dec 07 11:48:17 669489 [B7FD48E0] -> osmtest_informinfo_request: ERR 
008F: ib_query failed (IB_REMOTE_ERROR)
Dec 07 11:48:17 669561 [B7FD48E0] -> osmtest_informinfo_request: Remote 
error = IB_SA_MAD_STATUS_REQ_INVALID
Dec 07 11:48:17 669590 [B7FD48E0] -> osmtest_informinfo_request: 
InformInfo UnSubscribe IS EXPECTED ERROR ^^^^
Dec 07 11:48:17 672731 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
Remote error:0x0002 .
Dec 07 11:48:17 672772 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
Error on query (IB_REMOTE_ERROR)
Dec 07 11:48:17 672826 [B7FD48E0] -> osmtest_informinfo_request: ERR 
008F: ib_query failed (IB_REMOTE_ERROR)
Dec 07 11:48:17 672859 [B7FD48E0] -> osmtest_informinfo_request: Remote 
error = IB_SA_MAD_STATUS_REQ_INVALID
Dec 07 11:48:17 672894 [B7FD48E0] -> osmtest_run: ERR 0146: SA 
validation database failure (IB_INSUFFICIENT_MEMORY)

OpenSM log says:
Dec 07 11:48:17 668513 [B57DABB0] -> osm_infr_rcv_process_set_method: 
ERR 4307: Failed to UnSubscribe to non existing inform object
Dec 07 11:48:17 671896 [B75DDBB0] -> osm_infr_rcv_process_set_method: 
ERR 4307: Failed to UnSubscribe to non existing inform object

Please let me know if you want me to debug it.

Eitan


From ramachandra.kuchimanchi at qlogic.com  Thu Dec  7 03:02:48 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 07 Dec 2006 16:32:48 +0530
Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from
 secondary path back to primary path
Message-ID: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>

This fixes a bug due to which failover from secondary path back to primary path
was not working.

Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
---

 drivers/infiniband/ulp/vnic/vnic_ib.c   |    4 +++-
 drivers/infiniband/ulp/vnic/vnic_main.c |    9 +++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.c b/drivers/infiniband/ulp/vnic/vnic_ib.c
index 6196e20..56ae9f7 100644
--- a/drivers/infiniband/ulp/vnic/vnic_ib.c
+++ b/drivers/infiniband/ulp/vnic/vnic_ib.c
@@ -303,10 +303,12 @@ int vnic_ib_get_path(struct netpath *net
 			       " path record query\n",
 			       config->path_info.status);
 
-		netpath_timer(netpath, vnic->config->no_path_timeout);
 		ret = config->path_info.status;
 	}
 out:
+	if (ret)
+		netpath_timer(netpath, vnic->config->no_path_timeout);
+
 	return ret;
 }
 
diff --git a/drivers/infiniband/ulp/vnic/vnic_main.c b/drivers/infiniband/ulp/vnic/vnic_main.c
index fca2b90..e15d3f9 100644
--- a/drivers/infiniband/ulp/vnic/vnic_main.c
+++ b/drivers/infiniband/ulp/vnic/vnic_main.c
@@ -710,17 +710,18 @@ static struct vnic * vnic_handle_npevent
 	case VNIC_PRINP_TIMEREXPIRED:
 		netpath = &vnic->primary_path;
 		netpath->timer_state = NETPATH_TS_EXPIRED;
-		if (netpath->carrier)
+		if (!netpath->carrier)
 			update_path_and_reconnect(netpath, vnic);
 		break;
 	case VNIC_SECNP_TIMEREXPIRED:
 		netpath = &vnic->secondary_path;
 		netpath->timer_state = NETPATH_TS_EXPIRED;
-		if (netpath->carrier) {
+		if (!netpath->carrier)
+			update_path_and_reconnect(netpath, vnic);
+		else {
 			if (vnic->state == VNIC_UNINITIALIZED)
 				vnic_npevent_register(vnic, netpath);
-		} else
-			update_path_and_reconnect(netpath, vnic);
+		}
 		break;
 	case VNIC_PRINP_LINKUP:
 		vnic->primary_path.carrier = 1;


From ramachandra.kuchimanchi at qlogic.com  Thu Dec  7 03:03:30 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra K)
Date: Thu, 07 Dec 2006 16:33:30 +0530
Subject: [openib-general] [PATCH 2/2 vex branch] IB/VNIC Fix failover delay
	issue
Message-ID: <4578425A.27226.250CE6A4@ramachandra.kuchimanchi.qlogic.com>

This reduces the delay in failover from one path to another.

When a path is lost, the control and data connections of that path
are cleaned up. As part of this a CM DREQ was being sent and we waited
for a DREP. During this time the viport thread was blocked which delayed
sending of a CONFIG_LINK request to the VEx for the other path. Due
to this, there was considerable delay in the failover path becoming
active. To fix this, send a DREQ but do not wait for a DREP from
the VEx. We need not worry about a DREQ being lost because the
VEx will anyway terminate a connection if it does not receive heartbeats.

Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
---

 drivers/infiniband/ulp/vnic/vnic_control.c |    4 +---
 drivers/infiniband/ulp/vnic/vnic_data.c    |    3 ---
 2 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/vnic/vnic_control.c b/drivers/infiniband/ulp/vnic/vnic_control.c
index b6a3e7f..2c55540 100644
--- a/drivers/infiniband/ulp/vnic/vnic_control.c
+++ b/drivers/infiniband/ulp/vnic/vnic_control.c
@@ -1450,12 +1450,10 @@ void control_cleanup(struct control *con
 {
 	CONTROL_FUNCTION("%s: control_disconnect()\n",
 			 control_ifcfg_name(control));
-	init_completion(&control->ib_conn.done);
 
 	if (ib_send_cm_dreq(control->ib_conn.cm_id, NULL, 0))
 		printk(KERN_DEBUG "control CM DREQ sending failed\n");
-	else
-		wait_for_completion(&control->ib_conn.done);
+
 	control_timer_stop(control);
 	ib_destroy_cm_id(control->ib_conn.cm_id);
 	ib_destroy_qp(control->ib_conn.qp);
diff --git a/drivers/infiniband/ulp/vnic/vnic_data.c b/drivers/infiniband/ulp/vnic/vnic_data.c
index 0ce81f3..c1d056a 100644
--- a/drivers/infiniband/ulp/vnic/vnic_data.c
+++ b/drivers/infiniband/ulp/vnic/vnic_data.c
@@ -666,11 +666,8 @@ void data_disconnect(struct data *data)
 
 void data_cleanup(struct data *data)
 {
-	init_completion(&data->ib_conn.done);
 	if (ib_send_cm_dreq(data->ib_conn.cm_id, NULL, 0))
 		printk(KERN_DEBUG "data CM DREQ sending failed\n");
-	else
-		wait_for_completion(&data->ib_conn.done);
 
 	ib_destroy_cm_id(data->ib_conn.cm_id);
 	ib_destroy_qp(data->ib_conn.qp);


From eeb at bartonsoftware.com  Thu Dec  7 03:04:22 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Thu, 7 Dec 2006 11:04:22 GMT
Subject: [openib-general] version #defines for the kernel
Message-ID: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>


Hi,

I found out there has been a change in the kernel ib_fmr_pool_map_phys() to pass the
last parameter by value rather than by address.   I can cope with either version
by coding...  

+#if IB_USER_VERBS_ABI_VERSION < 6
         fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool,
                                    tx->tx_pages, npages,
                                    &rd->rd_addr);
+#else
+        fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool,
+                                   tx->tx_pages, npages,
+                                   rd->rd_addr);
+#endif

...but is this the right thing to do?  It's the "USER" in
IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code.

Actually a single OFED version #define would most probably suit my purposes -
is that controversial?

-- 

                Cheers,
                        Eric


From ogerlitz at voltaire.com  Thu Dec  7 03:43:27 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 07 Dec 2006 13:43:27 +0200
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
Message-ID: <4577FE5F.90407@voltaire.com>

Eric Barton wrote:
> Hi,
> 
> I found out there has been a change in the kernel ib_fmr_pool_map_phys() to pass the
> last parameter by value rather than by address.   I can cope with either version
> by coding...  
> 
> +#if IB_USER_VERBS_ABI_VERSION < 6
>          fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool,
>                                     tx->tx_pages, npages,
>                                     &rd->rd_addr);
> +#else
> +        fmr = ib_fmr_pool_map_phys(kiblnd_data.kib_fmrpool,
> +                                   tx->tx_pages, npages,
> +                                   rd->rd_addr);
> +#endif
> 
> ...but is this the right thing to do?  It's the "USER" in
> IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code.

Indeed, it has nothing to do with the user/kernel ABI. The FMR verbs are
only exposed to kernel-space consumers, and the same goes for the FMR pool.

The ib_fmr_pool_map_phys() API change was made in the 2.6.18 cycle.
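
One possible way to key the choice on the kernel version instead of the user verbs ABI is sketched below as a self-contained userspace mock. It assumes the switch sits exactly at 2.6.18 and that the pointer form is the older one, as Eric's snippet suggests; a distribution or OFED backport may not follow the mainline version number, so treat this as an illustration only. All SKETCH_*/sketch_* names are hypothetical; real kernel code would compare LINUX_VERSION_CODE against KERNEL_VERSION(2,6,18) from <linux/version.h> and call ib_fmr_pool_map_phys() directly.

```c
#include <assert.h>
#include <stdint.h>

/* Same encoding as the kernel's KERNEL_VERSION() macro. */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Hypothetical stand-in for LINUX_VERSION_CODE; override with -D to test. */
#ifndef SKETCH_VERSION_CODE
#define SKETCH_VERSION_CODE SKETCH_KERNEL_VERSION(2, 6, 19)
#endif

#if SKETCH_VERSION_CODE < SKETCH_KERNEL_VERSION(2, 6, 18)
/* Older call shape (assumed): last argument passed by address. */
static uint64_t sketch_map_phys(uint64_t *io_addr) { return *io_addr; }
#define SKETCH_MAP_PHYS(addr) sketch_map_phys(&(addr))
#else
/* Newer call shape (assumed): last argument passed by value. */
static uint64_t sketch_map_phys(uint64_t io_addr) { return io_addr; }
#define SKETCH_MAP_PHYS(addr) sketch_map_phys(addr)
#endif

/* Callers use one spelling; the macro hides the difference. */
uint64_t sketch_map(uint64_t rd_addr)
{
	return SKETCH_MAP_PHYS(rd_addr);
}

void sketch_map_example(void)
{
	assert(sketch_map(0x1000) == 0x1000);
}
```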

Or.


From halr at voltaire.com  Thu Dec  7 04:07:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 07:07:43 -0500
Subject: [openib-general] osm: osmtest new flow of informinfo fails
In-Reply-To: <4577ECD6.2050906@mellanox.co.il>
References: <4577ECD6.2050906@mellanox.co.il>
Message-ID: <1165493190.25587.182601.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-07 at 05:28, Eitan Zahavi wrote:
> Hi Hal,
> 
> All osmtest flows fail for me with the following error:

By all flows, you mean osmtest -a (the all flows test).

> I start the log from the first inform info related message to give you 
> the context.
> 
> Dec 07 11:48:17 656752 [B7FD48E0] -> osmtest_get_node_rec_by_lid: 
> Getting node record for LID 0xFFFF
> Dec 07 11:48:17 663592 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
> Remote error:0x0C00 .
> Dec 07 11:48:17 663642 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
> Error on query (IB_REMOTE_ERROR)
> Dec 07 11:48:17 663694 [B7FD48E0] -> osmtest_informinfo_request: ERR 
> 008F: ib_query failed (IB_REMOTE_ERROR)
> Dec 07 11:48:17 663729 [B7FD48E0] -> osmtest_informinfo_request: Remote 
> error = IB_MAD_STATUS_UNSUP_METHOD_ATTR
> Dec 07 11:48:17 663759 [B7FD48E0] -> osmtest_informinfo_request: 
> InformInfoRecord IS EXPECTED ERROR ^^^^
> Dec 07 11:48:17 667671 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
> Remote error:0x0C00 .
> Dec 07 11:48:17 667705 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
> Error on query (IB_REMOTE_ERROR)
> Dec 07 11:48:17 667756 [B7FD48E0] -> osmtest_informinfo_request: ERR 
> 008F: ib_query failed (IB_REMOTE_ERROR)
> Dec 07 11:48:17 667789 [B7FD48E0] -> osmtest_informinfo_request: Remote 
> error = IB_MAD_STATUS_UNSUP_METHOD_ATTR
> Dec 07 11:48:17 667820 [B7FD48E0] -> osmtest_informinfo_request: 
> InformInfo IS EXPECTED ERROR ^^^^
> Dec 07 11:48:17 669403 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
> Remote error:0x0002 .
> Dec 07 11:48:17 669436 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
> Error on query (IB_REMOTE_ERROR)
> Dec 07 11:48:17 669489 [B7FD48E0] -> osmtest_informinfo_request: ERR 
> 008F: ib_query failed (IB_REMOTE_ERROR)
> Dec 07 11:48:17 669561 [B7FD48E0] -> osmtest_informinfo_request: Remote 
> error = IB_SA_MAD_STATUS_REQ_INVALID
> Dec 07 11:48:17 669590 [B7FD48E0] -> osmtest_informinfo_request: 
> InformInfo UnSubscribe IS EXPECTED ERROR ^^^^
> Dec 07 11:48:17 672731 [B6BD1BB0] -> __osmv_sa_mad_rcv_cb: ERR 0501: 
> Remote error:0x0002 .
> Dec 07 11:48:17 672772 [B6BD1BB0] -> osmtest_query_res_cb: ERR 0003: 
> Error on query (IB_REMOTE_ERROR)
> Dec 07 11:48:17 672826 [B7FD48E0] -> osmtest_informinfo_request: ERR 
> 008F: ib_query failed (IB_REMOTE_ERROR)
> Dec 07 11:48:17 672859 [B7FD48E0] -> osmtest_informinfo_request: Remote 
> error = IB_SA_MAD_STATUS_REQ_INVALID
> Dec 07 11:48:17 672894 [B7FD48E0] -> osmtest_run: ERR 0146: SA 
> validation database failure (IB_INSUFFICIENT_MEMORY)

This is a failure of the first subscribe.

> OpenSM log says:
> Dec 07 11:48:17 668513 [B57DABB0] -> osm_infr_rcv_process_set_method: 
> ERR 4307: Failed to UnSubscribe to non existing inform object
> Dec 07 11:48:17 671896 [B75DDBB0] -> osm_infr_rcv_process_set_method: 
> ERR 4307: Failed to UnSubscribe to non existing inform object

The first one is correct. The second one is due to bad treatment of the
valid subscribe. Evidently, it is now somehow being treated as an
unsubscribe rather than a subscribe. Can you run opensm with -V to see
all the log messages? That will give a better indication of what path it
is taking in osm_infr_rcv_process_set_method. Thanks.

> Please let me know if you want me to debug it.

This works for me. Not sure what is different.

-- Hal

> Eitan


From shubbell at dbresearch.net  Thu Dec  7 04:08:37 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Thu, 07 Dec 2006 06:08:37 -0600
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <1165441086.25587.144751.camel@hal.voltaire.com>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
	<4577108F.9080308@dbresearch.net>
	<1165435407.25587.141052.camel@hal.voltaire.com>
	<457730C5.9000902@dbresearch.net>
	<1165441086.25587.144751.camel@hal.voltaire.com>
Message-ID: <45780445.5010200@dbresearch.net>

Hal Rosenstock wrote:
> On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote:
>   
>> Hal Rosenstock wrote:
>>     
>>> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote:
>>>   
>>>       
>>>> Hal Rosenstock wrote:
>>>>     
>>>>         
>>>>> Hi Sean,
>>>>>
>>>>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
>>>>>   
>>>>>       
>>>>>           
>>>>>> Hello,
>>>>>>
>>>>>>   I was testing our code and noticed that when I send data using 
>>>>>> multicast over our ib0 interface, all of the infiniband switches route 
>>>>>> the data to each switch and each node instead of a node that has an 
>>>>>> application listening to the interface like Ethernet. Is this by design?
>>>>>>     
>>>>>>         
>>>>>>             
>>>>> It depends on what multicast group is being used and which end nodes
>>>>> have registered for that group as to where the data is routed.
>>>>>
>>>>> -- Hal
>>>>>   
>>>>>       
>>>>>           
>>>> Hey Hal,
>>>>
>>>>   The multicast group I am sending data to is 224.10.10.x (not 
>>>> 224.0.0.x) and I have no clients / nodes listening but the data is still 
>>>> being sent.
>>>>     
>>>>         
>>> Yes, if there is only a sender, the data should not be routed anywhere.
>>>
>>>   
>>>       
>>>>  I am using wwtop from warewulf to view the network load for 
>>>> each node.
>>>>     
>>>>         
>>> I'm not familiar with those tools.
>>>
>>>   
>>>       
>>>>  Does this make sense?
>>>>     
>>>>         
>>> Nope. To state the obvious, something is not as it seems...
>>>
>>> Can you state which SM you are using ?
>>>
>>> Also, can you do the following:
>>> saquery -g
>>> saquery -m
>>> and send me the output.
>>>
>>> I may have some more experiments once I get that level of info.
>>>
>>> -- Hal
>>>   
>>>       
>> We have a Voltaire HW subnet manager. I do not have the saquery command. 
>> I'll have to find this and install it.
>>     
>
> What is running on your end nodes ? Is it OpenIB/OFED or something else
> ? If it is OpenIB/OFED, saquery should be there. I think OFED 1.2
> supports the options I mentioned.
>
>   
>>  Would the web interface help?
>>     
>
> Not sure whether there is anything there for this.
>
> -- Hal
>
>   
>> Sean
>>     
>
>
>   
Hal,

  Here are the results:

The result of saquery -g on our head node:

[root at neptune ~]# saquery -g

MCMemberRecord group dump:
        MGID....................0xff12401bffff0000 : 0x00000000ffffffff
        Mlid....................0xC000
        Mtu.....................0x4
        pkey....................0xFFFF
        Rate....................0x3

MCMemberRecord group dump:
        MGID....................0xff12401bffff0000 : 0x0000000000000001
        Mlid....................0xC001
        Mtu.....................0x4
        pkey....................0xFFFF
        Rate....................0x3

The result of saquery -m on our root node:

Query SA failed: IB_TIMEOUT

Running package openib-diags-1.1.0-0

Sean


From halr at voltaire.com  Thu Dec  7 04:24:01 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 07:24:01 -0500
Subject: [openib-general] Multicast Group Routing Question
In-Reply-To: <45780445.5010200@dbresearch.net>
References: <45770372.8010700@dbresearch.net>
	<1165429589.25587.136986.camel@hal.voltaire.com>
	<4577108F.9080308@dbresearch.net>
	<1165435407.25587.141052.camel@hal.voltaire.com>
	<457730C5.9000902@dbresearch.net>
	<1165441086.25587.144751.camel@hal.voltaire.com>
	<45780445.5010200@dbresearch.net>
Message-ID: <1165494213.25587.183122.camel@hal.voltaire.com>

On Thu, 2006-12-07 at 07:08, Sean Hubbell wrote:
> Hal Rosenstock wrote:
> > On Wed, 2006-12-06 at 16:06, Sean Hubbell wrote:
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> On Wed, 2006-12-06 at 13:48, Sean Hubbell wrote:
> >>>   
> >>>       
> >>>> Hal Rosenstock wrote:
> >>>>     
> >>>>         
> >>>>> Hi Sean,
> >>>>>
> >>>>> On Wed, 2006-12-06 at 12:52, Sean Hubbell wrote:
> >>>>>   
> >>>>>       
> >>>>>           
> >>>>>> Hello,
> >>>>>>
> >>>>>>   I was testing our code and noticed that when I send data using 
> >>>>>> multicast over our ib0 interface, all of the infiniband switches route 
> >>>>>> the data to each switch and each node instead of a node that has an 
> >>>>>> application listening to the interface like Ethernet. Is this by design?
> >>>>>>     
> >>>>>>         
> >>>>>>             
> >>>>> It depends on what multicast group is being used and which end nodes
> >>>>> have registered for that group as to where the data is routed.
> >>>>>
> >>>>> -- Hal
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> Hey Hal,
> >>>>
> >>>>   The multicast group I am sending data to is 224.10.10.x (not 
> >>>> 224.0.0.x) and I have no clients / nodes listening but the data is still 
> >>>> being sent.
> >>>>     
> >>>>         
> >>> Yes, if there is only a sender, the data should not be routed anywhere.
> >>>
> >>>   
> >>>       
> >>>>  I am using wwtop from warewulf to view the network load for 
> >>>> each node.
> >>>>     
> >>>>         
> >>> I'm not familiar with those tools.
> >>>
> >>>   
> >>>       
> >>>>  Does this make sense?
> >>>>     
> >>>>         
> >>> Nope. To state the obvious, something is not as it seems...
> >>>
> >>> Can you state which SM you are using ?
> >>>
> >>> Also, can you do the following:
> >>> saquery -g
> >>> saquery -m
> >>> and send me the output.
> >>>
> >>> I may have some more experiments once I get that level of info.
> >>>
> >>> -- Hal
> >>>   
> >>>       
> >> We have a Voltaire HW subnet manager. I do not have the saquery command. 
> >> I'll have to find this and install it.
> >>     
> >
> > What is running on your end nodes ? Is it OpenIB/OFED or something else
> > ? If it is OpenIB/OFED, saquery should be there. I think OFED 1.2
> > supports the options I mentioned.
> >
> >   
> >>  Would the web interface help?
> >>     
> >
> > Not sure whether there is anything there for this.
> >
> > -- Hal
> >
> >   
> >> Sean
> >>     
> >
> >
> >   
> Hal,
> 
>   Here are the results:
> 
> The result of saquery -g on our head node:
> 
> [root at neptune ~]# saquery -g
> 
> MCMemberRecord group dump:
>         MGID....................0xff12401bffff0000 : 0x00000000ffffffff
>         Mlid....................0xC000
>         Mtu.....................0x4
>         pkey....................0xFFFF
>         Rate....................0x3
> 
> MCMemberRecord group dump:
>         MGID....................0xff12401bffff0000 : 0x0000000000000001
>         Mlid....................0xC001
>         Mtu.....................0x4
>         pkey....................0xFFFF
>         Rate....................0x3

I don't see the mgrp for 224.10.10.x here.
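
For reference, if the sender goes through IPoIB, an IPv4 multicast
address is mapped onto an MGID as described in RFC 4391: the ff12:401b
prefix, the 16-bit P_Key, and the low 28 bits of the group address. A
minimal C sketch of that mapping follows. The names mgid64 and
ipoib_ipv4_mgid are made up for illustration, and the sample address
224.10.10.1 is an assumption (the thread only gives 224.10.10.x).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative type: the two 64-bit halves as saquery prints them. */
struct mgid64 {
	uint64_t hi;
	uint64_t lo;
};

/*
 * Sketch of the RFC 4391 IPv4-multicast-to-MGID mapping.
 * ipv4 is the group address in host byte order; pkey is the
 * partition key (0xFFFF in the dump quoted above). The scope
 * nibble is fixed to 2 (link-local), matching the ff12 prefix.
 */
static struct mgid64 ipoib_ipv4_mgid(uint32_t ipv4, uint16_t pkey)
{
	struct mgid64 g;

	/* Only class D (224.0.0.0/4) addresses are multicast groups. */
	assert((ipv4 >> 28) == 0xEu);

	g.hi = 0xff12401b00000000ULL | ((uint64_t)pkey << 16);
	g.lo = (uint64_t)(ipv4 & 0x0fffffffU);
	return g;
}
```

With the assumed address 224.10.10.1 (0xE00A0A01) and P_Key 0xFFFF this
gives 0xff12401bffff0000 : 0x00000000000a0a01, and 224.0.0.1 gives the
...0000000000000001 group that does appear in the dump. The
...00000000ffffffff entry is the IPoIB broadcast group, which this
mapping does not produce.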

> The result of saquery -m on our root node:
> 
> Query SA failed: IB_TIMEOUT

This failure can be valid and is SM dependent.

-- Hal

> Running package openib-diags-1.1.0-0
> 
> Sean


From mst at mellanox.co.il  Thu Dec  7 05:29:56 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Dec 2006 15:29:56 +0200
Subject: [openib-general] potential multicast module issue (was Fwd: FW:
 IPoIB on c0-6 and c0-7: problem creating mcgroup)
Message-ID: <20061207132956.GC2614@mellanox.co.il>

Sean, FYI.

----- Forwarded message from Yohad Dickman <yohadd at mellanox.co.il> -----

Subject: FW: IPoIB on c0-6 and c0-7: problem creating mcgroup
Date: Thu, 7 Dec 2006 10:07:01 +0200
From: Yohad Dickman <yohadd at mellanox.co.il>

Hi Michael,
 
Yesterday, when I ran regression on the gen2_devel driver with the multicast patches, opensm reported errors on multicast join (described below).
 
Can you check it?
 
Thx,
Yohad
 
-----Original Message-----
From: Yevgeny Kliteynik 
Sent: Wednesday, December 06, 2006 7:03 PM
To: Yohad Dickman; Yevgeny Kliteynik
Subject: IPoIB on c0-6 and c0-7: problem creating mcgroup


c0-7 (port 2) is trying to create mgroup, but the component mask is missing some bits:
 
__osm_mcmr_rcv_join_mgrp: ERR 1B11: 
method = SubnAdmSet, 
scope_state = 0x1, 
component mask = 0x0000000000010083, 
expected comp mask = 0x00000000000130c7, 
MGID: 0xff12601bffff0000 : 0x0000000000000002 
from port 0x0002c90200209622
 
Missing bits in component mask for creating mcgroup:
 
IB_MCR_COMPMASK_QKEY    
IB_MCR_COMPMASK_TCLASS  
IB_MCR_COMPMASK_SL      
IB_MCR_COMPMASK_FLOW    
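
The missing bits listed above follow from masking the expected component
mask with the one actually received. A small C sketch of that arithmetic
is below. The bit positions are written in host order from the
MCMemberRecord component mask layout (OpenSM's own IB_MCR_COMPMASK_*
constants are stored in network byte order), and the helper names are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mcr_bit {
	uint64_t bit;
	const char *name;
};

/* Only the components that occur in the expected mask 0x130c7. */
static const struct mcr_bit mcr_bits[] = {
	{ 1ULL << 0,  "MGID" },
	{ 1ULL << 1,  "PORT_GID" },
	{ 1ULL << 2,  "QKEY" },
	{ 1ULL << 6,  "TCLASS" },
	{ 1ULL << 7,  "PKEY" },
	{ 1ULL << 12, "SL" },
	{ 1ULL << 13, "FLOW" },
	{ 1ULL << 16, "JOIN_STATE" },
};

/* Bits that are required but were not set by the requester. */
static uint64_t missing_comp_mask(uint64_t got, uint64_t expected)
{
	return expected & ~got;
}

/* Name of a single component bit, or "UNKNOWN" if not in the table. */
static const char *comp_bit_name(uint64_t bit)
{
	size_t i;

	/* Exactly one bit must be set. */
	assert(bit != 0 && (bit & (bit - 1)) == 0);

	for (i = 0; i < sizeof(mcr_bits) / sizeof(mcr_bits[0]); i++)
		if (mcr_bits[i].bit == bit)
			return mcr_bits[i].name;
	return "UNKNOWN";
}
```

For the values in the log, missing_comp_mask(0x10083, 0x130c7) is
0x3044, that is bits 2, 6, 12 and 13: QKEY, TCLASS, SL and FLOW.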
   
   
Regards,
 
Yevgeny Kliteynik
 
Mellanox Technologies LTD
Tel: +972-4-909-7200 ext: 394
Fax: +972-4-959-3245
P.O. Box 586 Yokneam 20692 ISRAEL 
 

----- End forwarded message -----

-- 
MST


From halr at voltaire.com  Thu Dec  7 05:33:39 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 08:33:39 -0500
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or
 switch]
In-Reply-To: <4576C33C.7050204@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il>
Message-ID: <1165498375.25587.185801.camel@hal.voltaire.com>

On Wed, 2006-12-06 at 08:18, Eitan Zahavi wrote:
> Hi Hal,
> 
> Here is the same patch against GIT for your convenience.
> 
> Thanks
> 
> EZ
> 
> The simulated regression caught this:
> The osm.mcfdbs have now the format:
> Switch 0x0002c90000000006
> LID    : Out Port(s)
> 0xC000 : 0x003  0x004  0x005  0x006
> 0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 
> :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 
> :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B 
> :0xC01C :0xC01D :0xC01E :0xC01F :
> 
> Which should probably just be:
> Switch 0x0002c90000000006
> LID    : Out Port(s)
> 0xC000 : 0x003  0x004  0x005  0x006
> 
> Actually switches that do not have any MCG entry will not be included
> in the dump file.
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

Thanks. Applied.
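
The dump behaviour described in the quoted patch note (skip MLIDs with
no out ports, and skip switches with no multicast entries at all) can be
sketched roughly as follows. This is not the OpenSM code: dump_mcfdb and
the port-mask representation are hypothetical, and only the output
layout is taken from the example above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define FIRST_MLID 0xC000u

/*
 * Hypothetical model: port_masks[i] is a bitmask of the out ports for
 * MLID (FIRST_MLID + i); bit p set means port p forwards that group.
 * Returns the number of MLID lines written. A switch with no
 * multicast entries produces no output at all.
 */
static unsigned dump_mcfdb(FILE *f, uint64_t guid,
			   const uint32_t *port_masks, unsigned n_mlids)
{
	unsigned lines = 0;
	unsigned i, p;

	assert(f != NULL && port_masks != NULL);

	for (i = 0; i < n_mlids; i++) {
		if (port_masks[i] == 0)
			continue;	/* skip "empty" MLIDs */

		if (lines == 0) {	/* header only once an entry exists */
			fprintf(f, "Switch 0x%016" PRIx64 "\n", guid);
			fprintf(f, "LID    : Out Port(s)\n");
		}

		fprintf(f, "0x%04X :", FIRST_MLID + i);
		for (p = 0; p < 32; p++)
			if (port_masks[i] & (UINT32_C(1) << p))
				fprintf(f, " 0x%03X ", p);
		fprintf(f, "\n");
		lines++;
	}
	return lines;
}
```

Calling it with a mask of 0x78 (ports 3 to 6) for the first MLID and
zero for the rest writes a single 0xC000 line; a switch whose masks are
all zero writes nothing, not even the header.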

-- Hal


From mst at mellanox.co.il  Thu Dec  7 07:03:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 7 Dec 2006 17:03:04 +0200
Subject: [openib-general] [PATCH untested] mthca: map all MTTs/MPTs for FMR
	on 64 bit
Message-ID: <20061207150304.GD2614@mellanox.co.il>

We currently reserve separate MPT and MTT space for FMRs to avoid
abusing the vmalloc space on 32 bit systems. No such problem exists
on 64 bit systems so let's not do it there.

This mapping will also make writing MTTs for regular regions directly from the
driver easier in the future.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

Roland, this is untested. Could you take a look please?
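
For readers following the patch: the FMR reservation is rounded up to a
power of two with i = fls(n - 1) and then 1 << i. A userspace sketch of
that rounding is below; fls_sketch is a stand-in written for this
example (the kernel provides its own fls()), and round_up_pow2 is a
made-up helper name.

```c
#include <assert.h>

/*
 * Position of the most significant set bit, counting from 1;
 * returns 0 for x == 0. This mirrors the kernel fls() convention.
 */
static int fls_sketch(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Smallest power of two that is >= n, for n >= 1. */
static unsigned int round_up_pow2(unsigned int n)
{
	int i;

	assert(n >= 1);
	i = fls_sketch(n - 1);
	assert(i < 31);		/* same limit the driver checks for */
	return 1U << i;
}
```

So a request for 5 reserved entries maps 8, while a request that is
already a power of two is left unchanged.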

diff --git a/drivers/infiniband/hw/mthca/mthca_mr.c b/drivers/infiniband/hw/mthca/mthca_mr.c
index f71ffa8..3064002 100644
--- a/drivers/infiniband/hw/mthca/mthca_mr.c
+++ b/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -761,7 +761,7 @@ void mthca_arbel_fmr_unmap(struct mthca_dev *dev, struct mthca_fmr *fmr)
 int mthca_init_mr_table(struct mthca_dev *dev)
 {
 	unsigned long addr;
-	int err, i;
+	int mpts, mtts, err, i;
 
 	err = mthca_alloc_init(&dev->mr_table.mpt_alloc,
 			       dev->limits.num_mpts,
@@ -789,19 +789,26 @@ int mthca_init_mr_table(struct mthca_dev *dev)
 
 	if (dev->limits.fmr_reserved_mtts) {
 		i = fls(dev->limits.fmr_reserved_mtts - 1);
-
 		if (i >= 31) {
 			mthca_warn(dev, "Unable to reserve 2^31 FMR MTTs.\n");
 			err = -EINVAL;
 			goto err_fmr_mpt;
 		}
+		mpts = mtts = 1 << i;
+	} else {
+		mpts = dev->limits.num_mtt_segs;
+		mtts = dev->limits.num_mpts;
+	}
+
+	if (!mthca_is_memfree(dev) &&
+	    (dev->mthca_flags & MTHCA_FLAG_FMR)) {
 
 		addr = pci_resource_start(dev->pdev, 4) +
 			((pci_resource_len(dev->pdev, 4) - 1) &
 			 dev->mr_table.mpt_base);
 
 		dev->mr_table.tavor_fmr.mpt_base =
-			ioremap(addr, (1 << i) * sizeof(struct mthca_mpt_entry));
+			ioremap(addr, mpts * sizeof(struct mthca_mpt_entry));
 
 		if (!dev->mr_table.tavor_fmr.mpt_base) {
 			mthca_warn(dev, "MPT ioremap for FMR failed.\n");
@@ -814,19 +821,21 @@ int mthca_init_mr_table(struct mthca_dev *dev)
 			 dev->mr_table.mtt_base);
 
 		dev->mr_table.tavor_fmr.mtt_base =
-			ioremap(addr, (1 << i) * MTHCA_MTT_SEG_SIZE);
+			ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE);
 		if (!dev->mr_table.tavor_fmr.mtt_base) {
 			mthca_warn(dev, "MTT ioremap for FMR failed.\n");
 			err = -ENOMEM;
 			goto err_fmr_mtt;
 		}
+	}
 
-		err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i);
+	if (dev->limits.fmr_reserved_mtts) {
+		err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, mtts);
 		if (err)
 			goto err_fmr_mtt_buddy;
 
 		/* Prevent regular MRs from using FMR keys */
-		err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i);
+		err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, mtts);
 		if (err)
 			goto err_reserve_fmr;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_profile.c b/drivers/infiniband/hw/mthca/mthca_profile.c
index 58d44aa..26bf86d 100644
--- a/drivers/infiniband/hw/mthca/mthca_profile.c
+++ b/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -277,7 +277,7 @@ u64 mthca_make_profile(struct mthca_dev *dev,
 	 * out of the MR pool. They don't use additional memory, but
 	 * we assign them as part of the HCA profile anyway.
 	 */
-	if (mthca_is_memfree(dev))
+	if (mthca_is_memfree(dev) || BITS_PER_LONG == 64)
 		dev->limits.fmr_reserved_mtts = 0;
 	else
 		dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;


-- 
MST


From eitan at mellanox.co.il  Thu Dec  7 07:06:57 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 07 Dec 2006 17:06:57 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or
 switch]
In-Reply-To: <1165498375.25587.185801.camel@hal.voltaire.com>
References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il>
	<1165498375.25587.185801.camel@hal.voltaire.com>
Message-ID: <45782E11.5010400@mellanox.co.il>

Hi Hal,

Great thanks.
Applying the patch helps.
Now I am stuck behind another issue introduced by the latest patch of 
incremental Set(LFT).
I will describe it in separate mail

Eitan

Hal Rosenstock wrote:
> On Wed, 2006-12-06 at 08:18, Eitan Zahavi wrote:
>   
>> Hi Hal,
>>
>> Here is the same patch against GIT for your convenience.
>>
>> Thanks
>>
>> EZ
>>
>> The simulated regression caught this:
>> The osm.mcfdbs have now the format:
>> Switch 0x0002c90000000006
>> LID    : Out Port(s)
>> 0xC000 : 0x003  0x004  0x005  0x006
>> 0xC001 :0xC002 :0xC003 :0xC004 :0xC005 :0xC006 :0xC007 :0xC008 :0xC009 
>> :0xC00A :0xC00B :0xC00C :0xC00D :0xC00E :0xC00F :0xC010 :0xC011 :0xC012 
>> :0xC013 :0xC014 :0xC015 :0xC016 :0xC017 :0xC018 :0xC019 :0xC01A :0xC01B 
>> :0xC01C :0xC01D :0xC01E :0xC01F :
>>
>> Which should probably just be:
>> Switch 0x0002c90000000006
>> LID    : Out Port(s)
>> 0xC000 : 0x003  0x004  0x005  0x006
>>
>> Actually switches that do not have any MCG entry will not be included
>> in the dump file.
>>
>> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
>>     
>
> Thanks. Applied.
>
> -- Hal
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From eitan at mellanox.co.il  Thu Dec  7 07:12:59 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 07 Dec 2006 17:12:59 +0200
Subject: [openib-general] [PATCH] osm: Routing Tables are full of
 UNREACHABLE instead of real route
Message-ID: <45782F7B.1010408@mellanox.co.il>

Hi Hal,

I resolved the mystery behind the osm.fdbs that is now full of 
UNREACHABLE instead of correct out ports.

The problem is a consequence of the new code, which does not use the
switch LFT blocks for the intermediate LFT assignments.
The incremental-update scheme relies on a temporary buffer that the
routing algorithm fills. A block is then sent on the wire only if it
differs between the switch LFT tables (from the SMDB) and the
temporary buffer.
So the switch LFT tables are not being directly updated by the routing 
algorithm - but only by the GetResp obtained as
reply to the setting. Until this stage of the description - everything 
looks right.

But what is wrong is that the dump of LFT tables is invoked before the
GetResp is obtained.
So if only a single sweep is invoked, the resulting osm.fdbs shows the
original state of the SMDB tables, which is full of 0xFF = UNREACHABLE.

The patch below takes the easy way and should probably be revisited.
Instead of having a separate algorithm step for dumping out the
resulting GetResp data after all LFT responses were obtained, it just
copies the sent LFT blocks to the SMDB.

I think we need to have at least this simple patch until the dump is
moved to a new algorithm step.
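
The flow described above (fill a temporary buffer, send only the blocks
that differ, then record what was sent) can be sketched in C as follows.
This is an illustration only: struct sw_lft, lft_push_diffs and
send_always_ok are hypothetical names, not OpenSM APIs. Only the
64-entry block size is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE 64	/* LIDs per linear forwarding table block */

/* Hypothetical stand-in for a switch's LFT as cached in the SM database. */
struct sw_lft {
	uint8_t *smdb;		/* num_blocks * LFT_BLOCK_SIZE entries */
	unsigned num_blocks;
};

/* Hypothetical send hook: returns 0 when the Set(LFT) MAD was sent. */
typedef int (*send_block_fn)(unsigned block, const uint8_t *data);

/* Stub used for experiments: pretends every send succeeds. */
static int send_always_ok(unsigned block, const uint8_t *data)
{
	(void)block;
	(void)data;
	return 0;
}

/*
 * Send only the blocks where new_lft differs from the cached table.
 * On a successful send the cached copy is updated at once, which is
 * the interim workaround: it assumes the switch accepted the block
 * instead of waiting for the GetResp. Returns the number of blocks sent.
 */
static unsigned lft_push_diffs(struct sw_lft *sw, const uint8_t *new_lft,
			       send_block_fn send)
{
	unsigned sent = 0;
	unsigned b;

	assert(sw != NULL && new_lft != NULL && send != NULL);

	for (b = 0; b < sw->num_blocks; b++) {
		uint8_t *cur = sw->smdb + b * LFT_BLOCK_SIZE;
		const uint8_t *want = new_lft + b * LFT_BLOCK_SIZE;

		if (memcmp(cur, want, LFT_BLOCK_SIZE) == 0)
			continue;	/* unchanged: nothing to send */
		if (send(b, want) != 0)
			continue;	/* send failed: keep the old block */

		memcpy(cur, want, LFT_BLOCK_SIZE);
		sent++;
	}
	return sent;
}
```

With the cached table updated right after the send, a dump taken in the
same sweep shows the new routes rather than 0xFF entries. The cost is
that the cache can be wrong if the switch later rejects a block.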

Thanks
Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
=====================================================================

diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index 5a55da8..3a62c7f 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table(
                "osm_ucast_mgr_set_fwd_table: ERR 3A05: "
                "Sending linear fwd. tbl. block failed (%s)\n",
                ib_get_err_str( status ) );
-    }
+    } else {
+       /*
+         HACK: for now we will assume we succeeded to send
+         and set the local DB based on it. This should allow
+         us to immediately dump out our routing
+       */
+       osm_switch_set_ft_block(
+          p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho);
+        }
   }

   OSM_LOG_EXIT( p_mgr->p_log );


From tziporet at dev.mellanox.co.il  Thu Dec  7 07:35:59 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 07 Dec 2006 17:35:59 +0200
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
 RDMA CM etc
In-Reply-To: <6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
	<20061206101705.GP26787@mellanox.co.il>
	<45770AA3.2040505@ichips.intel.com>
	<6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com>
Message-ID: <457834DF.7030400@dev.mellanox.co.il>

Moni Levy wrote:
>>
>> Does OFED want the multicast support in 1.2?
>>     
>
> We definitely want the multicast support in 1.2. It's on the wiki (
> OFED 1.2 release plan and features) and I understood that this was
> also agreed on at SC06.
>
> -- Moni
>
>   

We want it but it must work properly before we can integrate it.
Moni - you offered to help with debugging - can you take the test that 
Dotan submitted and debug it to understand the failure?

Thanks,
Tziporet


From tziporet at mellanox.co.il  Thu Dec  7 08:16:03 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Thu, 7 Dec 2006 18:16:03 +0200
Subject: [openib-general] HPCC benchmark aborts at MPIRandomAccess test
Message-ID: <6C2C79E72C305246B504CBA17B5500C9521A21@mtlexch01.mtl.com>

openib-general at openib.org


-----Original Message-----
From: Brian Sparks 
Sent: Thursday, December 07, 2006 6:01 PM
To: Tziporet Koren
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

Which list...

Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Tziporet Koren 
Sent: Thursday, December 07, 2006 8:01 AM
To: Brian Sparks
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

I can't help - you can send an email to openib mailing list - maybe
someone there can help

-----Original Message-----
From: Brian Sparks 
Sent: Thursday, December 07, 2006 5:46 PM
To: Tziporet Koren
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

I still get an invalid cert

Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Tziporet Koren 
Sent: Thursday, December 07, 2006 1:05 AM
To: Brian Sparks; Boris Shpolyansky; Aviram Gutman; Sujal Das
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

That's strange. Can you try the following instead:
Go to https://openib.org/tiki/tiki-index.php and then press the link to
OFED support.

Or - instead of clicking the link do copy & paste to the web browser.

Tziporet

-----Original Message-----
From: Brian Sparks 
Sent: Wednesday, December 06, 2006 6:12 PM
To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

Just going to that link gives me a certificate error.


Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Tziporet Koren 
Sent: Wednesday, December 06, 2006 12:44 AM
To: Brian Sparks; Boris Shpolyansky; Aviram Gutman; Sujal Das
Cc: FAE; Thad Omura; Hani Salloum
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

What have you tried to do - just read or edit?
If you want to edit the file you need to register first, and after you
have an account you can login and edit the page.

Tziporet


-----Original Message-----
From: Brian Sparks 
Sent: Tuesday, December 05, 2006 9:50 PM
To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das
Cc: FAE; Thad Omura; Hani Salloum
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

I get a certificate error

Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Tziporet Koren 
Sent: Tuesday, December 05, 2006 11:48 AM
To: Boris Shpolyansky; Brian Sparks; Aviram Gutman; Sujal Das
Cc: FAE; Thad Omura; Hani Salloum
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

Boris,
The support page can be improved by anyone.
So you are welcome to edit it and make it better. (all you need is to
login to the Wiki and it's pretty intuitive to edit)
If you can't please send me the input and I will try to improve it.

All - please review the support page and suggest what should be
added/changed:
https://openib.org/tiki/tiki-index.php?page=OFED+Support

Thanks,
Tziporet

-----Original Message-----
From: Boris Shpolyansky 
Sent: Tuesday, December 05, 2006 9:29 PM
To: Brian Sparks; Tziporet Koren; Aviram Gutman; Sujal Das
Cc: FAE; Thad Omura
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

I still strongly believe we should have those instructions available.
I'm perfectly fine with having them on OpenFabrics web site with us
providing a link to them from our web site. We should drive this and if
needed put together this page and maintain it as a "service to open
source community" - without taking sole support responsibility.

Will appreciate everybody's comments.

Boris. 

-----Original Message-----
From: Brian Sparks 
Sent: Tuesday, December 05, 2006 8:06 AM
To: Boris Shpolyansky; Tziporet Koren; Aviram Gutman; Sujal Das
Cc: FAE; Thad Omura
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

The OFED support pages should be managed on the OF site. 
Because it's a community source stack, we should not take sole support
responsibility and have it delegated to our site. 

Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Boris Shpolyansky
Sent: Monday, December 04, 2006 5:37 PM
To: Tziporet Koren; Aviram Gutman; Sujal Das; Brian Sparks
Cc: FAE
Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

 Hi,

As far as I could check we do not have a good support page for the OFED
stack (as we used to have for IBGD). The one I found on the OpenIB wiki is
crappy and doesn't look professional at all.
I believe we need to set up such a page either on our web-site or on
OpenIB with a link to it from the main (home) page of both sites. It
should have links to all relevant documents, code download, known issues
and recent patches with clear instructions on how to apply them.

I'm adding an example of the patch instructions I just sent to Sun.

Please, comment.
Boris.


Here is the procedure you need to follow in order to apply the patch I
sent you earlier:
 
1. Patch the source code 
 
tar xvfz OFED-1.1.tgz
cd OFED-1.1/SOURCES
tar xvfz mpi_osu-0.9.7-mlx2.2.0.tgz
cd mpi_osu-0.9.7-mlx2.2.0
cp <where the patch resides>/smpi_cancel.patch .
echo "smpi_cancel.patch" >> patch.lst
cd ..
tar cvfz mpi_osu-0.9.7-mlx2.2.0.tgz mpi_osu-0.9.7-mlx2.2.0
cd ..
 
2. Build OSU MPI (MVAPICH) RPM
 
./build.sh
    - choose option 2 "Build InfiniBand Software RPMs"
    - then choose option 4 "Customize"
    - then answer yes on "mpi_osu"
 
The new RPM will go to OFED/RPMS directory.
 
3. Install newly built RPM
 
- either with ./install.sh script - you'll have to make sure to mark all
needed components or to install with "-c" option using correct ofed.conf
file
- or using "rpm -Uhv" command

-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Boris
Shpolyansky
Sent: Monday, December 04, 2006 2:30 PM
To: Tziporet Koren
Cc: David Costa; Robert Houk; Thomas Babbit; Anthony Vinciguerra;
openib-general at openib.org
Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

I guess we need to have all our recent MPI fixes to be added to the
support page.
Pasha should keep track of those, including the one I sent to Sun.

By the way, where is this support page exactly - on our web site ? 

Boris.

-----Original Message-----
From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il]
Sent: Sunday, December 03, 2006 5:50 AM
To: Boris Shpolyansky
Cc: David Costa; openib-general at openib.org; Robert Houk; Anthony
Vinciguerra; Thomas Babbit
Subject: Re: [openib-general] HPCC benchmark aborts at MPIRandomAccess
test

Boris Shpolyansky wrote:
> Hi David,
>  
> If you are using OFED-1.1 stack and OSU MVAPICH provided with the
> OFED-1.1 package as your MPI layer,
> the attached patch should solve your problem.
>  
> Please, let me know if that helped.
>  
> Regards,
>  
Boris,
Please add this to OFED 1.1 support page

Thanks,
Tziporet



From Brian at Mellanox.com  Thu Dec  7 08:23:51 2006
From: Brian at Mellanox.com (Brian Sparks)
Date: Thu, 7 Dec 2006 08:23:51 -0800
Subject: [openib-general] Certification Error
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com>

 
FYI: To whom it may concern, I'm seeing a certification error on the
following link:

 
https://openib.org/tiki/tiki-index.php 

 
Regards,

 
Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com <http://www.mellanox.com>  

 

From jsquyres at cisco.com  Thu Dec  7 08:34:21 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 7 Dec 2006 11:34:21 -0500
Subject: [openib-general] Certification Error
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F510DBB@mtiexch01.mti.com>
Message-ID: <A0F06BFE-D46B-45F1-8F1C-8AA3AC8D3912@cisco.com>

Correct.  I think it's simply because OFA didn't purchase an SSL  
certificate from a well-known CA (such as Verisign).


On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote:

>
>
> FYI: To whom it may concern, I’m seeing a certification error on  
> the following link:
>
>
>
> https://openib.org/tiki/tiki-index.php
>
>
>
>
>
> Regards,
>
>
>
> Brian Sparks
> Marketing Communications Manager
>
> Mellanox Technologies
> 2900 Stender Way
> Santa Clara, CA 95054
> 408-916-0008 direct  -   408-802-2775 cell
> http://www.mellanox.com
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From sweitzen at cisco.com  Thu Dec  7 08:38:48 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 7 Dec 2006 08:38:48 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9920@xmb-sjc-216.amer.cisco.com>


> > You can't send UDP/multicast traffic at all between IPoIB 
> CM and IPoIB
> > UD?
> 
> With my experimental code, this currently works only if you 
> manually limit the MTU
> for multicast/UD addresses.
> The simplest way to do this is to set up separate interfaces 
> for CM and UD modes.

Separate interfaces as in ib0 vs ib1?  Thus I can use IPoIB HA or IPoIB
CM but not both, which is not very useful.  Speaking of IPoIB CM, will
it work with the OFED IPoIB HA?

Scott


From Brian at Mellanox.com  Thu Dec  7 08:36:45 2006
From: Brian at Mellanox.com (Brian Sparks)
Date: Thu, 7 Dec 2006 08:36:45 -0800
Subject: [openib-general] Certification Error
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com>

Jeff,
Do you know if/when this problem will be fixed?

Brian Sparks
Marketing Communications Manager

Mellanox Technologies
2900 Stender Way
Santa Clara, CA 95054
408-916-0008 direct  -   408-802-2775 cell 
http://www.mellanox.com 


-----Original Message-----
From: Jeff Squyres [mailto:jsquyres at cisco.com] 
Sent: Thursday, December 07, 2006 8:34 AM
To: Brian Sparks
Cc: openib-general at openib.org
Subject: Re: [openib-general] Certification Error

Correct.  I think it's simply because OFA didn't purchase an SSL  
certificate from a well-known CA (such as Verisign).


On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote:

>
>
> FYI: To whom it may concern, I'm seeing a certification error on  
> the following link:
>
>
>
> https://openib.org/tiki/tiki-index.php
>
>
>
>
>
> Regards,
>
>
>
> Brian Sparks
> Marketing Communications Manager
>
> Mellanox Technologies
> 2900 Stender Way
> Santa Clara, CA 95054
> 408-916-0008 direct  -   408-802-2775 cell
> http://www.mellanox.com
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From Brian.Cain at ge.com  Thu Dec  7 08:33:22 2006
From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare))
Date: Thu, 7 Dec 2006 11:33:22 -0500
Subject: [openib-general] website certificate issues
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C9521A21@mtlexch01.mtl.com>
Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301AC53BA@CINMLVEM11.e2k.ad.ge.com>

> -----Original Message-----
...
> 
> -----Original Message-----
> From: Brian Sparks 
> Sent: Wednesday, December 06, 2006 6:12 PM
> To: Tziporet Koren; Boris Shpolyansky; Aviram Gutman; Sujal Das
> Subject: RE: [openib-general] HPCC benchmark aborts at MPIRandomAccess
> test
> 
> Just going to that link gives me a certificate error.
> 
> 
> Brian Sparks
...

I get the same error.  I just "Accept for this session" and grumble in
silence.  For the folks who aren't getting this error, you might have
imported the certificate into your trust store the first time you
visited it.

The website maintainer should probably get a certificate signed by one
of the root CAs.  If one of the national labs can't afford a
certificate, consider getting your cert signed by CACert.org (I've got
their root cert in my store).  

As a quick workaround, you can stop redirecting http:// traffic to
https://.

BTW, putting the expiration in 2010 is too far into the future.  I think
two years is a good max.

-Brian


From jsquyres at cisco.com  Thu Dec  7 08:40:28 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Thu, 7 Dec 2006 11:40:28 -0500
Subject: [openib-general] Certification Error
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com>
References: <9FA59C95FFCBB34EA5E42C1A8573784F510DBD@mtiexch01.mti.com>
Message-ID: <4240A8B1-1219-4A48-87B4-0C3014A9EE40@cisco.com>

I am unaware of any plans to purchase a certificate, but I'm  
certainly not the authority (hah) on this issue.

I suppose that someone could purchase a certificate (does OFA have  
funds for this kind of thing?  certificates need to be renewed on a  
periodic basis), but if they do, my $0.02 is that it should be done  
only for the new server.


On Dec 7, 2006, at 11:36 AM, Brian Sparks wrote:

> Jeff,
> Do you know if/when this problem will be fixed?
>
> Brian Sparks
> Marketing Communications Manager
>
> Mellanox Technologies
> 2900 Stender Way
> Santa Clara, CA 95054
> 408-916-0008 direct  -   408-802-2775 cell
> http://www.mellanox.com
>
>
> -----Original Message-----
> From: Jeff Squyres [mailto:jsquyres at cisco.com]
> Sent: Thursday, December 07, 2006 8:34 AM
> To: Brian Sparks
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] Certification Error
>
> Correct.  I think it's simply because OFA didn't purchase an SSL
> certificate from a well-known CA (such as Verisign).
>
>
> On Dec 7, 2006, at 11:23 AM, Brian Sparks wrote:
>
>>
>>
>> FYI: To whom it may concern, I'm seeing a certification error on
>> the following link:
>>
>>
>>
>> https://openib.org/tiki/tiki-index.php
>>
>>
>>
>>
>>
>> Regards,
>>
>>
>>
>> Brian Sparks
>> Marketing Communications Manager
>>
>> Mellanox Technologies
>> 2900 Stender Way
>> Santa Clara, CA 95054
>> 408-916-0008 direct  -   408-802-2775 cell
>> http://www.mellanox.com
>>
>>
>>
>
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From monil at voltaire.com  Thu Dec  7 08:44:49 2006
From: monil at voltaire.com (Moni Levy)
Date: Thu, 7 Dec 2006 18:44:49 +0200
Subject: [openib-general] [openfabrics-ewg] OFED 1.2 features update -
 RDMA CM etc
In-Reply-To: <457834DF.7030400@dev.mellanox.co.il>
References: <45759B8C.8010408@dev.mellanox.co.il>
	<4575BB05.7040106@ichips.intel.com>
	<4575CD94.8070608@dev.mellanox.co.il>
	<4575D0A8.7080501@ichips.intel.com>
	<20061206083427.GL26787@mellanox.co.il> <45769453.3030509@voltaire.com>
	<20061206101705.GP26787@mellanox.co.il>
	<45770AA3.2040505@ichips.intel.com>
	<6a122cc00612062217j123f80f0xa6da56164e274de@mail.gmail.com>
	<457834DF.7030400@dev.mellanox.co.il>
Message-ID: <6a122cc00612070844g577c50c6p39e2394936ffd794@mail.gmail.com>

On 12/7/06, Tziporet Koren <tziporet at dev.mellanox.co.il> wrote:
> Moni Levy wrote:
> >>
> >> Does OFED want the multicast support in 1.2?
> >>
> >
> > We definitely want the multicast support in 1.2. It's on the wiki (
> > OFED 1.2 release plan and features) and I understood that this was
> > also agreed on at SC06.
> >
> > -- Moni
> >
> >
>
> We want it but it must work properly before we can integrate it.
> Moni - you suggested help in debugging - can you take the test that
> Dotan submitted and debug to understand the failure?

Sure

--Moni

>
> Thanks,
> Tziporet
>


From jlentini at netapp.com  Thu Dec  7 08:59:37 2006
From: jlentini at netapp.com (James Lentini)
Date: Thu, 7 Dec 2006 11:59:37 -0500 (EST)
Subject: [openib-general] Status of DAT conformance test
In-Reply-To: <4577DE44.5030308@lfbs.rwth-aachen.de>
References: <4577DE44.5030308@lfbs.rwth-aachen.de>
Message-ID: <Pine.LNX.4.64.0612071155450.6264@jlentini-linux.nane.netapp.com>


On Thu, 7 Dec 2006, Boris Bierbaum wrote:

> I'm looking for ways to test the standard conformance of a uDAPL 
> provider. I had a look at the DAT conformance test contained in the 
> DAPL reference implementation, release version gamma 3.2.
> 
> This test doesn't seem to be in a state in which it can be used to 
> test a uDAPL version 1.2 provider, is anybody working to fix this?

I don't know of anyone working on this.

> Which test programs can be recommended for this purpose?

While there is no official uDAPL 1.2 conformance test, the tests in 
the "test" directory (dapltest, etc.) are fairly comprehensive.


From mshefty at ichips.intel.com  Thu Dec  7 09:29:41 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 07 Dec 2006 09:29:41 -0800
Subject: [openib-general] potential multicast module issue (was Fwd: FW:
 IPoIB on c0-6 and c0-7: problem creating mcgroup)
In-Reply-To: <20061207132956.GC2614@mellanox.co.il>
References: <20061207132956.GC2614@mellanox.co.il>
Message-ID: <45784F85.9040304@ichips.intel.com>

> __osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, 
> scope_state = 0x1, 
> component mask = 0x0000000000010083, 
> expected comp mask = 0x00000000000130c7, 
> MGID: 0xff12601bffff0000 : 0x0000000000000002 
> from port 0x0002c90200209622
>  
> Missing bits in component mask for creating mcgroup:
>  
> IB_MCR_COMPMASK_QKEY    
> IB_MCR_COMPMASK_TCLASS  
> IB_MCR_COMPMASK_SL      
> IB_MCR_COMPMASK_FLOW    

The multicast module uses the component mask given to it by the user when 
sending join requests to the SA.  Was this a join request coming from ipoib (an 
IPv6?)?  There are definite cases where ipoib does a join without doing a create 
that I tried to leave as they were.

- Sean


From ralph.campbell at qlogic.com  Thu Dec  7 10:05:19 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Thu, 07 Dec 2006 10:05:19 -0800
Subject: [openib-general] [PATCH v3 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <4577C149.3050900@voltaire.com>
References: <1165430156.14800.243.camel@brick.pathscale.com>
	<4577C149.3050900@voltaire.com>
Message-ID: <1165514719.14800.273.camel@brick.pathscale.com>

On Thu, 2006-12-07 at 09:22 +0200, Or Gerlitz wrote:
> Ralph Campbell wrote:
> > This version of the patch adds support for ib_dma_alloc_coherent()
> > and ib_dma_free_coherent().  It also fixes the bug Or found in
> > ipath_sync_single_for_cpu() and ipath_sync_single_for_device().
> 
> > This patch implements the interposing DMA mapping functions to allow
> > support for IOMMUs and remove the dependence on phys_to_virt().
> 
> Haven't you said that the ipath driver uses bus_to_virt ?

It did, this patch removes that too.

> > diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c
> > --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> > +++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Tue Dec 05 16:04:53 2006 -0800
> > +/**
> > + * ipath_dma_map_single - Map a kernel virtual address to DMA address
> > + * @dev: The device for which the dma_addr is to be created
> > + * @cpu_addr: The kernel virtual address
> > + * @size: The size of the region in bytes
> > + * @direction: The direction of the DMA
> > + */
> > +static u64 ipath_dma_map_single(struct ib_device *dev,
> > +			        void *cpu_addr, size_t size,
> > +			        enum dma_data_direction direction)
> > +{
> > +	BUG_ON(!valid_dma_direction(direction));
> > +	return (u64) cpu_addr;
> > +}
> 
> The documentation is both overkill in its volume and, worse, simply
> tells a whole different story than what this code is doing. It does not
> generate DMA address, it does not care about the ib device nor the size 
> or dma direction. Same for all the documentation below.

OK.  I have removed the comments and added the following at the top:

/*
 * The following functions implement driver specific replacements
 * for the ib_dma_*() functions.
 *
 * These functions return kernel virtual addresses instead of
 * device bus addresses since the driver uses the CPU to copy
 * data instead of using hardware DMA.
 */

> > +/**
> > + * ipath_sg_dma_address - Return the DMA address from a scatter/gather entry
> > + * @dev: The device for which the DMA addresses were created
> > + * @sg: The scatter/gather entry
> > + */
> > +static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
> > +{
> > +	return (u64) page_address(sg->page);
> > +}
> 
> this is a bug, you need to add sg->offset
> 
> Or.

Thanks, applied.
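
To illustrate the sg->offset point, here is a small user-space sketch.
struct mock_sg and mock_sg_address() are hypothetical stand-ins, not the
kernel's struct scatterlist or the driver function; page_base plays the
role of page_address(sg->page).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scatterlist fields used here. */
struct mock_sg {
	void *page_base;     /* stands in for page_address(sg->page) */
	unsigned int offset; /* byte offset of the data within the page */
	unsigned int length; /* length of the data in bytes */
};

/*
 * Return the address of the data described by the entry.  Without the
 * offset the caller would get the start of the page rather than the
 * start of the buffer.  A NULL page address is passed through as 0.
 */
static uint64_t mock_sg_address(const struct mock_sg *sg)
{
	uint64_t addr = (uint64_t)(uintptr_t) sg->page_base;

	if (addr)
		addr += sg->offset;
	return addr;
}
```

An entry with offset 128 therefore resolves to page_base + 128, while an
entry whose page has no kernel mapping still yields 0.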


From ralphc at pathscale.com  Thu Dec  7 10:47:33 2006
From: ralphc at pathscale.com (Ralph Campbell)
Date: Thu, 07 Dec 2006 10:47:33 -0800
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs DMA
 mapping functions
Message-ID: <1165517253.14800.283.camel@brick.pathscale.com>

This version of the patch fixes ipath_sg_dma_address() and
updates the comments for ipath_dma.c as Or Gerlitz
suggested.


This patch implements the interposing DMA mapping functions to allow
support for IOMMUs and remove the dependence on phys_to_virt() and
bus_to_virt().

From: Ralph Campbell <ralph.campbell at qlogic.com>

diff -r c76ed2f1387b drivers/infiniband/hw/ipath/Makefile
--- a/drivers/infiniband/hw/ipath/Makefile	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/Makefile	Wed Nov 29 13:54:36 2006 -0800
@@ -6,6 +6,7 @@ ib_ipath-y := \
 ib_ipath-y := \
 	ipath_cq.o \
 	ipath_diag.o \
+	ipath_dma.o \
 	ipath_driver.o \
 	ipath_eeprom.o \
 	ipath_file_ops.o \
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_keys.c
--- a/drivers/infiniband/hw/ipath/ipath_keys.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_keys.c	Wed Nov 29 13:54:36 2006 -0800
@@ -134,7 +134,7 @@ int ipath_lkey_ok(struct ipath_qp *qp, s
 	 */
 	if (sge->lkey == 0) {
 		isge->mr = NULL;
-		isge->vaddr = bus_to_virt(sge->addr);
+		isge->vaddr = (void *) sge->addr;
 		isge->length = sge->length;
 		isge->sge_length = sge->length;
 		ret = 1;
@@ -202,12 +202,12 @@ int ipath_rkey_ok(struct ipath_qp *qp, s
 	int ret;
 
 	/*
-	 * We use RKEY == zero for physical addresses
-	 * (see ipath_get_dma_mr).
+	 * We use RKEY == zero for kernel virtual addresses
+	 * (see ipath_get_dma_mr and ipath_dma.c).
 	 */
 	if (rkey == 0) {
 		sge->mr = NULL;
-		sge->vaddr = phys_to_virt(vaddr);
+		sge->vaddr = (void *) vaddr;
 		sge->length = len;
 		sge->sge_length = len;
 		ss->sg_list = NULL;
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_mr.c
--- a/drivers/infiniband/hw/ipath/ipath_mr.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c	Wed Nov 29 13:54:37 2006 -0800
@@ -54,6 +54,8 @@ static inline struct ipath_fmr *to_ifmr(
  * @acc: access flags
  *
  * Returns the memory region on success, otherwise returns an errno.
+ * Note that all DMA addresses should be created via the
+ * struct ib_dma_mapping_ops functions (see ipath_dma.c).
  */
 struct ib_mr *ipath_get_dma_mr(struct ib_pd *pd, int acc)
 {
@@ -149,8 +151,7 @@ struct ib_mr *ipath_reg_phys_mr(struct i
 	m = 0;
 	n = 0;
 	for (i = 0; i < num_phys_buf; i++) {
-		mr->mr.map[m]->segs[n].vaddr =
-			phys_to_virt(buffer_list[i].addr);
+		mr->mr.map[m]->segs[n].vaddr = (void *) buffer_list[i].addr;
 		mr->mr.map[m]->segs[n].length = buffer_list[i].size;
 		mr->mr.length += buffer_list[i].size;
 		n++;
@@ -347,7 +348,7 @@ int ipath_map_phys_fmr(struct ib_fmr *ib
 	n = 0;
 	ps = 1 << fmr->page_shift;
 	for (i = 0; i < list_len; i++) {
-		fmr->mr.map[m]->segs[n].vaddr = phys_to_virt(page_list[i]);
+		fmr->mr.map[m]->segs[n].vaddr = (void *) page_list[i];
 		fmr->mr.map[m]->segs[n].length = ps;
 		if (++n == IPATH_SEGSZ) {
 			m++;
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Nov 29 13:54:37 2006 -0800
@@ -1599,6 +1599,7 @@ int ipath_register_ib_device(struct ipat
 	dev->detach_mcast = ipath_multicast_detach;
 	dev->process_mad = ipath_process_mad;
 	dev->mmap = ipath_mmap;
+	dev->dma_ops = &ipath_dma_mapping_ops;
 
 	snprintf(dev->node_desc, sizeof(dev->node_desc),
 		 IPATH_IDSTR " %s", init_utsname()->nodename);
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_verbs.h
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h	Wed Nov 29 13:28:14 2006 +0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h	Wed Nov 29 13:54:37 2006 -0800
@@ -812,4 +812,6 @@ extern unsigned int ib_ipath_max_srq_wrs
 
 extern const u32 ib_ipath_rnr_table[];
 
+extern struct ib_dma_mapping_ops ipath_dma_mapping_ops;
+
 #endif				/* IPATH_VERBS_H */
diff -r c76ed2f1387b drivers/infiniband/hw/ipath/ipath_dma.c
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/drivers/infiniband/hw/ipath/ipath_dma.c	Thu Dec 07 10:06:46 2006 -0800
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 QLogic, Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <rdma/ib_verbs.h>
+
+#include "ipath_verbs.h"
+
+#define BAD_DMA_ADDRESS ((u64) 0)
+
+/*
+ * The following functions implement driver specific replacements
+ * for the ib_dma_*() functions.
+ *
+ * These functions return kernel virtual addresses instead of
+ * device bus addresses since the driver uses the CPU to copy
+ * data instead of using hardware DMA.
+ */
+
+static int ipath_mapping_error(struct ib_device *dev, u64 dma_addr)
+{
+	return dma_addr == BAD_DMA_ADDRESS;
+}
+
+static u64 ipath_dma_map_single(struct ib_device *dev,
+			        void *cpu_addr, size_t size,
+			        enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+	return (u64) cpu_addr;
+}
+
+static void ipath_dma_unmap_single(struct ib_device *dev,
+				   u64 addr, size_t size,
+				   enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+static u64 ipath_dma_map_page(struct ib_device *dev,
+			      struct page *page,
+			      unsigned long offset,
+			      size_t size,
+			      enum dma_data_direction direction)
+{
+	u64 addr;
+
+	BUG_ON(!valid_dma_direction(direction));
+
+	if (offset + size > PAGE_SIZE) {
+		addr = BAD_DMA_ADDRESS;
+		goto done;
+	}
+
+	addr = (u64) page_address(page);
+	if (addr)
+		addr += offset;
+	/* TODO: handle highmem pages */
+
+done:
+	return addr;
+}
+
+static void ipath_dma_unmap_page(struct ib_device *dev,
+				 u64 addr, size_t size,
+				 enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+int ipath_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents,
+		 enum dma_data_direction direction)
+{
+	u64 addr;
+	int i;
+	int ret = nents;
+
+	BUG_ON(!valid_dma_direction(direction));
+
+	for (i = 0; i < nents; i++) {
+		addr = (u64) page_address(sg[i].page);
+		/* TODO: handle highmem pages */
+		if (!addr) {
+			ret = 0;
+			break;
+		}
+	}
+	return ret;
+}
+
+static void ipath_unmap_sg(struct ib_device *dev,
+			   struct scatterlist *sg, int nents,
+			   enum dma_data_direction direction)
+{
+	BUG_ON(!valid_dma_direction(direction));
+}
+
+static u64 ipath_sg_dma_address(struct ib_device *dev, struct scatterlist *sg)
+{
+	u64 addr = (u64) page_address(sg->page);
+
+	if (addr)
+		addr += sg->offset;
+	return addr;
+}
+
+static unsigned int ipath_sg_dma_len(struct ib_device *dev,
+				     struct scatterlist *sg)
+{
+	return sg->length;
+}
+
+static void ipath_sync_single_for_cpu(struct ib_device *dev,
+				      u64 addr,
+				      size_t size,
+				      enum dma_data_direction dir)
+{
+}
+
+static void ipath_sync_single_for_device(struct ib_device *dev,
+					 u64 addr,
+					 size_t size,
+					 enum dma_data_direction dir)
+{
+}
+
+static void *ipath_dma_alloc_coherent(struct ib_device *dev, size_t size,
+				      u64 *dma_handle, gfp_t flag)
+{
+	struct page *p;
+	void *addr = NULL;
+
+	p = alloc_pages(flag, get_order(size));
+	if (p)
+		addr = page_address(p);
+	if (dma_handle)
+		*dma_handle = (u64) addr;
+	return addr;
+}
+
+static void ipath_dma_free_coherent(struct ib_device *dev, size_t size,
+				    void *cpu_addr, dma_addr_t dma_handle)
+{
+	free_pages((unsigned long) cpu_addr, get_order(size));
+}
+
+struct ib_dma_mapping_ops ipath_dma_mapping_ops = {
+	ipath_mapping_error,
+	ipath_dma_map_single,
+	ipath_dma_unmap_single,
+	ipath_dma_map_page,
+	ipath_dma_unmap_page,
+	ipath_map_sg,
+	ipath_unmap_sg,
+	ipath_sg_dma_address,
+	ipath_sg_dma_len,
+	ipath_sync_single_for_cpu,
+	ipath_sync_single_for_device,
+	ipath_dma_alloc_coherent,
+	ipath_dma_free_coherent
+};
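
As a reading aid for the patch above, the following user-space sketch
models how the page-mapping path and the mapping-error check fit
together. MOCK_PAGE_SIZE, MOCK_BAD_ADDRESS, mock_map_page() and
mock_mapping_error() are hypothetical names for this illustration only,
and the 4096-byte page size is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE   4096UL          /* assumed page size */
#define MOCK_BAD_ADDRESS ((uint64_t) 0)  /* sentinel, as in the patch */

/*
 * Model of the map_page logic: reject a region that would run past the
 * end of the page, otherwise return the page's virtual address plus
 * the offset.  A page with no virtual address maps to the sentinel.
 */
static uint64_t mock_map_page(void *page_base, unsigned long offset,
			      size_t size)
{
	uint64_t addr;

	if (offset + size > MOCK_PAGE_SIZE)
		return MOCK_BAD_ADDRESS;

	addr = (uint64_t)(uintptr_t) page_base;
	if (addr)
		addr += offset;
	return addr;
}

/* Model of the mapping_error check: only the sentinel is an error. */
static int mock_mapping_error(uint64_t addr)
{
	return addr == MOCK_BAD_ADDRESS;
}
```

Because the "DMA address" is really a kernel virtual address, 0 works as
the error sentinel: both an out-of-range request and an unmapped
(highmem) page end up reported through the same check.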


From halr at voltaire.com  Thu Dec  7 11:58:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 14:58:48 -0500
Subject: [openib-general] [PATCH] osm: Routing Tables are full of
 UNREACHABLE instead of real route
In-Reply-To: <45782F7B.1010408@mellanox.co.il>
References: <45782F7B.1010408@mellanox.co.il>
Message-ID: <1165521425.25587.198999.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-07 at 10:12, Eitan Zahavi wrote:
> Hi Hal,
> 
> I resolved the mystery behind the osm.fdbs that is now full of 
> UNREACHABLE instead of correct out ports.
> 
> The problem is a consequence of the new code that does not use the 
> switch LFT blocks for the intermediate LFT assignments:
> The idea of having incremental updates only relies on temporary buffer 
> that the routing algorithm fills.
> Then it is sent to the wire only if there is a diff between the switch 
> LFT tables (from the SMDB) and the temporary buffer.
> 
> So the switch LFT tables are not being directly updated by the routing 
> algorithm - but only by the GetResp obtained as
> reply to the setting. Until this stage of the description - everything 
> looks right.
> 
> But what is wrong is that the dump of LFT tables is invoked before the 
> GetResp is obtained.
> So if only a single sweep is invoked the resulting osm.fdbs show the 
> original state of the SMDB tables which is full of 0xFF = UNREACHABLE.
> 
> The patch below is taking the easy way and should be probably revisited. 
> Instead of having a separate algorithm step for dumping out the 
> resulting GetResp data after all LFT responses were obtained it just 
> copies the sent LFT blocks to the SMDB.

Any idea on why the LFT set failed ?

> I think we need to have at least this simple patch until we have the 
> dump move to a new algorithm step.

Good find.

Applied. Thanks.

We'll revisit a longer term solution to this issue.

-- Hal

> Thanks
> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
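
The incremental-update scheme and the interim fix described above can be
sketched in plain C. This is only a model under stated assumptions: 64
entries per LFT block, and the hypothetical helper lft_sync_blocks()
standing in for the real OpenSM code, which is not shown in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE  64   /* assumed entries per LFT block */
#define LFT_UNREACHABLE 0xFF /* value of an unassigned entry */

/*
 * Compare the routing engine's temporary buffer (new_lft) with the
 * cached switch table (cached) one block at a time.  A block that
 * differs is one that would be sent to the switch; following the
 * interim fix, it is also copied into the cache right away instead of
 * waiting for the GetResp.  Returns the number of blocks "sent".
 */
static unsigned int lft_sync_blocks(uint8_t *cached, const uint8_t *new_lft,
				    size_t num_blocks)
{
	unsigned int sent = 0;
	size_t b;

	for (b = 0; b < num_blocks; b++) {
		uint8_t *c = cached + b * LFT_BLOCK_SIZE;
		const uint8_t *n = new_lft + b * LFT_BLOCK_SIZE;

		if (memcmp(c, n, LFT_BLOCK_SIZE) != 0) {
			/* a real SM would issue the LFT Set here */
			memcpy(c, n, LFT_BLOCK_SIZE);
			sent++;
		}
	}
	return sent;
}
```

With the copy in place, a dump taken right after the first sweep sees
the new routes rather than a table full of LFT_UNREACHABLE, and a second
pass over unchanged routes sends nothing.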


From swise at opengridcomputing.com  Thu Dec  7 12:14:43 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 14:14:43 -0600
Subject: [openib-general] [ANNOUNCE] - Ammasso Library Git Repository
Message-ID: <1165522483.14449.39.camel@stevo-desktop>

The Ammasso RDMA library is now maintained via git at: 

git://staging.openfabrics.org/~swise/libamso.git


Thanks,


Steve.


From swise at opengridcomputing.com  Thu Dec  7 12:20:11 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 14:20:11 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <1165522811.14449.46.camel@stevo-desktop>

The Chelsio T3 RDMA Library is now maintained via git at:

git://staging.openfabrics.org/~swise/libcxgb3.git

I'm also maintaining a kernel git repository with the various needed
patches for running the T3 device at:

git://staging.openfabrics.org/~swise/cxgb3.git

This repository is based on Linus's git tree as of 2.6.19.  The cxgb3
branch should be checked out to get the latest T3 patches including the
low level Ethernet driver.  This repos will eventually go away
(hopefully :) as the T3 drivers are pulled into 2.6.20.


Thanks,

Steve.


From cap at nsc.liu.se  Thu Dec  7 12:20:48 2006
From: cap at nsc.liu.se (Peter Kjellstrom)
Date: Thu, 7 Dec 2006 21:20:48 +0100
Subject: [openib-general] IBGOLD installation on Red Hat - gcc problem
In-Reply-To: <1165440197.2894.5.camel@julia.et.endace.com>
References: <1165440197.2894.5.camel@julia.et.endace.com>
Message-ID: <200612072120.52117.cap@nsc.liu.se>

On Wednesday 06 December 2006 22:23, vishal wrote:
> Hi,
>
>       Was trying to install IBGOLD on Red Hat 4 (x86_64), and the
> following is the 'error' part from a log file. I couldn't find the
> -Xcompiler option in the gcc manual. Am I missing something ?

First, this list isn't really a good place for IBGD questions, you should 
probably contact mellanox. That said, you are probably missing some packages. 
Make sure you have atleast gcc, libgcc and glibc-devel installed (building 
IBGD will require more but those are probably a start).

/Peter (who has built IBGD from 0.5.0 to 1.8.2 on EL4)

From robert.j.woodruff at intel.com  Thu Dec  7 13:55:55 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 7 Dec 2006 13:55:55 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>

Steve Wise wrote,
>I'm also maintaining a kernel git repository with the various needed
>patches for running the T3 device at:

>git://staging.openfabrics.org/~swise/cxgb3.git

>This repository is based on Linus's git tree as of 2.6.19.  The cxgb3
>branch should be checked out to get the latest T3 patches including the
>low level Ethernet driver.  This repos will eventually go away
>(hopefully :) as the T3 drivers are pulled into 2.6.20.

It looks like this tree is not based on 2.6.19 for the
drivers/infiniband/core
but some other tree. When I do a git-diff of your tree against
linux-2.6.19
there are diffs in the drivers/infiniband/core that should not be there.

I think you need to rebase your tree on a stock linux 2.6.19 and then
only add
the cxgb3 code, or have a branch that only contains the cxgb3 code and
another 
branch that might contain other newer infiniband/core code if you want
to test with that.
This way, someone can easily do a git-diff of cxgb3 with a stock
linux-2.6.19 to
generate a patch that only contains the Chelsio code.

woody


From swise at opengridcomputing.com  Thu Dec  7 14:09:37 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 16:09:37 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
Message-ID: <1165529377.14449.75.camel@stevo-desktop>

On Thu, 2006-12-07 at 13:55 -0800, Woodruff, Robert J wrote:
> Steve Wise wrote,
> >I'm also maintaining a kernel git repository with the various needed
> >patches for running the T3 device at:
> 
> >git://staging.openfabrics.org/~swise/cxgb3.git
> 
> >This repository is based on Linus's git tree as of 2.6.19.  The cxgb3
> >branch should be checked out to get the latest T3 patches including the
> >low level Ethernet driver.  This repos will eventually go away
> >(hopefully :) as the T3 drivers are pulled into 2.6.20.
> 
> It looks like this tree is not based on 2.6.19 for the
> drivers/infiniband/core
> but some other tree. When I do a git-diff of your tree against
> linux-2.6.19
> there are diffs in the drivers/infiniband/core that should not be there.
> 

It is based on 2.6.19. 

But it also has sean's ucma patch series (the old 7 part patch
series...I haven't updated to his latest 5-part patch set yet or tried to
use his git tree).  Plus it has an iwcm fix from Krishna Kumar that
fixed bugs I've hit during QA testing.

> I think you need to rebase your tree on a stock linux 2.6.19 and then
> only add
> the cxgb3 code, or have a branch that only contains the cxgb3 code and
> another 
> branch that might contain other newer infiniband/core code if you want
> to test with that.

Yea maybe.  For now, you get everything I need to make cxgb3 run on
2.6.19.  I'll think about the multiple branch approach. 

> This way, someone can easily do a git-diff of cxgb3 with a stock
> linux-2.6.19 to
> generate a patch that only contains the Chelsio code.

I'm struggling with maintaining a patch series in-review on lkml and
netdev, plus maintaining a consistent tree that I can QA on and not
introduce bugs from other stuff going into 2.6.20. So I don't want to
just base this tree on Roland's for-2.6.20, as an example.  I really
just want 2.6.19 + stuff needed to run chelsio's T3.  Right now, that is
the UCMA stuff + a few core fixes...

Roland, I welcome your thoughts too on how I should do this.  I'm new to
git.  Also I'm using stgit to maintain the chelsio driver patch series,
so I continually pop it and add fixes to each patch as I fix things, so
the tree really is kind of in-flux...


Steve.


From rdreier at cisco.com  Thu Dec  7 14:13:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 Dec 2006 14:13:15 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165529377.14449.75.camel@stevo-desktop> (Steve Wise's
	message of "Thu, 07 Dec 2006 16:09:37 -0600")
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
	<1165529377.14449.75.camel@stevo-desktop>
Message-ID: <adalklj9wxw.fsf@cisco.com>

 > Plus it has an iwcm fix from Krishna Kumar that
 > fixed bugs I've hit during QA testing.

Do I have that patch (or is it in 2.6.20 already)?

 > Roland, I welcome your thoughts too on how I should do this.  I'm new to
 > git.  Also I'm using stgit to maintain the chelsio driver patch series,
 > so I continually pop it and add fixes to each patch as I fix things, so
 > the tree really is kind of in-flux...

What you're doing sounds reasonable.  If you want to create a "chelsio
prerequisites" branch that might address Woody's concern -- then a git
diff between the branches would show the chelsio changes only.  And
that would be really cheap to do -- just create a new branch pointing
at the commit before the chelsio stuff in your stack.


From swise at opengridcomputing.com  Thu Dec  7 14:16:13 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 16:16:13 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <adalklj9wxw.fsf@cisco.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
	<1165529377.14449.75.camel@stevo-desktop> <adalklj9wxw.fsf@cisco.com>
Message-ID: <1165529773.14449.80.camel@stevo-desktop>

On Thu, 2006-12-07 at 14:13 -0800, Roland Dreier wrote:
>  > Plus it has an iwcm fix from Krishna Kumar that
>  > fixed bugs I've hit during QA testing.
> 
> Do I have that patch (or is it in 2.6.20 already)?
> 

Yes.

>  > Roland, I welcome your thoughts too on how I should do this.  I'm new to
>  > git.  Also I'm using stgit to maintain the chelsio driver patch series,
>  > so I continually pop it and add fixes to each patch as I fix things, so
>  > the tree really is kind of in-flux...
> 
> What you're doing sounds reasonable.  If you want to create a "chelsio
> prerequisites" branch that might address Woody's concern -- then a git
> diff between the branches would show the chelsio changes only.  And
> that would be really cheap to do -- just create a new branch pointing
> at the commit before the chelsio stuff in your stack.

Lemme try this out and Woody: I'll get back to ya.

Steve.


From robert.j.woodruff at intel.com  Thu Dec  7 14:21:09 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 7 Dec 2006 14:21:09 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014BE756@orsmsx418.amr.corp.intel.com>

Steve wrote,
>Yea maybe.  For now, you get everything I need to make cxgb3 run on
>2.6.19.  I'll think about the multiple branch approach. 

The issue is this. I am working on putting together an OFA integration
tree that integrates several components from several different
developers.
The same will be true when we start to integrate code into OFED 1.2.
Most code will come from Linus's tree, but some code will need to
come directly from the developer's git trees and we will need 
a way to generate a patch for only your code, as we will get things like
the local_sa cache code directly from Sean's. 

So if you can make a branch that only contains the cxgb3 code, it makes
generating a patch with only your code easier, and this will be needed
both for my early OFA integration work and also for OFED 1.2. 
Once your code is upstream, life is easier as we will get it from
linus, until then we'd like a way to patch the existing released kernel
(2.6.19 in this case) with your code. 

make sense ?

woody


From swise at opengridcomputing.com  Thu Dec  7 14:24:10 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 16:24:10 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014BE756@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE756@orsmsx418.amr.corp.intel.com>
Message-ID: <1165530250.14449.85.camel@stevo-desktop>

On Thu, 2006-12-07 at 14:21 -0800, Woodruff, Robert J wrote:
> Steve wrote,
> >Yea maybe.  For now, you get everything I need to make cxgb3 run on
> >2.6.19.  I'll think about the multiple branch approach. 
> 
> The issue is this. I am working on putting together an OFA integration
> tree that integrates several components from several different
> developers.
> The same will be true when we start to integrate code into OFED 1.2.
> Most code will come from Linus's tree, but some code will need to
> come directly from the developer's git trees and we will need 
> a way to generate a patch for only your code, as we will get things like
> the local_sa cache code directly from Sean's. 
> 
> So if you can make a branch that only contains the cxgb3 code, it makes
> generating a patch with only your code easier, and this will be needed
> both for my early OFA integration work and also for OFED 1.2. 
> Once your code is upstream, life is easier as we will get it from
> linus, until then we'd like a way to patch the existing released kernel
> (2.6.19 in this case) with your code. 
> 
> make sense ?

I understand.


From robert.j.woodruff at intel.com  Thu Dec  7 14:24:05 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 7 Dec 2006 14:24:05 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014BE76A@orsmsx418.amr.corp.intel.com>

Steve wrote>
>> 
>> What you're doing sounds reasonable.  If you want to create a "chelsio
>> prerequisites" branch that might address Woody's concern -- then a git
>> diff between the branches would show the chelsio changes only.  And
>> that would be really cheap to do -- just create a new branch pointing
>> at the commit before the chelsio stuff in your stack.

>Lemme try this out and Woody: I'll get back to ya.

>Steve.

Thanks
woody


From mshefty at ichips.intel.com  Thu Dec  7 14:24:32 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 07 Dec 2006 14:24:32 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165529377.14449.75.camel@stevo-desktop>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
	<1165529377.14449.75.camel@stevo-desktop>
Message-ID: <457894A0.6020002@ichips.intel.com>

> I'm struggling with maintaining a patch series in-review on lkml and
> netdev, plus maintaining a consistent tree that I can QA on and not
> introduce bugs from other stuff going into 2.6.20. So I don't want to
> just base this tree on Roland's for-2.6.20, as an example.  I really
> just want 2.6.19 + stuff needed to run chelsio's T3.  Right now, that is
> the UCMA stuff + a few core fixes...

I'm sure Roland can provide more input here, but what I did was start with 
2.6.19.  Then, for each feature set in SVN, I created a new git branch, reworked 
the SVN patches, and applied them to that branch.  Where I had dependencies, I 
simply branched off one of my branches.  For example, my multicast branch is off 
  my rdma_ucm branch.

My master branch is 2.6.19.  My intent is to update my tree with each new Linux 
release.

As an aside, I created a test-apps branch to throw all my kernel test apps into. 
  (I really didn't want to maintain a branch per test app, since these will 
never merge upstream.)  I included krping in that tree, since I didn't see where 
you were maintaining it, and I didn't want to lose it.

> Roland, I welcome your thoughts too on how I should do this.  I'm new to
> git.  Also I'm using stgit to maintain the chelsio driver patch series,
> so I continually pop it and add fixes to each patch as I fix things, so
> the tree really is kind of in-flux...

I didn't think that you wanted to do this after you've published a tree.  If 
someone clones your tree, then you use stgit to pop a patch, modify it, then 
recommit it, I'm not sure how a cloned tree reconciles the changes.

- Sean


From caryang at cisco.com  Thu Dec  7 14:27:31 2006
From: caryang at cisco.com (Carl Yang (caryang))
Date: Thu, 7 Dec 2006 14:27:31 -0800
Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support
 foroperation over IPoIB
Message-ID: <A3C7F222C349444DA8B01D881740A0030109116F@xmb-sjc-226.amer.cisco.com>

Or,

Can you please forward me (or to the email alias) "an example bonding
sysfs script which can be used to set bonding to work with patches 1-3?"

Thanks,
Carl


-----Original Message-----
From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Or Gerlitz
Sent: Thursday, November 30, 2006 2:57 AM
To: netdev at vger.kernel.org
Cc: Roland Dreier (rdreier); Jay Vosburgh; openib-general at openib.org
Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support
foroperation over IPoIB

This patch series is a second version (see below link to V1) of the
suggested changes to the bonding driver such that it would be able to
support non ARPHRD_ETHER netdevices for its High-Availability
(active-backup) mode.

The motivation is to enable the bonding driver in its HA mode to work
with the IP over Infiniband (IPoIB) driver. With these patches I was
able to enslave IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast
and ICMP traffic with fail-over and fail-back working fine. My working
env was the net-2.6.20 git.

Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver
which is used by native IB ULPs whose addressing scheme is based on IP
(eg iSER, SDP, Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices
**enables** HA for these ULPs. This holds because when the ULP is informed by
the IB HW of the failure of the current IB connection, it just needs to
reconnect, and the bonding device will then issue the IB ARP over the
active IPoIB slave.

The first patch changes some of the bond netdevice attributes and
functions to be that of the active slave for the case of the enslaved
device not being of ARPHRD_ETHER type. Basically it overrides those
setting done by ether_setup(), which are netdevice **type** dependent
and hence might not be appropriate for devices of other types. It also
enforces mutual exclusion on bonding slaves from dissimilar ether types,
as was concluded over the v1 discussion.

IPoIB (see Documentation/infiniband/ipoib.txt) MAC address is made of a
3 bytes IB QP (Queue Pair) number and 16 bytes IB port GID (Global ID)
of the port this IPoIB device is bound to. The QP is a resource
created by the IB HW and the GID is an identifier burned into the HCA (I
have omitted here some details which are not important for the bonding
RFC).
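
To make the layout above concrete, here is a minimal user-space sketch of splitting a 20-byte IPoIB hardware address into its parts. It assumes the usual layout of one leading reserved/flags byte (one of the details omitted above), then the 3-byte QP number, then the 16-byte port GID. The names ipoib_hw_addr_fields and ipoib_hw_addr_split are hypothetical helpers for illustration only, not kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IPOIB_HW_ADDR_LEN 20	/* same value as INFINIBAND_ALEN in the kernel */

/* Hypothetical helper type: the pieces of an IPoIB hardware address. */
struct ipoib_hw_addr_fields {
	uint8_t  flags;		/* byte 0: reserved/flags byte */
	uint32_t qpn;		/* bytes 1-3: 24-bit QP number, host order */
	uint8_t  gid[16];	/* bytes 4-19: port GID */
};

/* Split a raw 20-byte address into flags, QPN and GID (assumed layout). */
static void ipoib_hw_addr_split(const uint8_t addr[IPOIB_HW_ADDR_LEN],
				struct ipoib_hw_addr_fields *out)
{
	assert(addr != NULL && out != NULL);
	out->flags = addr[0];
	out->qpn = ((uint32_t)addr[1] << 16) |
		   ((uint32_t)addr[2] << 8) |
		    (uint32_t)addr[3];
	memcpy(out->gid, addr + 4, sizeof(out->gid));
}
```

Since both the QPN and the GID are assigned by the HW, a different slave necessarily has a different address, which is why the bond has to take over the address of the active slave.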

Basically the IPoIB spec and impl. do not allow for setting the MAC
address of an IPoIB device and this work was made under this assumption.

Hence, the second patch allows for enslaving netdevices which do not
support the set_mac_address() function. In that case the bond mac
address is the one of the active slave, and remote peers are notified
of the mac address (neighbour) change by the Gratuitous ARP that bonding
sends when fail-over occurs (this is already done by the bonding code).

Normally, the bonding driver is UP before any enslavement takes place.
Once a netdevice is UP, the network stack acts to have it join some
multicast groups (eg the all-hosts 224.0.0.1). Now, since ether_setup()
has set the bonding device type to be ARPHRD_ETHER and address len to
be ETH_ALEN, the net core code computes a wrong multicast link
address. This is because ip_eth_mc_map() is called, whereas for mcast joins
taking place **after** the enslavement another ip_xxx_mc_map() is called
(eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND)
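
As an illustration of why the early join goes wrong, the sketch below reproduces the standard Ethernet mapping rule for IPv4 multicast (fixed prefix 01:00:5e followed by the low 23 bits of the group address). It is a user-space approximation of what an Ethernet-type map function computes, not the kernel's code, and eth_mc_map is a hypothetical name. The result is a 6-byte address, whereas an IPoIB link address is 20 bytes, so a link address computed this way cannot be valid for an IPoIB slave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ADDR_LEN 6	/* ETH_ALEN in the kernel */
#define IB_ADDR_LEN 20	/* INFINIBAND_ALEN in the kernel */

/*
 * Map an IPv4 multicast group (host byte order) to an Ethernet
 * multicast MAC: 01:00:5e, then the low 23 bits of the group.
 */
static void eth_mc_map(uint32_t group, uint8_t mac[ETH_ADDR_LEN])
{
	assert(mac != NULL);
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of this byte is always 0 */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

For the all-hosts group 224.0.0.1 (0xE0000001) this yields 01:00:5e:00:00:01, which only fills ETH_ADDR_LEN of the IB_ADDR_LEN bytes an InfiniBand-type device expects.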

The third patch handles this problem by allowing devices to be enslaved
when the bonding device is not up. In the discussion held on the previous
post this seemed to be the cleanest way to go, and it is not
expected to cause instabilities.

These patches are not enough for configuration of IPoIB bonding through
tools (eg /sbin/ifenslave and /sbin/ifup) provided by packages such as
sysconfig and initscripts, specifically since these tools set the
bonding device to be UP before enslaving anything. Once this patchset
gets positive feedback the next step would be to look at how to enhance the
tools/packages so it would be possible to bond/enslave with the modified
code. As suggested by the bonding maintainer, this step can potentially
involve converting ifenslave to be a script based on the bonding sysfs
infrastructure rather than on the somewhat obsolete
Documentation/networking/ifenslave.c

For the ease of potential testers, I will post an example bonding sysfs
script which can be used to set bonding to work with patches 1-3 (let me
know!)

Or.

changes from V1 (the links point to V1 0-3/3)

http://marc.theaimsgroup.com/?l=linux-netdev&m=115926582209736&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599515568&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599430055&w=2
http://marc.theaimsgroup.com/?l=linux-netdev&m=115926599415729&w=2

+ enforce mutual exclusion on the slaves ether types
+ don't attempt to set the bond mtu when enslaving a non ARPHRD_ETHER device
+ rather than hack the bond device ether type through mod params, allow
  enslavement when the bond device is not up



From swise at opengridcomputing.com  Thu Dec  7 14:30:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 07 Dec 2006 16:30:44 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <457894A0.6020002@ichips.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE6BE@orsmsx418.amr.corp.intel.com>
	<1165529377.14449.75.camel@stevo-desktop>
	<457894A0.6020002@ichips.intel.com>
Message-ID: <1165530644.14449.88.camel@stevo-desktop>

On Thu, 2006-12-07 at 14:24 -0800, Sean Hefty wrote:
> > I'm struggling with maintaining a patch series in-review on lkml and
> > netdev, plus maintaining a consistent tree that I can QA on and not
> > introduce bugs from other stuff going into 2.6.20. So I don't want to
> > just base this tree on Roland's for-2.6.20, as an example.  I really
> > just want 2.6.19 + stuff needed to run chelsio's T3.  Right now, that is
> > the UCMA stuff + a few core fixes...
> 
> I'm sure Roland can provide more input here, but what I did was start with 
> 2.6.19.  Then, for each feature set in SVN, I created a new git branch, reworked 
> the SVN patches, and applied them to that branch.  Where I had dependencies, I 
> simply branched off one of my branches.  For example, my multicast branch is off 
>   my rdma_ucm branch.
> 
> My master branch is 2.6.19.  My intent is to update my tree with each new Linux 
> release.
> 
> As an aside, I created a test-apps branch to throw all my kernel test apps into. 
>   (I really didn't want to maintain a branch per test app, since these will 
> never merge upstream.)  I included krping in that tree, since I didn't see where 
> you were maintaining it, and I didn't want to lose it.
> 

Thanks!  I forgot about that stuff!

> > Roland, I welcome your thoughts too on how I should do this.  I'm new to
> > git.  Also I'm using stgit to maintain the chelsio driver patch series,
> > so I continually pop it and add fixes to each patch as I fix things, so
> > the tree really is kind of in-flux...
> 
> I didn't think that you wanted to do this after you've published a tree.  If 
> someone clones your tree, then you use stgit to pop a patch, modify it, then 
> recommit it, I'm not sure how a cloned tree reconciles the changes.
> 

Well, the life of this T3 git tree will hopefully be short:  we're
trying hard to get the kernel bits of T3 into 2.6.20...

You're right.  Folks cannot base against this tree and do a pull to
refresh.  It'll get balled up.  But Roland's tree is the same way. 


Steve.


From halr at voltaire.com  Thu Dec  7 14:32:29 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 17:32:29 -0500
Subject: [openib-general] [PATCH] Diags/saquery: Add support for querying
	ServiceRecords
Message-ID: <1165530723.25587.203519.camel@hal.voltaire.com>

Diags/saquery: Add support for querying ServiceRecords

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/diags/ChangeLog b/diags/ChangeLog
index 186059c..318f4b9 100644
--- a/diags/ChangeLog
+++ b/diags/ChangeLog
@@ -1,3 +1,8 @@
+2006-12-07  Hal Rosenstock <halr at voltaire.com>
+
+	* src/saquery.c, man/saquery.8: Add support for
+	  querying ServiceRecords
+
 2006-11-21  Hal Rosenstock <halr at voltaire.com>
 
 	* src/perfquery.c: Add support for PerfMgt ClassPortInfo:
diff --git a/diags/man/saquery.8 b/diags/man/saquery.8
index 853effc..5bbc8a2 100644
--- a/diags/man/saquery.8
+++ b/diags/man/saquery.8
@@ -1,11 +1,11 @@
-.TH SAQUERY 8 "October 9, 2006" "OpenIB" "OpenIB Diagnostics"
+.TH SAQUERY 8 "December 7, 2006" "OpenIB" "OpenIB Diagnostics"
 
 .SH NAME
 saquery \- query InfiniBand subnet administration attributes 
 
 .SH SYNOPSIS
 .B saquery 
-[\-h] [\-d] [\-P] [\-N] [\-D] [\-L] i[\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst <src:dst>] [<name>]
+[\-h] [\-d] [\-P] [\-N] [\-D] [\-S] [\-L] i[\-l] [\-G] [\-C] [\-s] [\-g] [\-m] [--src-to-dst <src:dst>] [<name>]
 
 .SH DESCRIPTION
 .PP
@@ -24,6 +24,9 @@ get NodeRecord info
 \fB\-D\fR
 get NodeDescriptions of CAs only
 .TP
+\fB\-S\fR
+get ServiceRecord info
+.TP
 \fB\-L\fR
 return the Lids of the name specified
 .TP
diff --git a/diags/src/saquery.c b/diags/src/saquery.c
index cc39b06..168df01 100644
--- a/diags/src/saquery.c
+++ b/diags/src/saquery.c
@@ -334,6 +334,104 @@ print_multicast_member_record(ib_member_
 }
 
 static void
+print_service_record(ib_service_record_t *p_sr)
+{
+	char buf_service_key[35];
+	char buf_service_name[65];
+
+	sprintf(buf_service_key,
+		"0x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x",
+		p_sr->service_key[0],
+		p_sr->service_key[1],
+		p_sr->service_key[2],
+		p_sr->service_key[3],
+		p_sr->service_key[4],
+		p_sr->service_key[5],
+		p_sr->service_key[6],
+		p_sr->service_key[7],
+		p_sr->service_key[8],
+		p_sr->service_key[9],
+		p_sr->service_key[10],
+		p_sr->service_key[11],
+		p_sr->service_key[12],
+		p_sr->service_key[13],
+		p_sr->service_key[14],
+		p_sr->service_key[15]);
+	strncpy(buf_service_name, (char *)p_sr->service_name, 64);
+	buf_service_name[64] = '\0';
+
+	printf("ServiceRecord dump:\n"
+	       "\t\t\t\tServiceID...............0x%016" PRIx64 "\n"
+	       "\t\t\t\tServiceGID..............0x%016" PRIx64 " : "
+	       "0x%016" PRIx64 "\n"
+	       "\t\t\t\tServiceP_Key............0x%X\n"
+	       "\t\t\t\tServiceLease............0x%X\n"
+	       "\t\t\t\tServiceKey..............%s\n"
+	       "\t\t\t\tServiceName.............%s\n"
+	       "\t\t\t\tServiceData8.1..........0x%X\n"
+	       "\t\t\t\tServiceData8.2..........0x%X\n"
+	       "\t\t\t\tServiceData8.3..........0x%X\n"
+	       "\t\t\t\tServiceData8.4..........0x%X\n"
+	       "\t\t\t\tServiceData8.5..........0x%X\n"
+	       "\t\t\t\tServiceData8.6..........0x%X\n"
+	       "\t\t\t\tServiceData8.7..........0x%X\n"
+	       "\t\t\t\tServiceData8.8..........0x%X\n"
+	       "\t\t\t\tServiceData8.9..........0x%X\n"
+	       "\t\t\t\tServiceData8.10.........0x%X\n"
+	       "\t\t\t\tServiceData8.11.........0x%X\n"
+	       "\t\t\t\tServiceData8.12.........0x%X\n"
+	       "\t\t\t\tServiceData8.13.........0x%X\n"
+	       "\t\t\t\tServiceData8.14.........0x%X\n"
+	       "\t\t\t\tServiceData8.15.........0x%X\n"
+	       "\t\t\t\tServiceData8.16.........0x%X\n"
+	       "\t\t\t\tServiceData16.1.........0x%X\n"
+	       "\t\t\t\tServiceData16.2.........0x%X\n"
+	       "\t\t\t\tServiceData16.3.........0x%X\n"
+	       "\t\t\t\tServiceData16.4.........0x%X\n"
+	       "\t\t\t\tServiceData16.5.........0x%X\n"
+	       "\t\t\t\tServiceData16.6.........0x%X\n"
+	       "\t\t\t\tServiceData16.7.........0x%X\n"
+	       "\t\t\t\tServiceData16.8.........0x%X\n"
+	       "\t\t\t\tServiceData32.1.........0x%X\n"
+	       "\t\t\t\tServiceData32.2.........0x%X\n"
+	       "\t\t\t\tServiceData32.3.........0x%X\n"
+	       "\t\t\t\tServiceData32.4.........0x%X\n"
+	       "\t\t\t\tServiceData64.1.........0x%016" PRIx64 "\n"
+	       "\t\t\t\tServiceData64.2.........0x%016" PRIx64 "\n"
+	       "",
+	       cl_ntoh64( p_sr->service_id ),
+	       cl_ntoh64( p_sr->service_gid.unicast.prefix ),
+	       cl_ntoh64( p_sr->service_gid.unicast.interface_id ),
+	       cl_ntoh16( p_sr->service_pkey ),
+	       cl_ntoh32( p_sr->service_lease ),
+	       buf_service_key,
+	       buf_service_name,
+	       p_sr->service_data8[0], p_sr->service_data8[1],
+	       p_sr->service_data8[2], p_sr->service_data8[3],
+	       p_sr->service_data8[4], p_sr->service_data8[5],
+	       p_sr->service_data8[6], p_sr->service_data8[7],
+	       p_sr->service_data8[8], p_sr->service_data8[9],
+	       p_sr->service_data8[10], p_sr->service_data8[11],
+	       p_sr->service_data8[12], p_sr->service_data8[13],
+	       p_sr->service_data8[14], p_sr->service_data8[15],
+	       cl_ntoh16(p_sr->service_data16[0]),
+	       cl_ntoh16(p_sr->service_data16[1]),
+	       cl_ntoh16(p_sr->service_data16[2]),
+	       cl_ntoh16(p_sr->service_data16[3]),
+	       cl_ntoh16(p_sr->service_data16[4]),
+	       cl_ntoh16(p_sr->service_data16[5]),
+	       cl_ntoh16(p_sr->service_data16[6]),
+	       cl_ntoh16(p_sr->service_data16[7]),
+	       cl_ntoh32(p_sr->service_data32[0]),
+	       cl_ntoh32(p_sr->service_data32[1]),
+	       cl_ntoh32(p_sr->service_data32[2]),
+	       cl_ntoh32(p_sr->service_data32[3]),
+	       cl_ntoh64(p_sr->service_data64[0]),
+	       cl_ntoh64(p_sr->service_data64[1])
+	      );
+}
+
+static void
 return_mad(void)
 {
 	/*
@@ -645,6 +743,26 @@ print_multicast_group_records(osm_bind_h
 	return (status);
 }
 
+static ib_api_status_t 
+print_service_records(osm_bind_handle_t bind_handle)
+{
+	int                  i = 0;
+	ib_service_record_t  *service_record = NULL;
+	ib_net16_t           attr_offset = ib_get_attr_offset(sizeof(*service_record));
+	ib_api_status_t      status;
+
+	status = get_all_records(bind_handle, IB_MAD_ATTR_SERVICE_RECORD, attr_offset, 0);
+	if (status != IB_SUCCESS)
+		return (status);
+
+	for (i = 0; i < result.result_cnt; i++) {
+		service_record = osmv_get_query_svc_rec(result.p_result_madw, i);
+		print_service_record(service_record);
+	}
+	return_mad();
+	return (status);
+}
+
 static osm_bind_handle_t
 get_bind_handle(void)
 {
@@ -729,12 +847,13 @@ clean_up(void)
 static void
 usage(void)
 {
-	fprintf(stderr, "Usage: %s [-h -d -P -N -D -L -l -G -C -s -g -m --src-to-dst <src:dst>] [<name>]\n", argv0);
+	fprintf(stderr, "Usage: %s [-h -d -P -N -D -S -L -l -G -C -s -g -m --src-to-dst <src:dst>] [<name>]\n", argv0);
 	fprintf(stderr, "   Queries node records by default\n");
 	fprintf(stderr, "   -d enable debugging\n");
 	fprintf(stderr, "   -P get PathRecord info\n");
 	fprintf(stderr, "   -N get NodeRecord info\n");
 	fprintf(stderr, "   -D get NodeDescriptions of CAs only\n");
+	fprintf(stderr, "   -S get ServiceRecord info\n");
 	fprintf(stderr, "   -L return the Lids of the name specified\n");
 	fprintf(stderr, "   -l return the unique Lid of the name specified\n");
 	fprintf(stderr, "   -G return the Guids of the name specified\n");
@@ -758,7 +877,7 @@ main(int argc, char **argv)
 	ib_net16_t         dst_lid;
 	ib_api_status_t    status;
 
-	static char const str_opts[] = "PNDLlGCsgmdh";
+	static char const str_opts[] = "PNDLlGCSsgmdh";
 	static const struct option long_opts [] = {
 	   {"P", 0, 0, 'P'},
 	   {"N", 0, 0, 'N'},
@@ -771,6 +890,7 @@ main(int argc, char **argv)
 	   {"m", 0, 0, 'm'},
 	   {"d", 0, 0, 'd'},
 	   {"C", 0, 0, 'C'},
+	   {"S", 0, 0, 'S'},
 	   {"help", 0, 0, 'h'},
 	   {"src-to-dst", 1, 0, 1},
 	   { }
@@ -806,6 +926,9 @@ main(int argc, char **argv)
 		case 'C':
 			query_type = IB_MAD_ATTR_CLASS_PORT_INFO;
 			break;
+		case 'S':
+			query_type = IB_MAD_ATTR_SERVICE_RECORD;
+			break;
 		case 'N':
 			query_type = IB_MAD_ATTR_NODE_RECORD;
 			break;
@@ -871,6 +994,9 @@ main(int argc, char **argv)
 	case IB_MAD_ATTR_MCMEMBER_RECORD:
 		status = print_multicast_group_records(bind_handle, members);
 		break;
+	case IB_MAD_ATTR_SERVICE_RECORD:
+		status = print_service_records(bind_handle);
+		break;
 	default:
 		fprintf(stderr, "Unknown query type %d\n", query_type);
 		status = IB_UNKNOWN_ERROR;
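
As a side note on the ServiceKey formatting in print_service_record() above: the sixteen-argument sprintf() could also be written as a loop. The following is only a sketch of that alternative, not part of the patch; format_service_key and SERVICE_KEY_LEN are hypothetical names, and the 35-byte minimum matches the buf_service_key size used in the patch ("0x" + 32 hex digits + NUL).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SERVICE_KEY_LEN 16

/*
 * Write "0x" followed by two lowercase hex digits per key byte into buf.
 * Returns 0 on success, -1 if buf cannot hold the 34 characters plus NUL.
 */
static int format_service_key(const uint8_t key[SERVICE_KEY_LEN],
			      char *buf, size_t buflen)
{
	size_t off;
	int i;

	assert(key != NULL && buf != NULL);
	if (buflen < 2 + 2 * SERVICE_KEY_LEN + 1)
		return -1;

	off = (size_t)snprintf(buf, buflen, "0x");
	for (i = 0; i < SERVICE_KEY_LEN; i++)
		off += (size_t)snprintf(buf + off, buflen - off,
					"%02x", key[i]);
	return 0;
}
```

The output format is the same as in the patch; the loop just avoids spelling out every array index by hand.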


From rdreier at cisco.com  Thu Dec  7 14:44:59 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 Dec 2006 14:44:59 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com> (
	Eric Barton's message of "Thu, 7 Dec 2006 11:04:22 GMT")
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
Message-ID: <ada4ps79vh0.fsf@cisco.com>

 > ...but is this the right thing to do?  It's the "USER" in
 > IB_USER_VERBS_ABI_VERSION that's making me nervous since this is kernel code.

No, this is utterly wrong -- the userspace verbs ABI has nothing to do
with the in-kernel API (which changes at any time with no notice).

 > Actually a single OFED version #define would most probably suit my purposes -
 > is that controversial?

It might be sensible for OFED to supply that, if it's going to
backport drivers to old kernels.  But you should also cope with
non-OFED (vanilla upstream) drivers, probably by testing
LINUX_VERSION_CODE too I suppose.
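
A rough sketch of what such a check could look like follows. OFED_VERSION_CODE is purely hypothetical here (no such macro is known to exist; it stands for whatever single #define OFED might supply), and the 2.6.19 threshold is only an example. KERNEL_VERSION and LINUX_VERSION_CODE normally come from <linux/version.h>; they are given stand-in definitions so that the sketch also builds in user space.

```c
#include <assert.h>

/* Stand-ins so the sketch builds outside the kernel tree. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 6, 19)	/* example value */
#endif

/* HYPOTHETICAL: a version define an OFED backport might provide. */
/* #define OFED_VERSION_CODE KERNEL_VERSION(1, 2, 0) */

enum ib_api_flavour {
	IB_API_OFED_BACKPORT,	/* OFED drivers backported to an old kernel */
	IB_API_UPSTREAM_NEW,	/* vanilla kernel, 2.6.19 or later */
	IB_API_UPSTREAM_OLD	/* vanilla kernel older than 2.6.19 */
};

/* Pick the API flavour at compile time: OFED define first, kernel version second. */
static enum ib_api_flavour ib_api_flavour(void)
{
	/* Sanity check: the encoding orders versions numerically. */
	assert(KERNEL_VERSION(2, 6, 19) > KERNEL_VERSION(2, 6, 9));
#if defined(OFED_VERSION_CODE)
	return IB_API_OFED_BACKPORT;
#elif LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 19)
	return IB_API_UPSTREAM_NEW;
#else
	return IB_API_UPSTREAM_OLD;
#endif
}
```

Checking the OFED define first matters because a backport can put newer IB code on top of an old LINUX_VERSION_CODE.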


From halr at voltaire.com  Thu Dec  7 14:48:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 07 Dec 2006 17:48:08 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
Message-ID: <1165531651.25587.204056.camel@hal.voltaire.com>

Hi Eitan,

Just wanted to close the loop on the OpenSM issues of the last couple
days.

1. When can you supply an OpenSM verbose log for the InformInfo
subscribe problem you reported earlier today ? Failing that, I don't
know how to reproduce this.

2. With the latest tree, do your simulation tests now work ? The
osm.fdbs UNREACHABLE was only a problem with the file and not with the
LFTs in the network.

3. In terms of file format changes, the lack of any file versioning
makes it difficult to move these forward when the need arises. (The
format change to osm.mcfdbs was unintentional (not by design)).

4. I encourage you to look at and comment on the OpenSM patches rather
than waiting for them to be in the tree.

Thanks for your help in finding the bugs sooner.

-- Hal


From rdreier at cisco.com  Thu Dec  7 16:06:15 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 07 Dec 2006 16:06:15 -0800
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <20061206072604.GC26787@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 6 Dec 2006 09:26:04 +0200")
References: <adavekqau41.fsf@cisco.com> <20061206072604.GC26787@mellanox.co.il>
Message-ID: <adaodqf8d54.fsf@cisco.com>

 > I know. Still, this only happens if you enable CM. Maybe it will help
 > to mention this in the comment in KConfig? Log a message as well?

Logging a message might help a tiny bit.  But the Kconfig help text is
useless -- most naive users will be running distro kernels or OFED,
which I assume will enable CM by default.

 > I have a notion that once this code is upstream we can work on
 > ways to teach kernel about net devices where MTU changes dynamically.
 > Or possibly, some tricks with icmp can make it work.

I think it would be better to use ethtool or something similar to
explicitly enable CM.  At least until there's a way to make multicast
work on an interface using CM.


From sweitzen at cisco.com  Thu Dec  7 16:35:44 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 7 Dec 2006 16:35:44 -0800
Subject: [openib-general] Multicast Group Routing Question
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9C99@xmb-sjc-216.amer.cisco.com>

What OS and kernel are you using?  I just took a closer look on RHEL4 U4
2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1,
where sending IP multicast traffic causes the data to go to all hosts.
I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.

This looks to be related to
http://openib.org/bugzilla/show_bug.cgi?id=266.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell
> Sent: Wednesday, December 06, 2006 9:53 AM
> To: openib-general at openib.org
> Subject: [openib-general] Multicast Group Routing Question
> 
> Hello,
> 
>   I was testing our code and noticed that when I send data using 
> multicast over our ib0 interface, all of the infiniband 
> switches route 
> the data to each switch and each node instead of a node that has an 
> application listening to the interface like Ethernet. Is this 
> by design?
> 
> Thanks in advance,
> 
> Sean
> 


From bugzilla-daemon at openib.org  Thu Dec  7 16:41:55 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu,  7 Dec 2006 16:41:55 -0800 (PST)
Subject: [openib-general] [Bug 266] IPoIB multicast does not work with RHEL4
	U4
Message-ID: <20061208004155.84F972283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=266


------- Comment #4 from sweitzen at cisco.com  2006-12-07 16:41 -------
What OS and kernel are you using?  I just took a closer look on RHEL4 U4
2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1,
where sending IP multicast traffic causes the data to go to all hosts.
I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.


> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell
> Sent: Wednesday, December 06, 2006 9:53 AM
> To: openib-general at openib.org
> Subject: [openib-general] Multicast Group Routing Question
> 
> Hello,
> 
>   I was testing our code and noticed that when I send data using 
> multicast over our ib0 interface, all of the infiniband 
> switches route 
> the data to each switch and each node instead of a node that has an 
> application listening to the interface like Ethernet. Is this 
> by design?
> 
> Thanks in advance,
> 
> Sean


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From robert.j.woodruff at intel.com  Thu Dec  7 16:41:19 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 7 Dec 2006 16:41:19 -0800
Subject: [openib-general] [TRIVIAL] ipoib connected mode makefile bug
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014BE9DC@orsmsx418.amr.corp.intel.com>

I tried to build the ipoib connected mode support and had to modify the 
IPoIB Makefile with the following patch to make it build correctly.


woody
 

--- drivers/infiniband/ulp/ipoib/Makefile       2006-12-07 15:39:51.000000000 -0800
+++ drivers/infiniband/ulp/ipoib/Makefile.new   2006-12-07 16:35:08.000000000 -0800
@@ -6,5 +6,5 @@ ib_ipoib-y                                      := ipoib_main.o \
                                                   ipoib_verbs.o \
                                                   ipoib_vlan.o
 ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG)      += ipoib_fs.o
-ib_ipoib-$(INFINIBAND_IPOIB_CM)                += ipoib_cm.o
+ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM)         += ipoib_cm.o


From yhkim93 at keti.re.kr  Thu Dec  7 18:49:27 2006
From: yhkim93 at keti.re.kr (=?euc-kr?B?sei/tciv?=)
Date: Fri, 8 Dec 2006 11:49:27 +0900 (KST)
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
Message-ID: <28996683.1165546167039.JavaMail.kebi@nuri>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061208/e83518bb/attachment.html>

From mst at mellanox.co.il  Fri Dec  8 03:04:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 8 Dec 2006 13:04:04 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9920@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9920@xmb-sjc-216.amer.cisco.com>
Message-ID: <20061208110404.GA31845@mellanox.co.il>

>> > You can't send UDP/multicast traffic at all between IPoIB 
>> CM and IPoIB
>> > UD?
>> 
>> With my experimental code, this currently works only if you 
>> manually limit the MTU
>> for multicast/UD addresses.
>> The simplest way to do this is to set up separate interfaces 
>> for CM and UD modes.
>
>Separate interfaces as in ib0 vs ib1?
>Thus I can use IPoIB HA or IPoIB
>CM but not both, which is not very useful.

There are many ways to use both IPoIB HA and IPoIB CM at the same time.
You can create a child interface and use that for IPoIB CM.
Or you can force lower MTU for UD destinations in
the routing table.

>Speaking of IPoIB CM, will it work with the OFED IPoIB HA?

Should work.

-- 
MST


From mst at mellanox.co.il  Fri Dec  8 03:08:01 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 8 Dec 2006 13:08:01 +0200
Subject: [openib-general] [Bug 266] IPoIB multicast does not work with
	RHEL4 U4
In-Reply-To: <20061208004155.84F972283D4@openib.ca.sandia.gov>
References: <20061208004155.84F972283D4@openib.ca.sandia.gov>
Message-ID: <20061208110801.GB31845@mellanox.co.il>

This is a bug in RHEL4 U4.
The issue is documented in OFED release notes, the solution
is to stay with U3.

Quoting r. bugzilla-daemon at openib.org <bugzilla-daemon at openib.org>:
Subject: [Bug 266] IPoIB multicast does not work with RHEL4 U4

http://openib.org/bugzilla/show_bug.cgi?id=266


------- Comment #4 from sweitzen at cisco.com  2006-12-07 16:41 -------
What OS and kernel are you using?  I just took a closer look on RHEL4 U4
2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1,
where sending IP multicast traffic causes the data to go to all hosts.
I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.


> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell
> Sent: Wednesday, December 06, 2006 9:53 AM
> To: openib-general at openib.org
> Subject: [openib-general] Multicast Group Routing Question
> 
> Hello,
> 
>   I was testing our code and noticed that when I send data using 
> multicast over our ib0 interface, all of the infiniband 
> switches route 
> the data to each switch and each node instead of a node that has an 
> application listening to the interface like Ethernet. Is this 
> by design?
> 
> Thanks in advance,
> 
> Sean


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

-- 
MST


From mst at mellanox.co.il  Fri Dec  8 03:09:30 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 8 Dec 2006 13:09:30 +0200
Subject: [openib-general] [PATCH] IPoIB CM Experimental support
In-Reply-To: <adaodqf8d54.fsf@cisco.com>
References: <adavekqau41.fsf@cisco.com>
	<20061206072604.GC26787@mellanox.co.il> <adaodqf8d54.fsf@cisco.com>
Message-ID: <20061208110930.GC31845@mellanox.co.il>

>  > I know. Still, this only happens if you enable CM. Maybe it will help
>  > to mention this in the comment in KConfig? Log a message as well?
> 
> Logging a message might help a tiny bit.  But the Kconfig help text is
> useless -- most naive users will be running distro kernels or OFED,
> which I assume will enable CM by default.
> 
>  > I have a notion that once this code is upstream we can work on
>  > ways to teach kernel about net devices where MTU changes dynamically.
>  > Or possibly, some tricks with icmp can make it work.
> 
> I think it would be better to use ethtool or something similar to
> explicitly enable CM.  At least until there's a way to make multicast
> work on an interface using CM.

Thanks for the suggestion, I'll look into that.

-- 
MST


From shubbell at dbresearch.net  Fri Dec  8 05:04:30 2006
From: shubbell at dbresearch.net (Sean Hubbell)
Date: Fri, 08 Dec 2006 07:04:30 -0600
Subject: [openib-general] [Bug 266] IPoIB multicast does not work with
 RHEL4 U4
In-Reply-To: <20061208004155.84F972283D4@openib.ca.sandia.gov>
References: <20061208004155.84F972283D4@openib.ca.sandia.gov>
Message-ID: <457962DE.2080407@dbresearch.net>


Centos Linux neptune 2.6.9-42.0.3.plus.c4smp #1 SMP Fri Oct 6 11:42:04 
CDT 2006 x86_64 GNU/Linux (Isn't CentOS using a lot of the RH rpms?).

Sean

bugzilla-daemon at openib.org wrote:
> http://openib.org/bugzilla/show_bug.cgi?id=266
>
>
>
>
>
> ------- Comment #4 from sweitzen at cisco.com  2006-12-07 16:41 -------
> What OS and kernel are you using?  I just took a closer look on RHEL4 U4
> 2.6.9-42.Elsmp x86_64, and I am seeing the same problem with OFED 1.1,
> where sending IP multicast traffic causes the data to go to all hosts.
> I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.
>
>
>   
>> -----Original Message-----
>> From: openib-general-bounces at openib.org 
>> [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell
>> Sent: Wednesday, December 06, 2006 9:53 AM
>> To: openib-general at openib.org
>> Subject: [openib-general] Multicast Group Routing Question
>>
>> Hello,
>>
>>   I was testing our code and noticed that when I send data using 
>> multicast over our ib0 interface, all of the infiniband 
>> switches route 
>> the data to each switch and each node instead of a node that has an 
>> application listening to the interface like Ethernet. Is this 
>> by design?
>>
>> Thanks in advance,
>>
>> Sean
>>     


From eitan at mellanox.co.il  Fri Dec  8 08:42:13 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 08 Dec 2006 18:42:13 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165531651.25587.204056.camel@hal.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
Message-ID: <457995E5.40303@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> Just wanted to close the loop on the OpenSM issues of the last couple
> days.
>
> 1. When can you supply an OpenSM verbose log for the InformInfo
> subscribe problem you reported earlier today ? Failing that, I don't
> know how to reproduce this.
>   
Attached
> 2. With the latest tree, do your simulation tests now work ? The
> osm.fdbs UNREACHABLE was only a problem with the file and not with the
> LFTs in the network.
>   
Yes they do.
> 3. In terms of file format changes, the lack of any file versioning
> makes it difficult to move these forward when the need arises. (The
> format change to osm.mcfdbs was unintentional (not by design)).
>   
The issues so far were not cases where a file format change was required; 
the changes were unintentional.
When we have a real need to change a file format, I am sure we can 
agree on adding a version and changing all the parsers.
> 4. I encourage you to look at and comment on the OpenSM patches rather
> than waiting for them to be in the tree.
>   
I am sure you did not mean to, but now I have to admit my limited skills 
in catching bugs by reading patches :-( .
Instead of relying on reading patches to find bugs, I use automatic 
regression. I wish we could agree on some regression that
each developer has to run before patches are committed to the trunk.
On my side, I would love to have an automatic way to apply all the 
patches posted (one at a time), run a "dead or alive" check,
and provide feedback. Currently my automation is limited to testing the 
trunk, so I will always be complaining after the patches are
committed. I think this is the way testing works for most other components.

What kind of regression suite do you and Sasha use?
Can we agree on minimal pre-commit testing?
Can we have a branch for that purpose, where all patches first have to 
sit for 2 days? (It would allow for pre-trunk testing.)


> Thanks for your help in finding the bugs sooner.
>
> -- Hal


-------------- next part --------------
A non-text attachment was scrubbed...
Name: ibmgtsim.13801.tar.bz2
Type: application/octet-stream
Size: 505618 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061208/c63e08eb/attachment.obj>

From sweitzen at cisco.com  Fri Dec  8 09:47:51 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Fri, 8 Dec 2006 09:47:51 -0800
Subject: [openib-general] [Bug 266] IPoIB multicast does not work with
	RHEL4 U4
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302AD9E7B@xmb-sjc-216.amer.cisco.com>

The OFED 1.1 IPoIB release notes state "5. On RedHat EL 4 up4, ipoib
multicast group membership does not work due to missing code in the
kernel which was available in u3 and removed in u4.", which is a good
hint, but I just want to clarify that U4 can only receive multicast from
U4, and U4 sends multicast to all nodes.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> Sent: Friday, December 08, 2006 3:08 AM
> To: Scott Weitzenkamp (sweitzen)
> Cc: openib-general at openib.org
> Subject: Re: [Bug 266] IPoIB multicast does not work with RHEL4 U4
> 
> This is a bug in RHEL4 U4.
> The issue is documented in the OFED release notes; the solution
> is to stay with U3.
> 
> Quoting r. bugzilla-daemon at openib.org <bugzilla-daemon at openib.org>:
> Subject: [Bug 266] IPoIB multicast does not work with RHEL4 U4
> 
> http://openib.org/bugzilla/show_bug.cgi?id=266
> 
> 
> 
> 
> 
> ------- Comment #4 from sweitzen at cisco.com  2006-12-07 16:41 -------
> What OS and kernel are you using?  I just took a closer look 
> on RHEL4 U4
> 2.6.9-42.Elsmp x86_64, and I am seeing the same problem with 
> OFED 1.1,
> where sending IP multicast traffic causes the data to go to all hosts.
> I do not see this problem when the sender is SLES10 i686 or RHEL4 U3.
> 
> 
> > -----Original Message-----
> > From: openib-general-bounces at openib.org 
> > [mailto:openib-general-bounces at openib.org] On Behalf Of Sean Hubbell
> > Sent: Wednesday, December 06, 2006 9:53 AM
> > To: openib-general at openib.org
> > Subject: [openib-general] Multicast Group Routing Question
> > 
> > Hello,
> > 
> >   I was testing our code and noticed that when I send data using 
> > multicast over our ib0 interface, all of the infiniband 
> > switches route 
> > the data to each switch and each node instead of a node that has an 
> > application listening to the interface like Ethernet. Is this 
> > by design?
> > 
> > Thanks in advance,
> > 
> > Sean
> 
> -- 
> MST
> 


From or.gerlitz at gmail.com  Fri Dec  8 10:45:00 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 8 Dec 2006 20:45:00 +0200
Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support
 for operation over IPoIB
In-Reply-To: <A3C7F222C349444DA8B01D881740A0030109116F@xmb-sjc-226.amer.cisco.com>
References: <A3C7F222C349444DA8B01D881740A0030109116F@xmb-sjc-226.amer.cisco.com>
Message-ID: <15ddcffd0612081045s569bd04at8489f35e32fe6bcc@mail.gmail.com>

On 12/8/06, Carl Yang (caryang) <caryang at cisco.com> wrote:
> Can you please forward me (or to the email alias) "an example bonding
> sysfs script which can be used to set bonding to work with patches 1-3?"

Sure, I did that when I sent the patches; you can find it here:
http://marc.theaimsgroup.com/?l=linux-netdev&m=116488445829045&w=2

Or.


From halr at voltaire.com  Fri Dec  8 11:23:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 14:23:36 -0500
Subject: [openib-general] OpenSM/osm_remote_sm.h: Eliminate unused is_opensm
	boolean
Message-ID: <1165605794.25587.256398.camel@hal.voltaire.com>

OpenSM/osm_remote_sm.h: Eliminate unused is_opensm boolean

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_remote_sm.h b/osm/include/opensm/osm_remote_sm.h
index 68359c6..6e67b7c 100644
--- a/osm/include/opensm/osm_remote_sm.h
+++ b/osm/include/opensm/osm_remote_sm.h
@@ -97,7 +97,6 @@ typedef struct _osm_remote_sm
 	cl_map_item_t				map_item;
 	const osm_port_t			*p_port;
 	ib_sm_info_t				smi;
-	boolean_t				is_opensm;
 } osm_remote_sm_t;
 /*
 * FIELDS
@@ -109,10 +108,6 @@ typedef struct _osm_remote_sm
 *	smi
 *		The SMInfo attribute for this SM.
 *
-*	is_opensm
-*		TRUE if this SM is an OpenSM.
-*		FALSE otherwise.
-*
 * SEE ALSO
 *********/
 

From eric at barton.org.uk  Fri Dec  8 11:55:44 2006
From: eric at barton.org.uk (Eric Barton)
Date: Fri, 8 Dec 2006 19:55:44 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <ada4ps79vh0.fsf@cisco.com>
Message-ID: <045401c71b02$d8d17a40$0281a8c0@ebpc>

>  > Actually a single OFED version #define would most probably 
>  > suit my purposes -
>  > is that controversial?
> 
> It might be sensible for OFED to supply that, if it's going to
> backport drivers to old kernels.  But you should also cope with
> non-OFED (vanilla upstream) drivers, probably by testing
> LINUX_VERSION_CODE too I suppose.

How about an OpenFabrics API version #define?

Living in hope...
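
A minimal sketch of how out-of-tree code might combine the two tests
discussed above. LINUX_VERSION_CODE and KERNEL_VERSION are the real
kernel macros (normally provided by <linux/version.h>); OFED_VERSION,
OFED_VERSION_CODE, HAVE_NEW_IB_API and the 1.1 / 2.6.19 thresholds are
purely hypothetical placeholders, since no such OFED define exists as of
this thread.

```c
#include <assert.h>

/* In a real kernel module these two come from <linux/version.h>.
 * They are defined here only so the sketch builds in user space. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
#endif
#ifndef LINUX_VERSION_CODE
#define LINUX_VERSION_CODE KERNEL_VERSION(2, 6, 9) /* e.g. a RHEL4 kernel */
#endif

/* HYPOTHETICAL: an OFED-supplied version define, as wished for above.
 * Left undefined here to model a vanilla (non-OFED) build. */
#define OFED_VERSION(maj, min) (((maj) << 8) + (min))
/* #define OFED_VERSION_CODE OFED_VERSION(1, 1) */

/* Prefer the OFED define when present (backported drivers), otherwise
 * fall back to the kernel version. Thresholds are placeholders. */
#if defined(OFED_VERSION_CODE) && OFED_VERSION_CODE >= OFED_VERSION(1, 1)
#define HAVE_NEW_IB_API 1
#elif LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 19)
#define HAVE_NEW_IB_API 1
#else
#define HAVE_NEW_IB_API 0
#endif

/* Report which branch the preprocessor selected. */
int have_new_ib_api(void)
{
    /* KERNEL_VERSION packs one byte per component. */
    assert(KERNEL_VERSION(2, 6, 9) == 0x020609);
    return HAVE_NEW_IB_API;
}
```

With the defaults above (2.6.9 kernel, no OFED define) the function
returns 0; uncommenting the OFED_VERSION_CODE line flips it to 1 without
touching the kernel version test.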

                Cheers,
                        Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: eeb at bartonsoftware.com|
---------------------------------------------------


From vu at mellanox.com  Fri Dec  8 12:10:43 2006
From: vu at mellanox.com (Vu Pham)
Date: Fri, 08 Dec 2006 12:10:43 -0800
Subject: [openib-general] nfsrdma server stop responding,
Message-ID: <4579C6C3.5090207@mellanox.com>

Hi James,
   I got these errors in the server's /var/log/messages, and then the 
server stopped responding to login, I/O, etc.; however, the server is 
still up and ipoib is still working.


Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]  [<ffffffff8025dff7>] put_page+0x17/0x40
Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 00010246
Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000000003ffff
Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8102274e92f8
Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034 R09: 0000000000000000
Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81020ef96800
Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000 R15: ffff8102053ee890
Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000) GS:ffff81022066eb40(0000) knlGS:0000000000000000
Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000 CR4: 00000000000006e0
Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo ffff810219dde000, task ffff81020d87f0c0)
Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 ffff81020ef96968 ffff81020ef96800 ffff81020ef96958
Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90 ffffffff80424e05 0000000000000000
Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90 ffffffff80239b90 ffff81020d87f0c0
Dec  8 06:38:21 ibd201 kernel: Call Trace:
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>] :sunrpc:svc_rdma_put_context+0x37/0xd0
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>] :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>] schedule_timeout+0x95/0xb0
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] process_timeout+0x0/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>] wait_for_completion_timeout+0xcd/0x150
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>] default_wake_function+0x0/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>] :ib_mthca:mthca_cmd_post+0x232/0x260
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>] default_wake_function+0x0/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] __next_cpu+0x19/0x30
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>] find_busiest_group+0x24e/0x6d0
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] thread_return+0x0/0xde
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>] _spin_unlock_irqrestore+0x8/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>] try_to_del_timer_sync+0x51/0x60
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] del_timer_sync+0xc/0x20
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>] schedule_timeout+0x95/0xb0
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>] :sunrpc:svc_recv+0x416/0x510
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>] default_wake_function+0x0/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>] default_wake_function+0x0/0x10
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] :nfsd:nfsd+0x111/0x380
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] child_rip+0xa/0x12
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] child_rip+0x0/0x12
Dec  8 06:38:21 ibd201 kernel:
Dec  8 06:38:21 ibd201 kernel:
Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08 0f 94 c0 84 c0 74
Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] put_page+0x17/0x40
Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>

-vu


From rdreier at cisco.com  Fri Dec  8 12:17:54 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 08 Dec 2006 12:17:54 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <045401c71b02$d8d17a40$0281a8c0@ebpc> (Eric Barton's
	message of "Fri, 8 Dec 2006 19:55:44 -0000")
References: <045401c71b02$d8d17a40$0281a8c0@ebpc>
Message-ID: <adapsau6t1p.fsf@cisco.com>

 > How about an OpenFabrics API version #define?

No other kernel subsystem has one, so I don't think it's realistic to
expect one for IB.

 - R.


From halr at voltaire.com  Fri Dec  8 13:05:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 16:05:46 -0500
Subject: [openib-general] [PATCH][TRIVIAL] osmtest/osmtest.c: Fix endian of
 capability mask output
Message-ID: <1165611934.26559.214.camel@hal.voltaire.com>

osmtest/osmtest.c: Fix endian of capability mask output

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index b3f2bb4..6a571f5 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -3749,7 +3749,8 @@ osmtest_validate_port_data( IN osmtest_t
              "Field mismatch port LID 0x%X Num:0x%X\n"
              "\t\t\t\tExpected capability_mask 0x%X, received 0x%X\n",
              cl_ntoh16( p_rec->lid ), p_rec->port_num,
-             p_port->rec.port_info.capability_mask, p_rec->port_info.capability_mask );
+             cl_ntoh32( p_port->rec.port_info.capability_mask ),
+             cl_ntoh32( p_rec->port_info.capability_mask ) );
     status = IB_ERROR;
     goto Exit;
   }
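
For readers unfamiliar with the issue, here is a small user-space sketch
of why the conversion matters. The capability mask is kept in network
byte order, so printing it raw shows byte-swapped digits on a
little-endian host. The patch uses complib's cl_ntoh32; this sketch
substitutes the standard ntohl() as a stand-in, and the function name
format_capability_mask and the sample value are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a capability mask that is stored in network byte order.
 * ntohl() converts it to host order first, so the printed hex digits
 * are the same on big- and little-endian machines.
 * Returns the snprintf() result. */
int format_capability_mask(uint32_t mask_net, char *buf, size_t len)
{
    assert(buf != NULL);
    return snprintf(buf, len, "0x%X", (unsigned int) ntohl(mask_net));
}
```

Without the ntohl() call, a little-endian host would print the bytes of
the mask in reverse order, which is the misleading output the patch
removes from the mismatch message.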


From sashak at voltaire.com  Fri Dec  8 13:55:23 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Fri, 8 Dec 2006 23:55:23 +0200
Subject: [openib-general] [PATCH] osm: Routing Tables are full of
 UNREACHABLE instead of real route
In-Reply-To: <45782F7B.1010408@mellanox.co.il>
References: <45782F7B.1010408@mellanox.co.il>
Message-ID: <20061208215523.GF9193@sashak.voltaire.com>

Hi Eitan,

On 17:12 Thu 07 Dec     , Eitan Zahavi wrote:
> Hi Hal,
> 
> I resolved the mystery behind the osm.fdbs that is now full of 
> UNREACHABLE instead of correct out ports.
> 
> The problem is a consequence of the new code that does not use the 
> switch LFT blocks for the intermediate LFT assignments:
> The idea of having incremental updates only relies on temporary buffer 
> that the routing algorithm fills.
> Then it is sent to the wire only if there is a diff between the switch 
> LFT tables (from the SMDB) and the temporary buffer.
> 
> So the switch LFT tables are not being directly updated by the routing 
> algorithm - but only by the GetResp obtained as
> reply to the setting. Until this stage of the description - everything 
> looks right.
> 
> But what is wrong is that the dump of LFT tables is invoked before the 
> GetResp is obtained.
> So if only a single sweep is invoked the resulting osm.fdbs show the 
> original state of the SMDB tables which is full of 0xFF = UNREACHABLE.

Right.

> 
> The patch below is taking the easy way and should be probably revisited. 
> Instead of having a separate algorithm step for dumping out the 
> resulting GetResp data after all LFT responses were obtained it just 
> copies the sent LFT blocks to the SMDB.

Wouldn't it be better to just move all the dumps to the end of the OpenSM
heavy sweep? This should be simple, right?
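
The sequence described in the quoted text can be sketched as a toy model
in plain C. This is not OpenSM code: toy_switch, toy_set_fwd_table and
the table sizes are hypothetical names and values chosen for
illustration. Only the 64-entry block size and the 0xFF = UNREACHABLE
convention come from the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_LFT_SIZE   128   /* toy table size (assumption) */
#define TOY_BLOCK_SIZE 64    /* LFT entries per block, as in the patch */
#define TOY_UNREACHABLE 0xFF

/* The SM's local (SMDB) copy of one switch's LFT. */
struct toy_switch {
    uint8_t lft[TOY_LFT_SIZE];
};

void toy_switch_init(struct toy_switch *sw)
{
    assert(sw != NULL);
    memset(sw->lft, TOY_UNREACHABLE, sizeof(sw->lft));
}

/* Compare the routing engine's temporary buffer against the local copy
 * and "send" only the blocks that differ. Returns the number of blocks
 * sent. If copy_on_send is non-zero, the sent block is also copied into
 * the local table right away, which models the HACK in the patch;
 * otherwise the local table waits for the GetResp. */
int toy_set_fwd_table(struct toy_switch *sw, const uint8_t *lft_buf,
                      int copy_on_send)
{
    int block, sent = 0;

    assert(sw != NULL && lft_buf != NULL);
    for (block = 0; block < TOY_LFT_SIZE / TOY_BLOCK_SIZE; block++) {
        const uint8_t *src = lft_buf + block * TOY_BLOCK_SIZE;
        uint8_t *dst = sw->lft + block * TOY_BLOCK_SIZE;

        if (memcmp(src, dst, TOY_BLOCK_SIZE) == 0)
            continue;           /* no diff: nothing goes on the wire */
        sent++;                 /* stands in for sending the Set MAD */
        if (copy_on_send)
            memcpy(dst, src, TOY_BLOCK_SIZE);
    }
    return sent;
}

/* What a dump of the local table would report as UNREACHABLE. */
int toy_count_unreachable(const struct toy_switch *sw)
{
    int i, n = 0;

    assert(sw != NULL);
    for (i = 0; i < TOY_LFT_SIZE; i++)
        if (sw->lft[i] == TOY_UNREACHABLE)
            n++;
    return n;
}
```

Calling toy_set_fwd_table() with copy_on_send set to 0 and dumping
immediately reports every entry as UNREACHABLE even though the blocks
were sent, which matches the symptom; dumping only after the responses
have been applied (the end-of-sweep approach) avoids the need for the
copy.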

Sasha

> 
> I think we need to have at least this simple patch until we have the 
> dump move to a new algorithm step.
> 
> Thanks
> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> =====================================================================
> 
> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index 5a55da8..3a62c7f 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c
> @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table(
>                "osm_ucast_mgr_set_fwd_table: ERR 3A05: "
>                "Sending linear fwd. tbl. block failed (%s)\n",
>                ib_get_err_str( status ) );
> -    }
> +    } else {
> +       /*
> +         HACK: for now we will assume we succeeded to send
> +         and set the local DB based on it. This should allow
> +         us to immediatly dump out our routing
> +       */
> +       osm_switch_set_ft_block(
> +          p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho);
> +        }
>   }
> 
>   OSM_LOG_EXIT( p_mgr->p_log );
> 


From sashak at voltaire.com  Fri Dec  8 14:10:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 00:10:01 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457995E5.40303@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
Message-ID: <20061208221001.GG9193@sashak.voltaire.com>

On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> >Hi Eitan,
> >
> >Just wanted to close the loop on the OpenSM issues of the last couple
> >days.
> >
> >1. When can you supply an OpenSM verbose log for the InformInfo
> >subscribe problem you reported earlier today ? Failing that, I don't
> >know how to reproduce this.
> >  
> Attached
> >2. With the latest tree, do your simulation tests now work ? The
> >osm.fdbs UNREACHABLE was only a problem with the file and not with the
> >LFTs in the network.
> >  
> Yes they do.
> >3. In terms of file format changes, the lack of any file versioning
> >makes it difficult to move these forward when the need arises. (The
> >format change to osm.mcfdbs was unintentional (not by design)).
> >  
> The issues until now were not that a file format change was required but 
> were unintentional.
> When we will have a real need to change file format I am sure we can 
> agree on adding version and change all parsers.
> >4. I encourage you to look at and comment on the OpenSM patches rather
> >than waiting for them to be in the tree.
> >  
> I am sure you did not mean to, but now I have to admit my limited skills 
> in catching bugs by reading patches :-( .
> Instead on relying on bug reading I use automatic regression. I wish we 
> could agree on some regression that
> each developer will have to run before patches are committed to the trunk.
> On my side I would love to have an automatic way to include all the 
> patches posted (one at a time) run "dead or alive" check
> and provide feedback. Currently my automation is limited to testing the 
> trunk. So I will always be complaining after the patches are
> committed. I think this is the way most other components testing works.
> 
> What kind of regression suite do you and Sasha use?

On my side it clearly depends on the kind of change. In general I would
call this "unit testing".

> Can we agree on minimal pre-commit testing?
> Can we have a branch for that sake where all patches will first have to 
> go into for 2 days? (it will allow for pre-trunk testing).

One more development branch? Will you test (or even see) this? If so I
can publish the "fresh" tree.

Sasha


From venkatesh.babu at 3leafnetworks.com  Fri Dec  8 14:12:03 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 14:12:03 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1164674885.11808.760.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
Message-ID: <4579E333.4000901@3leafnetworks.com>


 I have the same problem with the OFED 1.1 stack as well, but it occurs 
less often. I had to try 120 failovers (by rebooting the highest-priority 
OpenSM server) before hitting this problem. In this state OpenSM 
doesn't write anything to the log files, doesn't assign LIDs to the 
other nodes, and doesn't respond to multicast join operations. Even 
if another OpenSM is started on another node with a higher priority, it 
cannot become the master. The only way to recover is to kill 
the stuck OpenSM.

 VBabu

Hal Rosenstock wrote:

>I don't see any explicit changes to the SM state machine which would
>affect this but as I have mentioned before there are many bug fixes in
>OFED 1.1. I can't conclusively state whether this would fix the issue
>you see but would be in a much better position to try to figure this
>out.
>
>-- Hal
>
>  
>
>> Hi
>>
>>   I have topology of two switches and a bunch of nodes, with each 
>> node having 2port HCAs. Port1 of every node connects to switch1 and 
>> Port2 of every node connects to switch2. So Port1 and Port2 are in 
>> different subnets. So I am running one OpenSM (from OFED 1.0) for 
>> each port on one node designated as a server. To guard against that 
>> server going down I have another server node to run the OpenSM in 
>> "standby" mode for each port. I will adjust the priorities such that 
>> first server always has "master" OpenSM and second server has 
>> "standby" OpenSM.
>>
>>    When the first server rebooted, "standby" OpenSM should takeover 
>> the mastership. It usually works fine but sometimes it is failing to 
>> takeover. In the following example OpenSM for Port1 failed to 
>> takeover, but OpenSM for Port2 took over and became "master". The 
>> OpenSM for Port1 seems to be stuck in some weird state (strace shows 
>> that it is sleeping). It is no longer assigning LIDs to the rest of 
>> the nodes in the subnet and not responding to the broadcast joins. 
>> The log file shows nothing for the past 4 days. I have the complete log 
>> files if needed.
>>
>>    Is this a known problem and fixed in OFED 1.1 ?
>>
>> [root at vortex3l-72 158]# ibv_devinfo
>> hca_id: mthca0
>>        fw_ver:                         5.1.400
>>        node_guid:                      0050:4501:4b1a:0000
>>        sys_image_guid:                 0050:4501:4b1a:0003
>>        vendor_id:                      0x02c9
>>        vendor_part_id:                 25218
>>        hw_ver:                         0xA0
>>        board_id:                       ARM0020000001
>>        phys_port_cnt:                  2
>>                port:   1
>>                        state:                  PORT_ACTIVE (4)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 7
>>                        port_lid:               1
>>                        port_lmc:               0x00
>>
>>                port:   2
>>                        state:                  PORT_ACTIVE (4)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 1
>>                        port_lid:               1
>>                        port_lmc:               0x00
>>
>> [root at vortex3l-72 158]# ps -aux | grep open
>> Warning: bad syntax, perhaps a bogus '-'? See 
>> /usr/share/doc/procps-3.2.3/FAQ
>> root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 
>> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f 
>> /var/log/opensm2.log
>> root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 
>> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f 
>> /var/log/opensm1.log
>> root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
>> [root at vortex3l-72 158]# strace -p7975
>> Process 7975 attached - interrupt to quit
>> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0}, NULL)                = 0
>> nanosleep({10, 0},  <unfinished ...>
>> Process 7975 detached
>> [root at vortex3l-72 158]# uptime
>> 12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
>> [root at vortex3l-72 158]# date
>> Mon Nov 27 12:13:05 PST 2006
>> [root at vortex3l-72 158]#  tail /var/log/opensm1.log
>> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 
>> 3673M
>>
>> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:66 from LID:0x0000 
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:66 from LID:0x0000 
>> GID:0xfe80000000000000,0x0000000000000000
>> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 
>> 0x5045014b1a0001
>> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 
>> 0x5045014b1a0001
>> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
>>
>> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
>>
>> [root at vortex3l-72 158]#  tail /var/log/opensm2.log
>>                                00 00 00 00 00 00 00 00   00 00 00 00 
>> 00 00 00 00
>>
>> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting 
>> Generic Notice type:3 num:65 from LID:0x0001 
>> GID:0xfe80000000000000,0x005045014b1a0002
>> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0002
>> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0003
>> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: 
>> Cannot find destination port with LID:0x0004
>> Nov 27 12:10:32 146382 [41401960] -> Removed port with 
>> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
>> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: 
>> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to 
>> light sweep sampling list
>> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
>>                                Path = [0][2]
>>


From adit.262 at gmail.com  Fri Dec  8 14:31:49 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Fri, 8 Dec 2006 17:31:49 -0500
Subject: [openib-general] Assigning IP addresses to IB interfaces
Message-ID: <d2ad857f0612081431q6decd412o2718019aaed1ae03@mail.gmail.com>

Hi,

I have installed the OpenIB gen2 driver, but the IB interfaces haven't
been assigned any IP addresses.
Is it possible to assign them IP addresses using ifconfig and ping
between the interfaces of two machines?

Thanks,
Regards,
Adit

-- 


Adit Ranadive
MS CS Candidate
Georgia Institute of Technology,
Atlanta, GA


From halr at voltaire.com  Fri Dec  8 14:33:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 17:33:18 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457995E5.40303@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
Message-ID: <1165617195.26559.4435.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > Just wanted to close the loop on the OpenSM issues of the last couple
> > days.
> >
> > 1. When can you supply an OpenSM verbose log for the InformInfo
> > subscribe problem you reported earlier today ? Failing that, I don't
> > know how to reproduce this.
> >   
> Attached

Hmmm....

osmtest seems to fail much earlier than OpenSM unless I am mistaken.
OpenSM sees the final InformInfo unsubscribe (cleanup) and fails on
that. I thought the osmtest side failed earlier.

In a number of places in osm.log, I see:
Dec 08 18:17:02 266690 [B2562BB0] -> __osmv_dispatch_rmpp_mad: [
Dec 08 18:17:02 266707 [B2562BB0] -> __osmv_dispatch_rmpp_snd: [
Dec 08 18:17:02 266723 [B2562BB0] -> Not supposed to receive DATA packets --> dropping the MAD
Dec 08 18:17:02 266739 [B2562BB0] -> __osmv_dispatch_rmpp_snd: ]
Dec 08 18:17:02 266755 [B2562BB0] -> __osmv_dispatch_rmpp_mad: ]
Is that supposed to happen ? What does that mean ? Does that mess things up ?

SA GetTable InformInfoRecord
Dec 08 18:17:02 265333 [B6B69BB0] -> osm_infr_rcv_process_get_method: Query Subscriber GID:0x0000000000000000 : 0x0000000000000000(00) Enum:0x0(01)
Dec 08 18:17:02 265370 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: [
Dec 08 18:17:02 265388 [B2562BB0] -> osmv_dispatch_mad: ]
Dec 08 18:17:02 265406 [B6B69BB0] -> osm_infr_get_by_enum: [
Dec 08 18:17:02 265424 [B2562BB0] -> __osmv_ibms_receiver_callback: ]
Dec 08 18:17:02 265443 [B6B69BB0] -> osm_infr_get_by_enum: ]
Dec 08 18:17:02 265482 [B6B69BB0] -> __osm_sa_inform_info_rec_by_comp_mask: ]
Dec 08 18:17:02 265499 [B6B69BB0] -> osm_infr_rcv_process_get_method: Returning 1 records

SA Set InformInfo 
Dec 08 18:17:02 269386 [B756ABB0] -> osm_infr_rcv_process_set_method: UnSubscribe Request with QPN: 0x000001
Dec 08 18:17:02 269421 [B756ABB0] -> osm_infr_get_by_rec: [
Dec 08 18:17:02 269439 [B2562BB0] -> <-- Released lock 0x8d79c20 on bind handle 0x8d79c10
Dec 08 18:17:02 269457 [B756ABB0] -> __dump_all_informs: [
Dec 08 18:17:02 269476 [B2562BB0] -> osmv_dispatch_mad: ]
Dec 08 18:17:02 269496 [B756ABB0] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0x0
                                lid_range_end...........0x0
                                is_generic..............0x0
                                subscribe...............0x1
                                trap_type...............0x0
                                dev_id..................0x0
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                vendor_id...............0x000000
Dec 08 18:17:02 269513 [B2562BB0] -> __osmv_ibms_receiver_callback: ]
Dec 08 18:17:02 269532 [B756ABB0] -> __dump_all_informs: ]
Dec 08 18:17:02 269566 [B756ABB0] -> osm_infr_get_by_rec: Looking for Inform Record
Dec 08 18:17:02 269582 [B756ABB0] -> InformInfo dump:
                                gid.....................0x0000000000000000 : 0x0000000000000000
                                lid_range_begin.........0x0
                                lid_range_end...........0x0
                                is_generic..............0x0
                                subscribe...............0x0
                                trap_type...............0x0
                                dev_id..................0x0
                                qpn.....................0x000001
                                resp_time_val...........0x0
                                vendor_id...............0x000000
Dec 08 18:17:02 269625 [B756ABB0] -> osm_infr_get_by_rec: InformInfo list size 1
Dec 08 18:17:02 269650 [B756ABB0] -> __match_inf_rec: [
Dec 08 18:17:02 269673 [B756ABB0] -> __match_inf_rec: Differ by Address
Dec 08 18:17:02 269698 [B756ABB0] -> __match_inf_rec: ]
Dec 08 18:17:02 269724 [B756ABB0] -> osm_infr_get_by_rec: ]
Dec 08 18:17:02 269751 [B756ABB0] -> osm_infr_rcv_process_set_method: ERR 4307: Failed to UnSubscribe to non existing inform object

Dec 08 18:17:02 269914 [B756ABB0] -> SA MAD dump:
                                base_ver................0x1
                                mgmt_class..............0x3
                                class_ver...............0x2
                                method..................0x81 (SubnAdmGetResp)
                                status..................0x200
                                resv....................0x0
                                trans_id................0x360600000033
                                attr_id.................0x3 (InformInfo)

It looks like the OpenSM side fails on the following:
  if ( memcmp(&p_infr->report_addr,
              &p_infr_rec->report_addr,
              sizeof(p_infr_rec->report_addr)) )
  {
     osm_log( p_log, OSM_LOG_DEBUG,
              "__match_inf_rec: "
              "Differ by Address\n" );
     goto Exit;
  }

Not sure why that is. Guess it needs to be debugged...
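
One way to narrow this down, sketched here under assumptions, is to report
which byte of the two addresses differs before bailing out. The struct below
is a hypothetical stand-in (the real report_addr layout is not reproduced
here), and first_diff() is an illustrative helper, not an existing OpenSM
function:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the address being compared; the field names
 * and layout are assumptions made for this sketch only. */
struct demo_addr {
    unsigned short lid;
    unsigned char  sl;
    unsigned char  path_bits;
    unsigned int   remote_qp;
};

/* Return the offset of the first differing byte, or -1 if the two
 * buffers are identical over len bytes. */
long first_diff(const void *a, const void *b, size_t len)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < len; i++)
        if (pa[i] != pb[i])
            return (long)i;
    return -1;
}
```

Logging the result of first_diff() on the two report_addr fields next to the
"Differ by Address" message would show whether the mismatch sits in a
meaningful field or in bytes that one side never filled in.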

> > 2. With the latest tree, do your simulation tests now work ? The
> > osm.fdbs UNREACHABLE was only a problem with the file and not with the
> > LFTs in the network.
> >   
> Yes they do.

Good.

> > 3. In terms of file format changes, the lack of any file versioning
> > makes it difficult to move these forward when the need arises. (The
> > format change to osm.mcfdbs was unintentional (not by design)).
> >   
> The issues until now were not that a file format change was required but 
> were unintentional.
> When we will have a real need to change file format I am sure we can 
> agree on adding version and change all parsers.

We will have a real need at some point. It is more likely the config
files but there may be more info to add to other files as well.
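
As a sketch of what versioning could look like (purely hypothetical; no such
header exists in the current dump files), a parser could look for a version
marker on the first line and treat files without one as version 0:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical convention: the first line reads "# format-version: N".
 * Files written before versioning was introduced have no such line and
 * are reported as version 0. */
int demo_parse_format_version(const char *first_line)
{
    int version = 0;

    assert(first_line != NULL);
    if (sscanf(first_line, "# format-version: %d", &version) == 1 &&
        version > 0)
        return version;
    return 0;
}
```

Each parser could then switch on the returned value and reject versions it
does not know, instead of silently misreading a changed layout.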

> > 4. I encourage you to look at and comment on the OpenSM patches rather
> > than waiting for them to be in the tree.
> >   
> I am sure you did not mean to, but now I have to admit my limited skills 
> in catching bugs by reading patches :-( .

Not just read, but they are there to try out as well.

> Instead on relying on bug reading I use automatic regression. I wish we 
> could agree on some regression that
> each developer will have to run before patches are committed to the trunk.

> On my side I would love to have an automatic way to include all the 
> patches posted (one at a time) run "dead or alive" check
> and provide feedback. Currently my automation is limited to testing the 
> trunk. So I will always be complaining after the patches are
> committed. I think this is the way most other components testing works.

You could try out the patches and do the same thing before they are
committed.

> What kind of regression suite do you and Sasha use?

Haven't we been over this before ? I might ask the same of you and
Yevgeny. There are similar occurrences.

I use osmtest for most of my testing as well as a subnet on which I
perform directed tests on the functionality being changed.

Sasha does testing on both live and simulated subnets.

> Can we agree on minimal pre-commit testing?

I think we do a reasonable level of pre commit testing and have been
responsive to breakages not necessarily of our own making.

> Can we have a branch for that sake where all patches will first have to 
> go into for 2 days? (it will allow for pre-trunk testing).

That's why the patches go out first. The patches in question were out
there for over a week.

This seems like another level of overhead to me. Is there real gain here
?

-- Hal

> > Thanks for your help in finding the bugs sooner.
> >
> > -- Hal
> >
> 


From halr at voltaire.com  Fri Dec  8 14:44:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 17:44:40 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <4579E333.4000901@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
Message-ID: <1165617878.26559.4952.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote:
> I have got the same problem with OFED 1.1 stack also, but the frequency 
> is less. I had to try 120 fail overs (by rebooting the highest priority 
> OpenSM server) before getting into this problem.

If I understand you correctly, you reboot the master SM and the standby
does not take over (become master). Is that correct ?

Is this with 2 SMs or more ?

> At this state OpenSM doesn't update anything to the log files; 
> doesn't assign the LIDs to the other nodes; doesn't respond 
> to the multi cast join operations. Even another OpenSM is 
> started on another node with higher priority it can 
> not become the master. The only way to recover from this is by killing 
> the stuck OpenSM.

What SMLID do the nodes in the subnet point to ?

Can you determine where it is stuck ? Sounds like it could be in some
tight loop. Can you build with debug symbols and attach with gdb when
this occurs to see ?

-- Hal

>  VBabu
> 
> Hal Rosenstock wrote:
> 
> >I don't see any explicit changes to the SM state machine which would
> >affect this but as I have mentioned before there are many bug fixes in
> >OFED 1.1. I can't conclusively state whether this would fix the issue
> >you see but would be in a much better position to try to figure this
> >out.
> >
> >-- Hal
> >
> >  
> >
> >> Hi
> >>
> >>   I have topology of two switches and a bunch of nodes, with each 
> >> node having 2port HCAs. Port1 of every node connects to switch1 and 
> >> Port2 of every node connects to switch2. So Port1 and Port2 are in 
> >> different subnets. So I am running one OpenSM (from OFED 1.0) for 
> >> each port on one node designated as a server. To guard against that 
> >> server going down I have another server node to run the OpenSM in 
> >> "standby" mode for each port. I will adjust the priorities such that 
> >> first server always has "master" OpenSM and second server has 
> >> "standby" OpenSM.
> >>
> >>    When the first server rebooted, "standby" OpenSM should takeover 
> >> the mastership. It usually works fine but sometimes it is failing to 
> >> takeover. In the following example OpenSM for Port1 failed to 
> >> takeover, but OpenSM for Port2 took over and became "master". The 
> >> OpenSM for Port1 seems be stuck in some weired state (strace shows 
> >> that it is sleeping). It is no longer assigning LIDs to the rest of 
> >> the nodes in the subnet and not responding to the broadcast joins. 
> >> The log file shows nothing from past 4 days. I have the complete log 
> >> files if needed.
> >>
> >>    Is this a known problem and fixed in OFED 1.1 ?
> >>
> >> [root at vortex3l-72 158]# ibv_devinfo
> >> hca_id: mthca0
> >>        fw_ver:                         5.1.400
> >>        node_guid:                      0050:4501:4b1a:0000
> >>        sys_image_guid:                 0050:4501:4b1a:0003
> >>        vendor_id:                      0x02c9
> >>        vendor_part_id:                 25218
> >>        hw_ver:                         0xA0
> >>        board_id:                       ARM0020000001
> >>        phys_port_cnt:                  2
> >>                port:   1
> >>                        state:                  PORT_ACTIVE (4)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 7
> >>                        port_lid:               1
> >>                        port_lmc:               0x00
> >>
> >>                port:   2
> >>                        state:                  PORT_ACTIVE (4)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 1
> >>                        port_lid:               1
> >>                        port_lmc:               0x00
> >>
> >> [root at vortex3l-72 158]# ps -aux | grep open
> >> Warning: bad syntax, perhaps a bogus '-'? See 
> >> /usr/share/doc/procps-3.2.3/FAQ
> >> root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 
> >> /usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f 
> >> /var/log/opensm2.log
> >> root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 
> >> /usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f 
> >> /var/log/opensm1.log
> >> root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
> >> [root at vortex3l-72 158]# strace -p7975
> >> Process 7975 attached - interrupt to quit
> >> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0}, NULL)                = 0
> >> nanosleep({10, 0},  <unfinished ...>
> >> Process 7975 detached
> >> [root at vortex3l-72 158]# uptime
> >> 12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
> >> [root at vortex3l-72 158]# date
> >> Mon Nov 27 12:13:05 PST 2006
> >> [root at vortex3l-72 158]#  tail /var/log/opensm1.log
> >> Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 
> >> 3673M
> >>
> >> Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting 
> >> Generic Notice type:3 num:66 from LID:0x0000 
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting 
> >> Generic Notice type:3 num:66 from LID:0x0000 
> >> GID:0xfe80000000000000,0x0000000000000000
> >> Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 
> >> 0x5045014b1a0001
> >> Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> >>
> >> Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> >>
> >> [root at vortex3l-72 158]#  tail /var/log/opensm2.log
> >>                                00 00 00 00 00 00 00 00   00 00 00 00 
> >> 00 00 00 00
> >>
> >> Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting 
> >> Generic Notice type:3 num:65 from LID:0x0001 
> >> GID:0xfe80000000000000,0x005045014b1a0002
> >> Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: 
> >> Cannot find destination port with LID:0x0002
> >> Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: 
> >> Cannot find destination port with LID:0x0003
> >> Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: 
> >> Cannot find destination port with LID:0x0004
> >> Nov 27 12:10:32 146382 [41401960] -> Removed port with 
> >> GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> >> Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: 
> >> Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to 
> >> light sweep sampling list
> >> Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> >>                                Path = [0][2]
> >>


From greg.lindahl at qlogic.com  Fri Dec  8 15:36:16 2006
From: greg.lindahl at qlogic.com (Greg Lindahl)
Date: Fri, 8 Dec 2006 15:36:16 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <ada4ps79vh0.fsf@cisco.com>
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
	<ada4ps79vh0.fsf@cisco.com>
Message-ID: <20061208233616.GA10646@greglaptop>

On Thu, Dec 07, 2006 at 02:44:59PM -0800, Roland Dreier wrote:

> But you should also cope with
> non-OFED (vanilla upstream) drivers, probably by testing
> LINUX_VERSION_CODE too I suppose.

Although RHEL4 shows how this can break down in the future... they
backport kernel stuff while leaving LINUX_VERSION_CODE set to 2.6.9.
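
To illustrate the breakdown: a plain version comparison reports a backported
feature as missing, so a separate feature test has to override it. A minimal
sketch, buildable in user space; DEMO_KERNEL_VERSION mirrors the encoding of
the kernel's KERNEL_VERSION macro, and the have_backport argument stands in
for a hypothetical configure-time probe:

```c
#include <assert.h>

/* Same encoding as the kernel's KERNEL_VERSION(a, b, c), redefined under
 * another name so the sketch compiles outside the kernel tree. */
#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Decide whether a feature introduced upstream in version `since` can be
 * used. have_backport is an assumption of this sketch: the result of
 * some build-time check for the symbol itself. */
int demo_feature_present(unsigned version_code, unsigned since,
                         int have_backport)
{
    if (have_backport)
        return 1;               /* the vendor kernel carries it anyway */
    return version_code >= since;
}
```

In a driver this would normally be an #if on LINUX_VERSION_CODE combined with
a macro set by the build system, but the logic is the same.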

-- greg


From venkatesh.babu at 3leafnetworks.com  Fri Dec  8 15:44:38 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 15:44:38 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165617878.26559.4952.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
Message-ID: <4579F8E6.3040604@3leafnetworks.com>


 I have 3 nodes and 2 IB switches. Port 1 of all nodes is connected to 
switch 1 and Port 2 of all nodes is connected to switch 2. So each switch 
creates its own subnet and hence I have two instances of OpenSM, one for 
each port. I have two OpenSMs running with priority 1 on node 1 and two 
OpenSMs running with priority 13 on node 2. Node 3 doesn't run any 
OpenSM, just the OFED kernel modules. I reboot node 2 every 
10 minutes. Since it has the highest priority, every time it boots up it 
grabs the mastership from node 1. It works most of the time, except 
when this problem occurs.
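
For context, the handover described above follows the usual SM election
rule: the higher priority wins, and the lower port GUID breaks a tie. A
minimal sketch of that comparison, with a made-up struct rather than the
OpenSM types:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: just the two values the election depends on. */
struct demo_sm {
    uint8_t  priority;   /* 0 (lowest) .. 15 (highest) */
    uint64_t port_guid;
};

/* Return nonzero if SM a should be master over SM b. */
int demo_sm_outranks(const struct demo_sm *a, const struct demo_sm *b)
{
    assert(a != NULL && b != NULL);
    if (a->priority != b->priority)
        return a->priority > b->priority;
    return a->port_guid < b->port_guid;
}
```

With priority 13 on node 2 and priority 1 on node 1, node 2 outranks node 1
whenever it is up, which is why each reboot forces a handover in both
directions.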

 When this problem occurs, node 3 shows the old/stale SMLID information. 
But if you reload the OFED drivers or reboot the node to get a new LID 
assignment, it shows SMLID as 0. Even though node 1's SMLID and port LID 
are the same, it has not completely asserted the mastership. See the log 
messages below -

[root ~]# ibv_devinfo
...
                port:   1
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00


 The strace output is shown below -
[root~]# strace -p 7518
Process 7518 attached - interrupt to quit
restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x335d) = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0
nanosleep({10, 0}, NULL)                = 0

 The GDB output is shown below -
[root ~]# gdb /usr/bin/opensm 7518
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/bin/opensm, process 7518
Reading symbols from /usr/lib/libibumad.so.1...done.
Loaded symbols for /usr/lib/libibumad.so.1
Reading symbols from /usr/lib/libopensm.so.1...done.
Loaded symbols for /usr/lib/libopensm.so.1
Reading symbols from /usr/lib/libosmcomp.so.1...done.
Loaded symbols for /usr/lib/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182896213152 (LWP 7518)]
[New Thread 1136679264 (LWP 7544)]
[New Thread 1126189408 (LWP 7543)]
[New Thread 1115699552 (LWP 7542)]
[New Thread 1105209696 (LWP 7541)]
[New Thread 1094719840 (LWP 7540)]
[New Thread 1084229984 (LWP 7534)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/lib/libosmvendor.so.1...done.
Loaded symbols for /usr/lib/libosmvendor.so.1
Reading symbols from /usr/lib/libibcommon.so.1...done.
Loaded symbols for /usr/lib/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x00000031603bf368 in usleep () from /lib64/tls/libc.so.6
#2  0x000000316080df32 in cl_thread_suspend (pause_ms=10000) at cl_thread.c:125
#3  0x000000000040584e in main ()
(gdb) print osm_hup_flag
$1 = 0
(gdb)


Following is the log output. It is entering the MASTER state, but it 
doesn't show the "SUBNET UP" event. It gets stuck.

[root ~]#  tail /var/log/opensm1.log
Dec 04 15:59:35 573040 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3726M

Dec 04 15:59:35 783462 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Dec 04 15:59:35 783541 [9576BCA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Dec 04 15:59:35 783589 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
Dec 04 15:59:35 787924 [9576BCA0] -> osm_vendor_bind: Binding to port 0x5045014b1a0001
Dec 04 15:59:35 800404 [0000] -> Entering STANDBY state

Dec 04 15:59:36 053784 [0000] -> Entering MASTER state




From halr at voltaire.com  Fri Dec  8 15:57:17 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 18:57:17 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <4579F8E6.3040604@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
Message-ID: <1165622233.26559.8108.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 18:44, Venkatesh Babu wrote:
>  I have 3 nodes and 2 IB switches. Port 1 of all nodes connected to 
> switch 1 and Port2 of all nodes connected to switch 2. So each switch 
> creates its own subnet and hence I have two instances of OpenSM for each 
> port. 

And the two switches are not connected to each other, right ?

> I have two OpenSMs running with priority 1 on node1 and two 
> OpenSM's running with priority 13 on node 2.

Do you set a different subnet prefix (other than the default on one) ?
Not sure if this matters yet in OpenIB but it might.

> Node 3 doesn't have any 
> OpenSM's but just a OFED kernel modules. I reboot the node 2 every 
> 10minutes. Since it has the highest priority, every time it boots up it 
> grabs the mastership from the node 1. It works most of the time, except 
> when this problem occurs.

Now I understand the scenario.

> When this problem occurs, node 3 shows the old/stale SMLID information. 
> But if you reload the ofed drivers or reboot the node to get the new LID 
> assignment it shows SMLID as 0.

That's consistent with the SM not really taking over. Just wanted to be
sure.

> Even though Node 1's SMLID and port LID 
> are same, it was not completely asserted the mastership.

OK.

> See the log messages below -
> 
> [root ~]# ibv_devinfo
> ...
>                 port:   1
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
> 
> 
>  The strace output is shown below -
> [root~]# strace -p 7518
> Process 7518 attached - interrupt to quit
> restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x335d) = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> nanosleep({10, 0}, NULL)                = 0
> 
>  The GDB output is shown below -
> [root ~]# gdb /usr/bin/opensm 7518
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> Attaching to program: /usr/bin/opensm, process 7518
> Reading symbols from /usr/lib/libibumad.so.1...done.
> Loaded symbols for /usr/lib/libibumad.so.1
> Reading symbols from /usr/lib/libopensm.so.1...done.
> Loaded symbols for /usr/lib/libopensm.so.1
> Reading symbols from /usr/lib/libosmcomp.so.1...done.
> Loaded symbols for /usr/lib/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182896213152 (LWP 7518)]
> [New Thread 1136679264 (LWP 7544)]
> [New Thread 1126189408 (LWP 7543)]
> [New Thread 1115699552 (LWP 7542)]
> [New Thread 1105209696 (LWP 7541)]
> [New Thread 1094719840 (LWP 7540)]
> [New Thread 1084229984 (LWP 7534)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/lib/libosmvendor.so.1...done.
> Loaded symbols for /usr/lib/libosmvendor.so.1
> Reading symbols from /usr/lib/libibcommon.so.1...done.
> Loaded symbols for /usr/lib/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x000000316038ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x00000031603bf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x000000316080df32 in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x000000000040584e in main ()
> (gdb) print osm_hup_flag
> $1 = 0
> (gdb)

That's the main thread. It's in the following loop:

    while( !osm_exit_flag ) {
      if (opt.console)
        osm_console(&osm);
      else
        cl_thread_suspend( 10000 );

      if (osm_hup_flag) {
        osm_hup_flag = 0;
        /* a HUP signal should only start a new heavy sweep */
        osm.subn.force_immediate_heavy_sweep = TRUE;
        osm_opensm_sweep( &osm );
      }
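
As a rough illustration of why the main thread sitting in nanosleep with
osm_hup_flag == 0 is just the idle case, here is a minimal standalone sketch
of the same flag-polling pattern. The names on_sighup, consume_hup_request
and install_sighup_handler are made up for this example and are not taken
from the OpenSM source.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Set asynchronously by the signal handler, polled by the main loop. */
static volatile sig_atomic_t hup_flag = 0;

/* Hypothetical handler: only records that a HUP arrived. */
static void on_sighup(int signo)
{
    (void)signo;
    hup_flag = 1;
}

/*
 * Called by the main loop each time its sleep returns.
 * Returns true (and clears the flag) if a sweep was requested,
 * false if there is nothing to do and the loop should sleep again.
 */
static bool consume_hup_request(void)
{
    if (!hup_flag)
        return false;
    hup_flag = 0;
    return true;
}

/* Installs the handler; asserts that installation succeeded. */
static void install_sighup_handler(void)
{
    void (*prev)(int) = signal(SIGHUP, on_sighup);
    assert(prev != SIG_ERR);
    (void)prev;
}
```

With no HUP delivered, consume_hup_request() keeps returning false, so a
loop built on it does nothing but sleep. That matches the strace and gdb
output above and says nothing about the threads that do the real work.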

What about the other threads ? What are they doing ?

> Following is the log output. It is entering the MASTER state, but it 
> doesn't show the "SUBNET UP" event. It gets stuck.

I wouldn't expect that given the problem you're hitting. The SUBNET UP
only occurs once the heavy sweep is completed. That's not happening.

-- Hal

> [root ~]#  tail /var/log/opensm1.log
> Dec 04 15:59:35 573040 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 3726M
> 
> Dec 04 15:59:35 783462 [9576BCA0] -> osm_report_notice: Reporting 
> Generic Notice type:3 num:66 from LID:0x0000 
> GID:0xfe80000000000000,0x0000000000000000
> Dec 04 15:59:35 783541 [9576BCA0] -> osm_report_notice: Reporting 
> Generic Notice type:3 num:66 from LID:0x0000 
> GID:0xfe80000000000000,0x0000000000000000
> Dec 04 15:59:35 783589 [9576BCA0] -> osm_vendor_bind: Binding to port 
> 0x5045014b1a0001
> Dec 04 15:59:35 787924 [9576BCA0] -> osm_vendor_bind: Binding to port 
> 0x5045014b1a0001
> Dec 04 15:59:35 800404 [0000] -> Entering STANDBY state
> 
> Dec 04 15:59:36 053784 [0000] -> Entering MASTER state
> 
> 
> 
> Hal Rosenstock wrote:
> 
> >On Fri, 2006-12-08 at 17:12, Venkatesh Babu wrote:
> >  
> >
> >>I have got the same problem with OFED 1.1 stack also, but the frequency 
> >>is less. I had to try 120 fail overs (by rebooting the highest priority 
> >>OpenSM server) before getting into this problem.
> >>    
> >>
> >
> >If I understand you correctly, you reboot the master SM and the standby
> >does not takeover (become master). Is that correct ?
> >
> >Is this with 2 SMs or more ?
> >
> >  
> >
> >>In this state OpenSM doesn't write anything to the log files, 
> >>doesn't assign LIDs to the other nodes, and doesn't respond 
> >>to the multicast join operations. Even if another OpenSM is 
> >>started on another node with higher priority, it can 
> >>not become the master. The only way to recover from this is by killing 
> >>the stuck OpenSM.
> >>    
> >>
> >
> >What SMLID do the nodes in the subnet point to ?
> >
> >Can you determine where is it stuck ? Sounds like it could be in some
> >tight loop. Can you build with gdb and attach when this occurs to see ?
> >
> >-- Hal
> >
> >  
> >
> >> VBabu
> >>
> >>Hal Rosenstock wrote:
> >>
> >>    
> >>
> >>>I don't see any explicit changes to the SM state machine which would
> >>>affect this but as I have mentioned before there are many bug fixes in
> >>>OFED 1.1. I can't conclusively state whether this would fix the issue
> >>>you see but would be in a much better position to try to figure this
> >>>out.
> >>>
> >>>-- Hal
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>>Hi
> >>>>
> >>>>  I have topology of two switches and a bunch of nodes, with each 
> >>>>node having 2port HCAs. Port1 of every node connects to switch1 and 
> >>>>Port2 of every node connects to switch2. So Port1 and Port2 are in 
> >>>>different subnets. So I am running one OpenSM (from OFED 1.0) for 
> >>>>each port on one node designated as a server. To guard against that 
> >>>>server going down I have another server node to run the OpenSM in 
> >>>>"standby" mode for each port. I will adjust the priorities such that 
> >>>>first server always has "master" OpenSM and second server has 
> >>>>"standby" OpenSM.
> >>>>
> >>>>   When the first server is rebooted, the "standby" OpenSM should take over 
> >>>>the mastership. It usually works fine, but sometimes it fails to 
> >>>>take over. In the following example OpenSM for Port1 failed to 
> >>>>take over, but OpenSM for Port2 took over and became "master". The 
> >>>>OpenSM for Port1 seems to be stuck in some weird state (strace shows 
> >>>>that it is sleeping). It is no longer assigning LIDs to the rest of 
> >>>>the nodes in the subnet and not responding to the broadcast joins. 
> >>>>The log file shows nothing from the past 4 days. I have the complete log 
> >>>>files if needed.
> >>>>
> >>>>   Is this a known problem and fixed in OFED 1.1 ?
> >>>>
> >>>>[root at vortex3l-72 158]# ibv_devinfo
> >>>>hca_id: mthca0
> >>>>       fw_ver:                         5.1.400
> >>>>       node_guid:                      0050:4501:4b1a:0000
> >>>>       sys_image_guid:                 0050:4501:4b1a:0003
> >>>>       vendor_id:                      0x02c9
> >>>>       vendor_part_id:                 25218
> >>>>       hw_ver:                         0xA0
> >>>>       board_id:                       ARM0020000001
> >>>>       phys_port_cnt:                  2
> >>>>               port:   1
> >>>>                       state:                  PORT_ACTIVE (4)
> >>>>                       max_mtu:                2048 (4)
> >>>>                       active_mtu:             2048 (4)
> >>>>                       sm_lid:                 7
> >>>>                       port_lid:               1
> >>>>                       port_lmc:               0x00
> >>>>
> >>>>               port:   2
> >>>>                       state:                  PORT_ACTIVE (4)
> >>>>                       max_mtu:                2048 (4)
> >>>>                       active_mtu:             2048 (4)
> >>>>                       sm_lid:                 1
> >>>>                       port_lid:               1
> >>>>                       port_lmc:               0x00
> >>>>
> >>>>[root at vortex3l-72 158]# ps -aux | grep open
> >>>>Warning: bad syntax, perhaps a bogus '-'? See 
> >>>>/usr/share/doc/procps-3.2.3/FAQ
> >>>>root      7988  0.0  0.0 92784 1672 ?        Sl   Nov22   0:06 
> >>>>/usr/bin/opensm -g 0x005045014b1a0002 -p 13 -s 10 -u -f 
> >>>>/var/log/opensm2.log
> >>>>root      7975  0.0  0.0 92784 1572 ?        Sl   Nov22   0:06 
> >>>>/usr/bin/opensm -g 0x005045014b1a0001 -p 13 -s 10 -u -f 
> >>>>/var/log/opensm1.log
> >>>>root      7803  0.0  0.0 51096  668 pts/0    S+   12:11   0:00 grep open
> >>>>[root at vortex3l-72 158]# strace -p7975
> >>>>Process 7975 attached - interrupt to quit
> >>>>restart_syscall(0x7fbffff630, 0, 0, 0x7fbffff501, 0x130) = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0}, NULL)                = 0
> >>>>nanosleep({10, 0},  <unfinished ...>
> >>>>Process 7975 detached
> >>>>[root at vortex3l-72 158]# uptime
> >>>>12:13:02 up 4 days, 17:05,  5 users,  load average: 0.00, 0.00, 0.00
> >>>>[root at vortex3l-72 158]# date
> >>>>Mon Nov 27 12:13:05 PST 2006
> >>>>[root at vortex3l-72 158]#  tail /var/log/opensm1.log
> >>>>Nov 22 19:09:27 894295 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn 
> >>>>3673M
> >>>>
> >>>>Nov 22 19:09:28 164482 [9576BCA0] -> osm_report_notice: Reporting 
> >>>>Generic Notice type:3 num:66 from LID:0x0000 
> >>>>GID:0xfe80000000000000,0x0000000000000000
> >>>>Nov 22 19:09:28 164560 [9576BCA0] -> osm_report_notice: Reporting 
> >>>>Generic Notice type:3 num:66 from LID:0x0000 
> >>>>GID:0xfe80000000000000,0x0000000000000000
> >>>>Nov 22 19:09:28 164608 [9576BCA0] -> osm_vendor_bind: Binding to port 
> >>>>0x5045014b1a0001
> >>>>Nov 22 19:09:28 167508 [9576BCA0] -> osm_vendor_bind: Binding to port 
> >>>>0x5045014b1a0001
> >>>>Nov 22 19:09:28 177285 [0000] -> Entering STANDBY state
> >>>>
> >>>>Nov 22 19:09:28 442435 [0000] -> Entering MASTER state
> >>>>
> >>>>[root at vortex3l-72 158]#  tail /var/log/opensm2.log
> >>>>                               00 00 00 00 00 00 00 00   00 00 00 00 
> >>>>00 00 00 00
> >>>>
> >>>>Nov 27 12:10:32 146325 [41401960] -> osm_report_notice: Reporting 
> >>>>Generic Notice type:3 num:65 from LID:0x0001 
> >>>>GID:0xfe80000000000000,0x005045014b1a0002
> >>>>Nov 27 12:10:32 146343 [41401960] -> __match_notice_to_inf_rec: 
> >>>>Cannot find destination port with LID:0x0002
> >>>>Nov 27 12:10:32 146358 [41401960] -> __match_notice_to_inf_rec: 
> >>>>Cannot find destination port with LID:0x0003
> >>>>Nov 27 12:10:32 146373 [41401960] -> __match_notice_to_inf_rec: 
> >>>>Cannot find destination port with LID:0x0004
> >>>>Nov 27 12:10:32 146382 [41401960] -> Removed port with 
> >>>>GUID:0x0002c9020020f5ae LID range [0x6,0x6] of node:sqaathlon03 HCA-1
> >>>>Nov 27 12:10:32 146400 [41401960] -> osm_drop_mgr_process: ERR 0108: 
> >>>>Unknown remote side for node 0x0002c9010d26bae0 port 11. Adding to 
> >>>>light sweep sampling list
> >>>>Nov 27 12:10:32 146420 [41401960] -> Directed Path Dump of 1 hop path:
> >>>>                               Path = [0][2]
> >>>>
> >>>>        
> >>>>
> >
> >  
> >


From venkatesh.babu at 3leafnetworks.com  Fri Dec  8 16:30:01 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 16:30:01 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165622233.26559.8108.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
Message-ID: <457A0389.7030103@3leafnetworks.com>

Hal Rosenstock wrote:

>And the two switches are not connected to each other, right ?
>  
>
  Yes, the switches are not connected.

>Do you set a different subnet prefix (other than the default on one) ?
>Not sure if this matters yet in OpenIB but it might.
>  
>
 I don't know how to set the subnet prefix, so it is probably the default one.

>That's the main thread. It's in the following loop:
>
>    while( !osm_exit_flag ) {
>      if (opt.console)
>        osm_console(&osm);
>      else
>        cl_thread_suspend( 10000 );
>
>      if (osm_hup_flag) {
>        osm_hup_flag = 0;
>        /* a HUP signal should only start a new heavy sweep */
>        osm.subn.force_immediate_heavy_sweep = TRUE;
>        osm_opensm_sweep( &osm );
>      }
>
>What about the other threads ? What are they doing ?
>  
>
  Yes. I got this. It was in this loop. I didn't realize there were 
other OpenSM threads running. I need to find that out.

>I wouldn't expect that given the problem you're hitting. The SUBNET UP
>only occurs once the heavy sweep is completed. That's not happening.
>
>-- Hal
>  
>
   Is the heavy sweep supposed to happen after the failover ?

   Is there any documentation on the OpenSM architecture and design ?

 VBabu


From halr at voltaire.com  Fri Dec  8 16:48:13 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 19:48:13 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <457A0389.7030103@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
Message-ID: <1165625283.26559.10270.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >And the two switches are not connected to each other, right ?
> >  
> >
>   Yes, the switches are not connected.
> 
> >Do you set a different subnet prefix (other than the default on one) ?
> >Not sure if this matters yet in OpenIB but it might.
> >  
> >
>  I don't know how to set subnet prefix.

In opensm.opts file:

# Subnet prefix used on this subnet
subnet_prefix 0xfe80000000000000

(that's the default one)

>  So it may be default one.
> 
> >That's the main thread. It's in the following loop:
> >
> >    while( !osm_exit_flag ) {
> >      if (opt.console)
> >        osm_console(&osm);
> >      else
> >        cl_thread_suspend( 10000 );
> >
> >      if (osm_hup_flag) {
> >        osm_hup_flag = 0;
> >        /* a HUP signal should only start a new heavy sweep */
> >        osm.subn.force_immediate_heavy_sweep = TRUE;
> >        osm_opensm_sweep( &osm );
> >      }
> >
> >What about the other threads ? What are they doing ?
> >  
> >
>   Yes. I got this. It was in this loop. I didn't realize there were 
> other OpenSM threads running. I need to find that out.

OK.

> >I wouldn't expect that given the problem you're hitting. The SUBNET UP
> >only occurs once the heavy sweep is completed. That's not happening.
> >
> >-- Hal
> >  
> >
>    Is the heavy sweep supposed to happen after the failover ?

The standby, after determining that the master is unresponsive, goes
back to discovering. In your configuration it will find no other SM and
so will go to master. I think it got that far.

Once it transitions to master, it does a heavy sweep to configure the
subnet. Something is stopping that from completing. I'm not sure what is
going wrong.
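
To make that sequence concrete, here is a small sketch of the transitions
as I described them. It is only a model under my own assumptions: the
state, event and field names (sm_state_t, sm_event_t, heavy_sweep_pending
and so on) are invented for illustration and are not the ones OpenSM uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state and event names for illustration; not OpenSM's own. */
typedef enum { SM_DISCOVERING, SM_STANDBY, SM_MASTER } sm_state_t;

typedef enum {
    EV_MASTER_UNRESPONSIVE, /* standby decides the master is not answering */
    EV_FOUND_BETTER_SM,     /* discovery found an SM that should be master */
    EV_NO_OTHER_SM          /* discovery found no other SM on the subnet */
} sm_event_t;

typedef struct {
    sm_state_t state;
    bool heavy_sweep_pending; /* set when the SM becomes master */
    bool subnet_up;           /* reported only after the heavy sweep ends */
} sm_t;

/* Apply one event to the state machine. */
static void sm_handle_event(sm_t *sm, sm_event_t ev)
{
    assert(sm != NULL);
    switch (sm->state) {
    case SM_STANDBY:
        if (ev == EV_MASTER_UNRESPONSIVE)
            sm->state = SM_DISCOVERING;
        break;
    case SM_DISCOVERING:
        if (ev == EV_FOUND_BETTER_SM) {
            sm->state = SM_STANDBY;
        } else if (ev == EV_NO_OTHER_SM) {
            sm->state = SM_MASTER;
            sm->heavy_sweep_pending = true;
            sm->subnet_up = false;
        }
        break;
    case SM_MASTER:
        break; /* master-side events are not modelled in this sketch */
    }
}

/* Called when the heavy sweep finishes; only then is the subnet "up". */
static void sm_heavy_sweep_done(sm_t *sm)
{
    assert(sm != NULL);
    if (sm->state == SM_MASTER && sm->heavy_sweep_pending) {
        sm->heavy_sweep_pending = false;
        sm->subnet_up = true;
    }
}
```

In terms of this model, your log stops with state == SM_MASTER and
heavy_sweep_pending still set: "Entering MASTER state" was logged, but the
step that would lead to SUBNET UP never ran to completion.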

>    Is there any documentation on the OpenSM architecture and design ?

Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
for what an SM is supposed to do.

-- Hal

>  VBabu


From venkatesh.babu at 3leafnetworks.com  Fri Dec  8 17:03:30 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 17:03:30 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165625283.26559.10270.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
Message-ID: <457A0B62.2060501@3leafnetworks.com>

I have now hit another instance of the problem, and this time I have more information.

Node1:
======

[root at vortex3l-71 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0050:4501:4a5a:0000
        sys_image_guid:                 0050:4501:4a5a:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               7
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               4
                        port_lmc:               0x00

[root at vortex3l-71 ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See 
/usr/share/doc/procps-3.2.3/FAQ
root      6774  0.0  0.0 92844 1684 ?        Sl   Dec07   0:06 
/usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f 
/var/log/opensm1.log
root     21537  0.0  0.4 64556 9276 ttyS0    S+   16:48   0:00 gdb 
/usr/local/ofed/bin/opensm 6787
root      6787  0.0  0.0 92844 1728 ?        Tl   Dec07   0:05 
/usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f 
/var/log/opensm2.log
root     22566  0.0  0.0 51072  692 pts/0    S+   16:53   0:00 grep open
[root at vortex3l-71 ~]# tail /var/log/opensm2.log

                                00 00 00 00 00 00 00 00   00 00 00 00 00 
00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 
00 00 00

                                00 00 00 00 00 00 00 00   00 00 00 00 00 
00 00 00

Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error 
on MAD sized umad (Interrupted system call)
Dec 07 11:29:14 625421 [0000] -> Exiting SM

[root at vortex3l-71 ~]#
[root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /usr/local/ofed/bin/opensm, process 6787
Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182899544416 (LWP 6787)]
[New Thread 1157658976 (LWP 6797)]
[New Thread 1147169120 (LWP 6796)]
[New Thread 1136679264 (LWP 6795)]
[New Thread 1126189408 (LWP 6794)]
[New Thread 1115699552 (LWP 6793)]
[New Thread 1105209696 (LWP 6792)]
[New Thread 1094719840 (LWP 6791)]
[New Thread 1084229984 (LWP 6789)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) info threads
  9 Thread 1084229984 (LWP 6789)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 6791)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 6792)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 6793)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 6794)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 6795)  0x0000003858c088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 6796)  0x0000003858c08acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 6797)  0x0000003857fbcd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182899544416 (LWP 6787)  0x0000003857f8ed65 in 
__nanosleep_nocancel
    () from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0  
0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0  
0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
#1  0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2  0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not 
available.
) at src/umad.c:805
#3  0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50)
    at osm_vendor_ibumad.c:266
#4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
cl_thread.c:61
#5  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#6  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0  
0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
    wait_us=10000000, interruptible=1) at cl_event.c:181
#2  0x00000000004362dc in __osm_sm_sweeper ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x000000000044d771 in __osm_vl15_poller ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
cl_thread.c:61
#4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0  
0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157
#2  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb)


Node 2:
======

[root at localhost ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0050:4501:4a9e:0000
        sys_image_guid:                 0050:4501:4a9e:0003
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       ARM0020000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               2
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 4
                        port_lid:               2
                        port_lmc:               0x00

[root at localhost ~]# ps -aux | grep open
Warning: bad syntax, perhaps a bogus '-'? See 
/usr/share/doc/procps-3.2.3/FAQ
root      6854  0.0  0.0 92844 1648 ?        Sl   16:12   0:00 
/usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f 
/var/log/opensm1.log
root     14005  0.0  0.4 64632 9312 ttyS0    S+   16:46   0:00 gdb 
/var/log/opensm2.log 6867
root      6867  0.0  0.0 92844 1536 ?        Tl   16:12   0:00 
/usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f 
/var/log/opensm2.log
root     16223  0.0  0.0 51060  680 pts/0    S+   16:56   0:00 grep open
[root at localhost ~]# tail /var/log/opensm2.log
Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: 
BFS through all port guids in the subnet ]
Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop 
Tables configured on all switches
Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: 
Received an invalid delete request on MGID: 0xff12401bffff0000 : 
0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002
Dec 07 05:15:07 677004 [0000] -> SUBNET UP

Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
method = SubnAdmSet, scope_state = 0x1, component mask = 
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002
Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
method = SubnAdmSet, scope_state = 0x1, component mask = 
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002
Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
method = SubnAdmSet, scope_state = 0x1, component mask = 
0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
Dec 07 11:29:03 817752 [0000] -> Exiting SM

[root at localhost ~]#
[root at localhost ~]# gdb /var/log/opensm2.log 6867
GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as 
"x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable 
format: File format not recognized

Attaching to process 6867
Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols 
found)...done.
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
Reading symbols from /lib64/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread 182899548512 (LWP 6867)]
[New Thread 1157658976 (LWP 6884)]
[New Thread 1147169120 (LWP 6883)]
[New Thread 1136679264 (LWP 6882)]
[New Thread 1126189408 (LWP 6881)]
[New Thread 1115699552 (LWP 6880)]
[New Thread 1105209696 (LWP 6879)]
[New Thread 1094719840 (LWP 6878)]
[New Thread 1084229984 (LWP 6869)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
Reading symbols from /lib64/tls/libc.so.6...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
0x00000032eec8ed65 in __nanosleep_nocancel ()
   from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) info threads
  9 Thread 1084229984 (LWP 6869)  0x00000032ef908acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  8 Thread 1094719840 (LWP 6878)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  7 Thread 1105209696 (LWP 6879)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  6 Thread 1115699552 (LWP 6880)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  5 Thread 1126189408 (LWP 6881)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  4 Thread 1136679264 (LWP 6882)  0x00000032ef9088da in 
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  3 Thread 1147169120 (LWP 6883)  0x00000032ef908acf in 
pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
  2 Thread 1157658976 (LWP 6884)  0x00000032eecbcd22 in poll ()
   from /lib64/tls/libc.so.6
  1 Thread 182899548512 (LWP 6867)  0x00000032eec8ed65 in 
__nanosleep_nocancel
    () from /lib64/tls/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0  
0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
#1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
#2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
cl_thread.c:125
#3  0x0000000000405b71 in main ()
(gdb) thread 2
[Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0  
0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
(gdb) bt
#0  0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
#1  0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available.
) at src/umad.c:775
#2  0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not 
available.
) at src/umad.c:805
#3  0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50)
    at osm_vendor_ibumad.c:266
#4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
cl_thread.c:61
#5  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#6  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#7  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0  
0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
    wait_us=10000000, interruptible=1) at cl_event.c:181
#2  0x00000000004362dc in __osm_sm_sweeper ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x000000000044d771 in __osm_vl15_poller ()
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 6
[Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 7
[Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 8
[Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0  
0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
    wait_us=4294967295, interruptible=1) at cl_event.c:168
#2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
    at cl_threadpool.c:71
#3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
cl_thread.c:61
#4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#6  0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0  
0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
/lib64/tls/libpthread.so.0
(gdb) bt
#0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib64/tls/libpthread.so.0
#1  0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
#2  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
#3  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb)


Node 3:
======

[root at devsunj ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0002:c902:0020:ed58
        sys_image_guid:                 0002:c902:0020:ed5b
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       MT_0150000001
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 2
                        port_lid:               1
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

[root at devsunj ~]#


Hal Rosenstock wrote:

>On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
>  
>
>>Hal Rosenstock wrote:
>>
>>    
>>
>>>And the two switches are not connected to each other, right ?
>>> 
>>>
>>>      
>>>
>>  Yes, the switches are not connected.
>>
>>    
>>
>>>Did you set a different subnet prefix (other than the default) on one
>>>of them ? Not sure if this matters yet in OpenIB but it might.
>>> 
>>>
>>>      
>>>
>> I don't know how to set the subnet prefix.
>>    
>>
>
>In opensm.opts file:
>
># Subnet prefix used on this subnet
>subnet_prefix 0xfe80000000000000
>
>(that's the default one)
>
>  
>
>> So it may be the default one.
>>
>>    
>>
>>>That's the main thread. It's in the following loop:
>>>
>>>   while( !osm_exit_flag ) {
>>>     if (opt.console)
>>>       osm_console(&osm);
>>>     else
>>>       cl_thread_suspend( 10000 );
>>>
>>>     if (osm_hup_flag) {
>>>       osm_hup_flag = 0;
>>>       /* a HUP signal should only start a new heavy sweep */
>>>       osm.subn.force_immediate_heavy_sweep = TRUE;
>>>       osm_opensm_sweep( &osm );
>>>     }
>>>   }
>>>
>>>What about the other threads ? What are they doing ?
>>> 
>>>
>>>      
>>>
>>  Yes. I got this. It was in this loop. I didn't realize there were 
>>other OpenSM threads running. I need to find that out.
>>    
>>
>
>OK.
>
>  
>
>>>I wouldn't expect that given the problem you're hitting. The SUBNET UP
>>>only occurs once the heavy sweep is completed. That's not happening.
>>>
>>>-- Hal
>>> 
>>>
>>>      
>>>
>>   Is the heavy sweep supposed to happen after the failover ?
>>    
>>
>
>The standby, after determining that the master is unresponsive, will go
>back to discovering but, in your configuration, will find no other SM and
>will become master. I think it got that far.
>
>Once it transitions to master, it does a heavy sweep to configure the
>subnet. Something is stopping that from completing. I'm not sure what is
>going wrong.
>
>  
>
>>   Is there any documentation on the OpenSM architecture and design ?
>>    
>>
>
>Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
>for what an SM is supposed to do.
>
>-- Hal
>
>  
>
>> VBabu
>>    
>>
>
>  
>
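
The failover path Hal describes in the quoted text above (standby notices the
master is unresponsive, goes back to discovering, finds no other SM, becomes
master, then runs a heavy sweep) can be sketched as a small state machine.
This is only an illustration under stated assumptions: the names sm_state,
sm_event and sm_next are hypothetical and are not taken from the OpenSM
sources, and the real SM has more states and conditions than shown here.

```c
/* Hypothetical, simplified model of the failover path described above.
 * None of these names come from OpenSM; they are for illustration only. */
enum sm_state { SM_DISCOVERING, SM_STANDBY, SM_MASTER };

enum sm_event {
    EV_MASTER_UNRESPONSIVE,    /* standby stopped hearing from the master */
    EV_DISCOVERY_NO_OTHER_SM,  /* discovery finished, no other SM found   */
    EV_DISCOVERY_FOUND_MASTER  /* discovery finished, a master exists     */
};

/* Return the next state. *heavy_sweep is set to 1 only on the transition
 * into master, which is when the subnet has to be (re)configured. */
static enum sm_state sm_next(enum sm_state state, enum sm_event event,
                             int *heavy_sweep)
{
    *heavy_sweep = 0;

    switch (state) {
    case SM_STANDBY:
        if (event == EV_MASTER_UNRESPONSIVE)
            return SM_DISCOVERING;
        break;
    case SM_DISCOVERING:
        if (event == EV_DISCOVERY_FOUND_MASTER)
            return SM_STANDBY;
        if (event == EV_DISCOVERY_NO_OTHER_SM) {
            *heavy_sweep = 1;
            return SM_MASTER;
        }
        break;
    case SM_MASTER:
        break;
    }
    return state; /* event not relevant in this state */
}
```

In this model SUBNET UP would only be logged after the heavy sweep requested
by the last transition has completed, which matches the symptom under
discussion: the new master is reached but the sweep never finishes.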


From halr at voltaire.com  Fri Dec  8 17:38:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 08 Dec 2006 20:38:48 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <457A0B62.2060501@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
Message-ID: <1165628315.26559.12385.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 20:03, Venkatesh Babu wrote:
> Now I hit another instance of the problem. Now I have more information.

Was this the same scenario or something different ?

> Node1:
> ======
> 
> [root at vortex3l-71 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0050:4501:4a5a:0000

So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
that right ?
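
For reference, the OUI is simply the top 24 bits of the 64-bit GUID. A
minimal sketch of how to pull it out of the ibv_devinfo text form follows;
the helper names parse_guid and guid_oui are made up for this example and
are not part of libibverbs or any OFED library.

```c
#include <stdint.h>
#include <stdio.h>

/* Parse a GUID in the "xxxx:xxxx:xxxx:xxxx" form printed by ibv_devinfo.
 * Returns 0 on success, -1 if the text does not match that form.
 * (Hypothetical helper, not part of any OFED library.) */
static int parse_guid(const char *text, uint64_t *guid)
{
    unsigned int a, b, c, d;

    if (sscanf(text, "%4x:%4x:%4x:%4x", &a, &b, &c, &d) != 4)
        return -1;

    *guid = ((uint64_t)a << 48) | ((uint64_t)b << 32) |
            ((uint64_t)c << 16) | (uint64_t)d;
    return 0;
}

/* The OUI is the most significant 24 bits of the 64-bit GUID. */
static uint32_t guid_oui(uint64_t guid)
{
    return (uint32_t)(guid >> 40);
}
```

With the node_guid above, 0050:4501:4a5a:0000, this gives 0x005045. The
node 3 GUID from the same report, 0002:c902:0020:ed58, gives 0x0002c9,
which lines up with the vendor_id 0x02c9 that ibv_devinfo prints.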

>         sys_image_guid:                 0050:4501:4a5a:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               7
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4
>                         port_lid:               4
>                         port_lmc:               0x00
> 
> [root at vortex3l-71 ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See 
> /usr/share/doc/procps-3.2.3/FAQ
> root      6774  0.0  0.0 92844 1684 ?        Sl   Dec07   0:06 
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0001 -p 1 -s 10 -u -f 
> /var/log/opensm1.log
> root     21537  0.0  0.4 64556 9276 ttyS0    S+   16:48   0:00 gdb 
> /usr/local/ofed/bin/opensm 6787
> root      6787  0.0  0.0 92844 1728 ?        Tl   Dec07   0:05 
> /usr/local/ofed/bin/opensm -g 0x005045014a5a0002 -p 1 -s 10 -u -f 
> /var/log/opensm2.log
> root     22566  0.0  0.0 51072  692 pts/0    S+   16:53   0:00 grep open
> [root at vortex3l-71 ~]# tail /var/log/opensm2.log
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
>                                 00 00 00 00 00 00 00 00   00 00 00 00 00 
> 00 00 00
> 
> Dec 07 11:29:14 623895 [45007960] -> umad_receiver: ERR 5404: recv error 
> on MAD sized umad (Interrupted system call)
> Dec 07 11:29:14 625421 [0000] -> Exiting SM

Does this correspond to when node 2 SM goes down, SM comes up, or
something else ? 

Not sure why OpenSM decides to exit (due to this error which should be
recoverable). It then fails to exit (hangs) as the other threads are not
terminated. 
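
To illustrate what "recoverable" means here: "Interrupted system call" is
EINTR, and the usual way to handle it around poll() is to retry the call
unless a shutdown was actually requested. The sketch below is not the
libibumad or osm_vendor_ibumad code; poll_retry_eintr and the stop flag are
hypothetical names used only for this example.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>

/* Call poll(), retrying when it is interrupted by a signal (EINTR).
 * If 'stop' is non-NULL and has been set (for example by a signal
 * handler), give up instead of retrying so the caller can shut down;
 * in that case -1 is returned with errno still EINTR.
 * Note: each retry restarts the full timeout, which is acceptable for a
 * sketch but not for code that needs a hard deadline. */
static int poll_retry_eintr(struct pollfd *fds, nfds_t nfds, int timeout_ms,
                            const volatile sig_atomic_t *stop)
{
    for (;;) {
        int n = poll(fds, nfds, timeout_ms);

        if (n >= 0)
            return n;          /* ready descriptors, or 0 on timeout */
        if (errno != EINTR)
            return -1;         /* a real error, report it */
        if (stop && *stop)
            return -1;         /* interrupted because we are exiting */
        /* otherwise: spurious interruption, poll again */
    }
}
```

A receiver thread built this way would keep running across a stray signal
and would only leave its loop when the exit flag is set, which covers both
halves of the problem: the unexpected exit and the thread left blocked in
poll().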

Is osm_exit_flag set ? I presume it is but would like verification.
What are the thread_state values of the various threads ?

> [root at vortex3l-71 ~]#
> [root at vortex3l-71 ~]# gdb /usr/local/ofed/bin/opensm 6787
> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu"...
> (no debugging symbols found)
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> 
> Attaching to program: /usr/local/ofed/bin/opensm, process 6787
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899544416 (LWP 6787)]
> [New Thread 1157658976 (LWP 6797)]
> [New Thread 1147169120 (LWP 6796)]
> [New Thread 1136679264 (LWP 6795)]
> [New Thread 1126189408 (LWP 6794)]
> [New Thread 1115699552 (LWP 6793)]
> [New Thread 1105209696 (LWP 6792)]
> [New Thread 1094719840 (LWP 6791)]
> [New Thread 1084229984 (LWP 6789)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) info threads
>   9 Thread 1084229984 (LWP 6789)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   8 Thread 1094719840 (LWP 6791)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   7 Thread 1105209696 (LWP 6792)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   6 Thread 1115699552 (LWP 6793)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   5 Thread 1126189408 (LWP 6794)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   4 Thread 1136679264 (LWP 6795)  0x0000003858c088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   3 Thread 1147169120 (LWP 6796)  0x0000003858c08acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   2 Thread 1157658976 (LWP 6797)  0x0000003857fbcd22 in poll ()
>    from /lib64/tls/libc.so.6
>   1 Thread 182899544416 (LWP 6787)  0x0000003857f8ed65 in 
> __nanosleep_nocancel
>     () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899544416 (LWP 6787))]#0  
> 0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857f8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x0000003857fbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6797))]#0  
> 0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x0000003857fbcd22 in poll () from /lib64/tls/libc.so.6
> #1  0x0000002a9588d90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2  0x0000002a9588da2d in umad_recv (portid=Variable "portid" is not 
> available.
> ) at src/umad.c:805
> #3  0x0000002a9578367b in umad_receiver (p_ptr=0x5c2d50)
>     at osm_vendor_ibumad.c:266
> #4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
> cl_thread.c:61
> #5  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6796))]#0  
> 0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c08acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
>     wait_us=10000000, interruptible=1) at cl_event.c:181
> #2  0x00000000004362dc in __osm_sm_sweeper ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6795))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x000000000044d771 in __osm_vl15_poller ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6794))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6793))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6792))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6791))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
> cl_thread.c:61
> #4  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6789))]#0  
> 0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x0000003858c088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a95675991 in __cl_timer_prov_cb (context=0x0) at cl_timer.c:157
> #2  0x0000003858c060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3  0x0000003857fc5b43 in clone () from /lib64/tls/libc.so.6
> #4  0x0000000000000000 in ?? ()
> (gdb)
> 
> 
> Node 2:
> ======

Is this when node 2 comes back up and SM is restarted on both ports or
is it after the SM is stopped on port 2 ?

> [root at localhost ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0050:4501:4a9e:0000
>         sys_image_guid:                 0050:4501:4a9e:0003
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       ARM0020000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               2
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 4

This port still points at the SM on node 1, right ?

>                         port_lid:               2
>                         port_lmc:               0x00
> 
> [root at localhost ~]# ps -aux | grep open
> Warning: bad syntax, perhaps a bogus '-'? See 
> /usr/share/doc/procps-3.2.3/FAQ
> root      6854  0.0  0.0 92844 1648 ?        Sl   16:12   0:00 
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0001 -p 8 -s 10 -u -f 
> /var/log/opensm1.log
> root     14005  0.0  0.4 64632 9312 ttyS0    S+   16:46   0:00 gdb 
> /var/log/opensm2.log 6867
> root      6867  0.0  0.0 92844 1536 ?        Tl   16:12   0:00 
> /usr/local/ofed/bin/opensm -g 0x005045014a9e0002 -p 8 -s 10 -u -f 
> /var/log/opensm2.log
> root     16223  0.0  0.0 51060  680 pts/0    S+   16:56   0:00 grep open
> [root at localhost ~]# tail /var/log/opensm2.log
> Dec 07 05:15:07 675863 [41401960] -> osm_subn_set_up_down_min_hop_table: 
> BFS through all port guids in the subnet ]
> Dec 07 05:15:07 675898 [41401960] -> osm_ucast_mgr_process: Min Hop 
> Tables configured on all switches
> Dec 07 05:15:07 682095 [43204960] -> osm_mcmr_rcv_leave_mgrp: ERR 1B25: 
> Received an invalid delete request on MGID: 0xff12401bffff0000 : 
> 0x00000000ffffffff for PortGID: 0xfe80000000000000 : 0x0050450148ba0002
> Dec 07 05:15:07 677004 [0000] -> SUBNET UP
> 
> Dec 07 05:15:09 598888 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x005045014a9e0002
> Dec 07 07:26:17 429099 [42803960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x032e1480ffffffff from port 0x0050450148ba0002
> Dec 07 07:26:18 429309 [41E02960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: 
> method = SubnAdmSet, scope_state = 0x1, component mask = 
> 0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> 0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> Dec 07 11:29:03 817752 [0000] -> Exiting SM

You stopped this SM, right ?

> [root at localhost ~]#
> [root at localhost ~]# gdb /var/log/opensm2.log 6867

Why gdb this node's SM ? I'm not following you.

Also, gdb should be pointed at the executable
(/usr/local/ofed/bin/opensm), not the log file.

> GNU gdb Red Hat Linux (6.3.0.0-1.63rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain 
> conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as 
> "x86_64-redhat-linux-gnu"..."/var/log/opensm2.log": not in executable 
> format: File format not recognized
> 
> Attaching to process 6867
> Reading symbols from /usr/local/ofed/bin/opensm...(no debugging symbols 
> found)...done.
> Using host libthread_db library "/lib64/tls/libthread_db.so.1".
> Reading symbols from /usr/local/ofed/lib64/libopensm.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libopensm.so.1
> Reading symbols from /usr/local/ofed/lib64/libosmcomp.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmcomp.so.1
> Reading symbols from /lib64/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread 182899548512 (LWP 6867)]
> [New Thread 1157658976 (LWP 6884)]
> [New Thread 1147169120 (LWP 6883)]
> [New Thread 1136679264 (LWP 6882)]
> [New Thread 1126189408 (LWP 6881)]
> [New Thread 1115699552 (LWP 6880)]
> [New Thread 1105209696 (LWP 6879)]
> [New Thread 1094719840 (LWP 6878)]
> [New Thread 1084229984 (LWP 6869)]
> Loaded symbols for /lib64/tls/libpthread.so.0
> Reading symbols from /usr/local/ofed/lib64/libosmvendor.so.2...done.
> Loaded symbols for /usr/local/ofed/lib64/libosmvendor.so.2
> Reading symbols from /usr/local/ofed/lib64/libibumad.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibumad.so.1
> Reading symbols from /usr/local/ofed/lib64/libibcommon.so.1...done.
> Loaded symbols for /usr/local/ofed/lib64/libibcommon.so.1
> Reading symbols from /lib64/tls/libc.so.6...done.
> Loaded symbols for /lib64/tls/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> 0x00000032eec8ed65 in __nanosleep_nocancel ()
>    from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) info threads
>   9 Thread 1084229984 (LWP 6869)  0x00000032ef908acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   8 Thread 1094719840 (LWP 6878)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   7 Thread 1105209696 (LWP 6879)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   6 Thread 1115699552 (LWP 6880)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   5 Thread 1126189408 (LWP 6881)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   4 Thread 1136679264 (LWP 6882)  0x00000032ef9088da in 
> pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   3 Thread 1147169120 (LWP 6883)  0x00000032ef908acf in 
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
>   2 Thread 1157658976 (LWP 6884)  0x00000032eecbcd22 in poll ()
>    from /lib64/tls/libc.so.6
>   1 Thread 182899548512 (LWP 6867)  0x00000032eec8ed65 in 
> __nanosleep_nocancel
>     () from /lib64/tls/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 182899548512 (LWP 6867))]#0  
> 0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eec8ed65 in __nanosleep_nocancel () from /lib64/tls/libc.so.6
> #1  0x00000032eecbf368 in usleep () from /lib64/tls/libc.so.6
> #2  0x0000002a9567504e in cl_thread_suspend (pause_ms=10000) at 
> cl_thread.c:125
> #3  0x0000000000405b71 in main ()
> (gdb) thread 2
> [Switching to thread 2 (Thread 1157658976 (LWP 6884))]#0  
> 0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> (gdb) bt
> #0  0x00000032eecbcd22 in poll () from /lib64/tls/libc.so.6
> #1  0x0000002a9588e90d in dev_poll (fd=Variable "fd" is not available.
> ) at src/umad.c:775
> #2  0x0000002a9588ea2d in umad_recv (portid=Variable "portid" is not 
> available.
> ) at src/umad.c:805
> #3  0x0000002a9578467b in umad_receiver (p_ptr=0x5c2d50)
>     at osm_vendor_ibumad.c:266
> #4  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5c2dc0) at 
> cl_thread.c:61
> #5  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #6  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #7  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 1147169120 (LWP 6883))]#0  
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eb3b in cl_event_wait_on (p_event=0x588798,
>     wait_us=10000000, interruptible=1) at cl_event.c:181
> #2  0x00000000004362dc in __osm_sm_sweeper ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x588878) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 1136679264 (LWP 6882))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a258,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x000000000044d771 in __osm_vl15_poller ()
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58a2c8) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 1126189408 (LWP 6881))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x5900e0) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 6
> [Switching to thread 6 (Thread 1115699552 (LWP 6880))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x590010) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 7
> [Switching to thread 7 (Thread 1105209696 (LWP 6879))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58ff40) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 8
> [Switching to thread 8 (Thread 1094719840 (LWP 6878))]#0  
> 0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef9088da in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a9566eaa9 in cl_event_wait_on (p_event=0x58a540,
>     wait_us=4294967295, interruptible=1) at cl_event.c:168
> #2  0x0000002a956750fa in __cl_thread_pool_routine (context=0x58a468)
>     at cl_threadpool.c:71
> #3  0x0000002a95674f6a in __cl_thread_wrapper (arg=0x58b760) at 
> cl_thread.c:61
> #4  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #5  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 9
> [Switching to thread 9 (Thread 1084229984 (LWP 6869))]#0  
> 0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 () from 
> /lib64/tls/libpthread.so.0
> (gdb) bt
> #0  0x00000032ef908acf in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>    from /lib64/tls/libpthread.so.0
> #1  0x0000002a956759cd in __cl_timer_prov_cb (context=0x0) at cl_timer.c:168
> #2  0x00000032ef9060aa in start_thread () from /lib64/tls/libpthread.so.0
> #3  0x00000032eecc5b43 in clone () from /lib64/tls/libc.so.6
> #4  0x0000000000000000 in ?? ()
> (gdb)
> 
> 
> Node 3:
> ======
> 
> [root at devsunj ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0002:c902:0020:ed58
>         sys_image_guid:                 0002:c902:0020:ed5b
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       MT_0150000001
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 2
>                         port_lid:               1
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_INIT (2)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
> [root at devsunj ~]#
> 
> 
> 
> 
> Hal Rosenstock wrote:
> 
> >On Fri, 2006-12-08 at 19:30, Venkatesh Babu wrote:
> >  
> >
> >>Hal Rosenstock wrote:
> >>
> >>    
> >>
> >>>And the two switches are not connected to each other, right ?
> >>> 
> >>>
> >>>      
> >>>
> >>  Yes, the switches are not connected.
> >>
> >>    
> >>
> >>>Do you set a different subnet prefix (other than the default on one) ?
> >>>Not sure if this matters yet in OpenIB but it might.
> >>> 
> >>>
> >>>      
> >>>
> >> I don't know how to set subnet prefix.
> >>    
> >>
> >
> >In opensm.opts file:
> >
> ># Subnet prefix used on this subnet
> >subnet_prefix 0xfe80000000000000
> >
> >(that's the default one)
> >
> >  
> >
> >> So it may be default one.
> >>
> >>    
> >>
> >>>That's the main thread. It's in the following loop:
> >>>
> >>>   while( !osm_exit_flag ) {
> >>>     if (opt.console)
> >>>       osm_console(&osm);
> >>>     else
> >>>       cl_thread_suspend( 10000 );
> >>>
> >>>     if (osm_hup_flag) {
> >>>       osm_hup_flag = 0;
> >>>       /* a HUP signal should only start a new heavy sweep */
> >>>       osm.subn.force_immediate_heavy_sweep = TRUE;
> >>>       osm_opensm_sweep( &osm );
> >>>     }
> >>>
> >>>What about the other threads ? What are they doing ?
> >>> 
> >>>
> >>>      
> >>>
> >>  Yes. I got this. It was in this loop. I didn't realized there are 
> >>other OpenSM threads running. I need to find that out.
> >>    
> >>
> >
> >OK.
> >
> >  
> >
> >>>I wouldn't expect that given the problem your hitting. The SUBNET UP
> >>>only occurs once the heavy sweep is completed. That's not happening.
> >>>
> >>>-- Hal
> >>> 
> >>>
> >>>      
> >>>
> >>   Is the heavy sweep supposed to happen after the failover ?
> >>    
> >>
> >
> >The standby after determining that the master is non responsive will go
> >back to discovering but in your configuration will find no other SM and
> >will go to master. I think it got that far.
> >
> >Once it transitions to master, it does a heavy sweep to configure the
> >subnet. Something is stopping that from completing. I'm not sure what is
> >going wrong.
> >
> >  
> >
> >>   Is there any documentaion on the OpenSM architecture and design ?
> >>    
> >>
> >
> >Just the code AFAIK. You can read the SM and SA sections of IBA volume 1
> >for what an SM is supposed to do.
> >
> >-- Hal
> >
> >  
> >
> >> VBabu
> >>    
> >>
> >
> >  
> >


From venkatesh.babu at 3leafnetworks.com  Fri Dec  8 18:25:20 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Fri, 08 Dec 2006 18:25:20 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165628315.26559.12385.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
	<1165628315.26559.12385.camel@hal.voltaire.com>
Message-ID: <457A1E90.5040606@3leafnetworks.com>


Hal Rosenstock wrote:

>Was this the same scenario or something different ?
>  
>
I had killed the previous OpenSM instance, so I lost that information.
This is the same OpenSM failover issue, reproduced with the exact same setup
and scripts. It is another instance of the problem.

>So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
>that right ?
>
>  
>
  Yes, that is right. They are the OUI vendors for the IB HCAs.

>Does this correspond to when node 2 SM goes down, SM comes up, or
>something else ? 
>  
>
  I don't know the exact sequence in which this message is displayed. All I
can say is that it was the last message printed by OpenSM. I am not
rebooting node 1 or killing its OpenSM; that side stays constant.
  I have a script that reboots node 2 every couple of minutes. It will
stop rebooting if it finds one of these conditions:
1. SM1 on port1 is master but SM2 on port2 is not master
2. SM2 on port2 is master but SM1 on port1 is not master
3. Port1/2 is not ACTIVE
4. Port1/2's sm_lid/port lid is zero
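
The four stop conditions listed above can be written as a single predicate.
This is only a sketch of the logic, not the actual test script: the struct
and function names are hypothetical, and it assumes "Port1/2" means "either
port" in conditions 3 and 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of one port and the SM instance bound to it. */
typedef struct {
    bool     sm_is_master; /* the SM on this port reports MASTER */
    bool     port_active;  /* port state is PORT_ACTIVE */
    uint16_t sm_lid;       /* sm_lid as shown by ibv_devinfo */
    uint16_t port_lid;     /* port_lid as shown by ibv_devinfo */
} port_snapshot_t;

/* Returns true when the reboot loop should stop and the state be captured. */
static bool should_stop(const port_snapshot_t *p1, const port_snapshot_t *p2)
{
    assert(p1 != NULL && p2 != NULL);

    /* Conditions 1 and 2: exactly one of the two SMs is master. */
    if (p1->sm_is_master != p2->sm_is_master)
        return true;

    /* Condition 3: either port is not ACTIVE. */
    if (!p1->port_active || !p2->port_active)
        return true;

    /* Condition 4: any sm_lid or port_lid is still zero. */
    if (p1->sm_lid == 0 || p1->port_lid == 0 ||
        p2->sm_lid == 0 || p2->port_lid == 0)
        return true;

    return false;
}
```

With the Node 3 output quoted earlier, port 2 (PORT_INIT, sm_lid 0,
port_lid 0) would trip conditions 3 and 4.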

  I captured all of this output at the end of the test, when the
script terminated.

>Not sure why OpenSM decides to exit (due to this error which should be
>recoverable). It then fails to exit (hangs) as the other threads are not
>terminated. 
>
>Is osm_exit_flag set ? I presume it is but would like verification.
>What are the thread_state values of the various threads ?
>  
>
  Unfortunately someone powered off Node1 while I was debugging, so I
cannot find this out.

  On Node2 :
(gdb) p osm_exit_flag
$1 = 0

  How do I find out the thread_state value ?

>>Node 2:
>>======
>>    
>>
>
>Is this when node 2 comes back up and SM is restarted on both ports or
>is it after the SM is stopped on port 2 ?
>
>  
>
   As I said earlier, this is the snapshot taken when the script stopped
rebooting, as I described above.

>>                port:   2
>>                        state:                  PORT_INIT (2)
>>                        max_mtu:                2048 (4)
>>                        active_mtu:             2048 (4)
>>                        sm_lid:                 4
>>    
>>
>
>This port still points at the SM on node 1, right ?
>  
>
   Yes that is right.

>  
>
>>                        port_lid:               2
>>                        port_lmc:               0x00
>>
>>
>>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
>>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
>>Dec 07 11:29:03 817752 [0000] -> Exiting SM
>>    
>>
>
>You stopped this SM, right ?
>  
>
  No, I didn't stop the SM.

>>[root at localhost ~]#
>>[root at localhost ~]# gdb /var/log/opensm2.log 6867
>>    
>>
>
>Why gdb this node's SM ? I'm not following you.
>
>Should point at executable not log.
>  
>
  You are right. It is a cut and paste error.

   VBabu


From halr at voltaire.com  Sat Dec  9 04:12:39 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Dec 2006 07:12:39 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <457A1E90.5040606@3leafnetworks.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
	<1165628315.26559.12385.camel@hal.voltaire.com>
	<457A1E90.5040606@3leafnetworks.com>
Message-ID: <1165666352.26559.39788.camel@hal.voltaire.com>

On Fri, 2006-12-08 at 21:25, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >Was this the same scenario or something different ?
> >  
> >
> I had killed the previous OpenSM instance. So I lost that information.
> It is the same OpenSM failover issue and using the exact same setup and 
> scripts to reproduce. It another instance of the problem. 
> 
> >So your OUI is 0x005045 ? That appears to be registered to Rioworks. Is
> >that right ?
> >
> >  
> >
>   Yes, that is right. They are the OUI vendors for the IB HCAs.
> 
> >Does this correspond to when node 2 SM goes down, SM comes up, or
> >something else ? 
> >  
> >
>   I don't know the exact sequence when this message is displayed. All I 
> can say is that it was the last message printed by the OpenSM. I am not 
> rebooting the node 1 or  killing the OpenSM.  It is staying constant.
>   I have a script to reboot node 2 every couple of minutes.  It  will 
> stop rebooting  if it finds one of these conditions -
> 1.  SM1 on port1 is master but SM2 on port2 is not master
> 2. SM2 on port2 is master but SM1 on port1 is not master
> 3. Port1/2 is not ACTIVE
> 4. Port1/2's sm_lid/port lid is zero

Understood.

>   I am capturing this all the output at the end of the test when the 
> script was terminated.
> 
> >Not sure why OpenSM decides to exit (due to this error which should be
> >recoverable). It then fails to exit (hangs) as the other threads are not
> >terminated. 
> >
> >Is osm_exit_flag set ? I presume it is but would like verification.
> >What are the thread_state values of the various threads ?
> >  
> >
>   Unfortunately someone powerd off Node1, while I was debugging. So I 
> can not findout this.
> 
>   On Node2 :
> (gdb) p osm_exit_flag
> $1 = 0

I was interested in the one on Node1 when it appeared to be trying to
exit (which it shouldn't be but is) and the other threads don't seem to
terminate.

>   How do I findout the thread_state value ?

It's a variable in the SM structure (in the SM thread).

> >>Node 2:
> >>======
> >>    
> >>
> >
> >Is this when node 2 comes back up and SM is restarted on both ports or
> >is it after the SM is stopped on port 2 ?
> >
> >  
> >
>    As I said earlier, this is the snapshot when the script is stopped 
> rebooting as I described above.
> 
> >>                port:   2
> >>                        state:                  PORT_INIT (2)
> >>                        max_mtu:                2048 (4)
> >>                        active_mtu:             2048 (4)
> >>                        sm_lid:                 4
> >>    
> >>
> >
> >This port still points at the SM on node 1, right ?
> >  
> >
>    Yes that is right.
> 
> >  
> >
> >>                        port_lid:               2
> >>                        port_lmc:               0x00
> >>
> >>
> >>0x0000000000010083, expected comp mask = 0x00000000000130c7, MGID: 
> >>0xffffffffffff0000 : 0x0000000000000000 from port 0x0050450148ba0002
> >>Dec 07 11:29:03 817752 [0000] -> Exiting SM
> >>    
> >>
> >
> >You stopped this SM, right ?
> >  
> >
>   No I didn't stop the SM.
> 
> >>[root at localhost ~]#
> >>[root at localhost ~]# gdb /var/log/opensm2.log 6867
> >>    
> >>
> >
> >Why gdb this node's SM ? I'm not following you.
> >
> >Should point at executable not log.
> >  
> >
>   You are right. It is a cut and paste error.

One more thing:

When you upgraded to OFED 1.2, did you build and install the management
libraries (libibcommon, libibumad are important here and libibmad for
diags) ?

-- Hal

> 
>    VBabu


From halr at voltaire.com  Sat Dec  9 05:48:28 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Dec 2006 08:48:28 -0500
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165666352.26559.39788.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
	<1165628315.26559.12385.camel@hal.voltaire.com>
	<457A1E90.5040606@3leafnetworks.com>
	<1165666352.26559.39788.camel@hal.voltaire.com>
Message-ID: <1165672098.26559.43885.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 07:12, Hal Rosenstock wrote:
> One more thing:
> 
> When you upgraded to OFED 1.2, did you build and install the management
> libraries (libibcommon, libibumad are important here and libibmad for
> diags) ?

Does the problem always occur on the "second" subnet (port 2's subnet)
or does it ever occur on port 1's subnet ?

Can you leave the "port 1" subnet completely unconfigured on all machines
(including the OpenSM instances that run on port 1) and see whether the
problem is still reproducible ?

Thanks.

-- Hal


From eitan at mellanox.co.il  Sat Dec  9 06:13:01 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 9 Dec 2006 16:13:01 +0200
Subject: [openib-general] [PATCH] osm: Routing Tables are full of
 UNREACHABLE instead of real route
Message-ID: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com>

Hi Sasha,

Your proposal for moving all "dump" files generation to end of sweep - 
just before "SUBNET UP" is reported - makes perfect sense to me.

But it is a bit lower in priority than the rest of the stuff.
Not sure if it is worth tackling right now.

Eitan

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Friday, December 08, 2006 11:55 PM
> To: Eitan Zahavi
> Cc: Hal Rosenstock; Yevgeny Kliteynik; OPENIB GENERAL
> Subject: Re: [PATCH] osm: Routing Tables are full of UNREACHABLE
> instead of real route
> 
> Hi Eitan,
> 
> On 17:12 Thu 07 Dec     , Eitan Zahavi wrote:
> > Hi Hal,
> >
> > I resolved the mystery behind the osm.fdbs that is now full of
> > UNREACHABLE instead of correct out ports.
> >
> > The problem is a consequence of the new code that does not use the
> > switch LFT blocks for the intermediate LFT assignments:
> > The idea of having incremental updates only relies on temporary buffer
> > that the routing algorithm fills.
> > Then it is sent to the wire only if there is a diff between the switch
> > LFT tables (from the SMDB) and the temporary buffer.
> >
> > So the switch LFT tables are not being directly updated by the routing
> > algorithm - but only by the GetResp obtained as reply to the setting.
> > Until this stage of the description - everything looks right.
> >
> > But what is wrong is that the dump of LFT tables is invoked before the
> > GetResp is obtained.
> > So if only a single sweep is invoked the resulting osm.fdbs show the
> > original state of the SMDB tables which is full of 0xFF = UNREACHABLE.
> 
> Right.
> 
> >
> > The patch below is taking the easy way and should be probably revisited.
> > Instead of having a separate algorithm step for dumping out the
> > resulting GetResp data after all LFT responses were obtained it just
> > copies the sent LFT blocks to the SMDB.
> 
> Would not this be better just to move all dumps at end of the OpenSM
> heavy sweep. This should be simple, right?
> 
> Sasha
> 
> >
> > I think we need to have at least this simple patch until we have the
> > dump move to a new algorithm step.
> >
> > Thanks
> > Eitan
> >
> > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> >
> > =====================================================================
> >
> > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> > index 5a55da8..3a62c7f 100644
> > --- a/osm/opensm/osm_ucast_mgr.c
> > +++ b/osm/opensm/osm_ucast_mgr.c
> > @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table(
> >                "osm_ucast_mgr_set_fwd_table: ERR 3A05: "
> >                "Sending linear fwd. tbl. block failed (%s)\n",
> >                ib_get_err_str( status ) );
> > -    }
> > +    } else {
> > +       /*
> > +         HACK: for now we will assume we succeeded to send
> > +         and set the local DB based on it. This should allow
> > +         us to immediatly dump out our routing
> > +       */
> > +       osm_switch_set_ft_block(
> > +          p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho);
> > +        }
> >   }
> >
> >   OSM_LOG_EXIT( p_mgr->p_log );
> >
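
The ordering problem described in the quoted message can be illustrated with
a toy model. This is a sketch under stated assumptions, not OpenSM code: all
toy_* names and the two-block table size are hypothetical, and only the
64-entry block size and the 0xFF = UNREACHABLE value are taken from the
text and patch above. The routing result lives in a temporary buffer, only
blocks that differ from the SMDB copy are sent, and the SMDB copy changes
only when the response arrives, so a dump taken in between still shows
UNREACHABLE everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LFT_BLOCK_SIZE  64    /* entries per LFT block ("block_id_ho * 64") */
#define TOY_NUM_BLOCKS  2     /* hypothetical, kept tiny for illustration */
#define TOY_UNREACHABLE 0xFF  /* "0xFF = UNREACHABLE" */

/* The SM's stored view (SMDB) of one switch's linear forwarding table. */
typedef struct {
    uint8_t smdb_lft[TOY_NUM_BLOCKS * LFT_BLOCK_SIZE];
} toy_switch_t;

static void toy_switch_init(toy_switch_t *sw)
{
    memset(sw->smdb_lft, TOY_UNREACHABLE, sizeof(sw->smdb_lft));
}

/* Flags the blocks that differ from the SMDB copy and would go on the
 * wire. The SMDB copy itself is deliberately left untouched here. */
static int toy_send_changed_blocks(const toy_switch_t *sw,
                                   const uint8_t *lft_buf,
                                   int sent[TOY_NUM_BLOCKS])
{
    int count = 0;
    for (int b = 0; b < TOY_NUM_BLOCKS; b++) {
        size_t off = (size_t)b * LFT_BLOCK_SIZE;
        sent[b] = memcmp(sw->smdb_lft + off, lft_buf + off,
                         LFT_BLOCK_SIZE) != 0;
        count += sent[b];
    }
    return count;
}

/* Models the GetResp handler: only now does the SMDB copy get updated. */
static void toy_on_get_resp(toy_switch_t *sw, const uint8_t *lft_buf,
                            int block)
{
    assert(block >= 0 && block < TOY_NUM_BLOCKS);
    size_t off = (size_t)block * LFT_BLOCK_SIZE;
    memcpy(sw->smdb_lft + off, lft_buf + off, LFT_BLOCK_SIZE);
}

/* Stands in for the dump: counts entries still marked UNREACHABLE. */
static int toy_count_unreachable(const toy_switch_t *sw)
{
    int count = 0;
    for (size_t i = 0; i < sizeof(sw->smdb_lft); i++)
        count += (sw->smdb_lft[i] == TOY_UNREACHABLE);
    return count;
}
```

Calling toy_count_unreachable() right after toy_send_changed_blocks()
reproduces the symptom; calling it after toy_on_get_resp() has run for every
sent block corresponds to moving the dump to the end of the sweep. The patch
above instead copies the sent block into the SMDB at send time.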


From eitan at mellanox.co.il  Sat Dec  9 06:26:55 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 09 Dec 2006 16:26:55 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061208221001.GG9193@sashak.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
Message-ID: <457AC7AF.5090202@mellanox.co.il>

Hi Sasha,

 
Without another devel branch I will not be able to test patches before
they make it into the trunk.

I do not know how to automatically extract patches from the mails and apply
them to a tree so that I can run an automatic patch check.

I am not a great fan of a new branch either.

So we need to agree that our mode of work is regression runs on the trunk,
with bugs reported after the commit.

I do not have a big issue with this (but it is more work for Hal).

Eitan

Sasha Khapyorsky wrote:
> On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
>   
>> Instead on relying on bug reading I use automatic regression. I wish we 
>> could agree on some regression that
>> each developer will have to run before patches are committed to the trunk.
>> On my side I would love to have an automatic way to include all the 
>> patches posted (one at a time) run "dead or alive" check
>> and provide feedback. Currently my automation is limited to testing the 
>> trunk. So I will always be complaining after the patches are
>> committed. I think this is the way most other components testing works.
>>
>> What kind of regression suite do you and Sasha use?
>>     
>
> On my side it clearly depends from kind of changes. In general I would
> call this "uni-testing".
>
>   
>> Can we agree on minimal pre-commit testing?
>> Can we have a branch for that sake where all patches will first have to 
>> go into for 2 days? (it will allow for pre-trunk testing).
>>     
>
> One more development branch? Will you test (or even see) this? If so I
> can publish the "fresh" tree.
>
> Sasha
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From eitan at mellanox.co.il  Sat Dec  9 06:35:10 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 09 Dec 2006 16:35:10 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165617195.26559.4435.camel@hal.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<1165617195.26559.4435.camel@hal.voltaire.com>
Message-ID: <457AC99E.8050402@mellanox.co.il>

Hal Rosenstock wrote:
> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
>   
>> Hal Rosenstock wrote:
>>     
>>> Hi Eitan,
>>>
>>> Just wanted to close the loop on the OpenSM issues of the last couple
>>> days.
>>>
>>> 1. When can you supply an OpenSM verbose log for the InformInfo
>>> subscribe problem you reported earlier today ? Failing that, I don't
>>> know how to reproduce this.
>>>   
>>>       
>> Attached
>>     
I will need to look into it in greater detail. It might be a simulator flow
issue, but I am not sure.

>>> 4. I encourage you to look at and comment on the OpenSM patches rather
>>> than waiting for them to be in the tree.
>>>   
>>>       
>> I am sure you did not mean to, but now I have to admit my limited skills 
>> in catching bugs by reading patches :-( .
>>     
>
> Not just read, but they are there to try out as well.
>   
I will need an automatic flow for that. I cannot keep up with the
amount of patches manually.
But I do not know how to automatically turn the mails into patches
applied to a tree.
> You could try out the patches and do the same thing before they are
> committed.
>
>   
I have automation based on the committed tree that pulls it (git trem),
compiles it and runs regression.
Actually this is how all other code is handled too.


From sashak at voltaire.com  Sat Dec  9 09:46:07 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 19:46:07 +0200
Subject: [openib-general] [PATCH] osm: Routing Tables are full of
 UNREACHABLE instead of real route
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C976D272@mtlexch01.mtl.com>
Message-ID: <20061209174607.GK10000@sashak.voltaire.com>

On 16:13 Sat 09 Dec     , Eitan Zahavi wrote:
> Hi Sasha,
> 
> Your proposal for moving all "dump" files generation to end of sweep - 
> just before "SUBNET UP" is reported - makes perfect sense to me.
> 
> But it is a bit lower in priority to the rest of the stuff.
> Not sure if it worth tackling right now.

Ok, I may do this. This should not be a big deal.

Sasha

> 
> Eitan
> 
> Eitan Zahavi
> Senior Engineering Director, Software Architect
> Mellanox Technologies LTD
> Tel:+972-4-9097208
> Fax:+972-4-9593245
> P.O. Box 586 Yokneam 20692 ISRAEL
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > Sent: Friday, December 08, 2006 11:55 PM
> > To: Eitan Zahavi
> > Cc: Hal Rosenstock; Yevgeny Kliteynik; OPENIB GENERAL
> > Subject: Re: [PATCH] osm: Routing Tables are full of UNREACHABLE
> > instead of real route
> > 
> > Hi Eitan,
> > 
> > On 17:12 Thu 07 Dec     , Eitan Zahavi wrote:
> > > Hi Hal,
> > >
> > > I resolved the mystery behind the osm.fdbs that is now full of
> > > UNREACHABLE instead of correct out ports.
> > >
> > > The problem is a consequence of the new code that does not use the
> > > switch LFT blocks for the intermediate LFT assignments:
> > > The idea of having incremental updates only relies on temporary buffer
> > > that the routing algorithm fills.
> > > Then it is sent to the wire only if there is a diff between the switch
> > > LFT tables (from the SMDB) and the temporary buffer.
> > >
> > > So the switch LFT tables are not being directly updated by the routing
> > > algorithm - but only by the GetResp obtained as reply to the setting.
> > > Until this stage of the description - everything looks right.
> > >
> > > But what is wrong is that the dump of LFT tables is invoked before the
> > > GetResp is obtained.
> > > So if only a single sweep is invoked the resulting osm.fdbs show the
> > > original state of the SMDB tables which is full of 0xFF = UNREACHABLE.
> > 
> > Right.
> > 
> > >
> > > The patch below is taking the easy way and should be probably revisited.
> > > Instead of having a separate algorithm step for dumping out the
> > > resulting GetResp data after all LFT responses were obtained it just
> > > copies the sent LFT blocks to the SMDB.
> > 
> > Would not this be better just to move all dumps at end of the OpenSM
> > heavy sweep. This should be simple, right?
> > 
> > Sasha
> > 
> > >
> > > I think we need to have at least this simple patch until we have the
> > > dump move to a new algorithm step.
> > >
> > > Thanks
> > > Eitan
> > >
> > > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > >
> > > =====================================================================
> > >
> > > diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> > > index 5a55da8..3a62c7f 100644
> > > --- a/osm/opensm/osm_ucast_mgr.c
> > > +++ b/osm/opensm/osm_ucast_mgr.c
> > > @@ -982,7 +982,15 @@ osm_ucast_mgr_set_fwd_table(
> > >                "osm_ucast_mgr_set_fwd_table: ERR 3A05: "
> > >                "Sending linear fwd. tbl. block failed (%s)\n",
> > >                ib_get_err_str( status ) );
> > > -    }
> > > +    } else {
> > > +       /*
> > > +         HACK: for now we will assume we succeeded to send
> > > +         and set the local DB based on it. This should allow
> > > +         us to immediatly dump out our routing
> > > +       */
> > > +       osm_switch_set_ft_block(
> > > +          p_sw, p_mgr->lft_buf + block_id_ho * 64, block_id_ho);
> > > +        }
> > >   }
> > >
> > >   OSM_LOG_EXIT( p_mgr->p_log );
> > >


From sashak at voltaire.com  Sat Dec  9 10:01:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 20:01:01 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457AC7AF.5090202@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
Message-ID: <20061209180101.GL10000@sashak.voltaire.com>

Hi Eitan,

On 16:26 Sat 09 Dec     , Eitan Zahavi wrote:
> 
> Without another devel branch I will not be able to test patches before 
> the make it into the trunk.
> 
> I do not know how to make an automatic mail extraction into patches into 
> tree such that I can have automatic patch check.

You can just pipe the emails with patches to git-am (manually after review
or automatically via procmail), so they will be committed to whichever
local tree/branch you want.

> I am not a great fan of a new branch too.
> 
> So we need to agree that regression runs resulting with bug reporting 
> post commit to the trunk is our mode of work.

That is OK with me. At least as a starting point, automatic nightly
regression tests for the trunk are just fine. If this works, then after
collecting some experience we can think about a "quarantine" branch/tree
and about expanding the regression testing.

> I do not have a big issue with this (but it is more work for Hal).

Hal, what do you say?

Sasha

> 
> Eitan
> 
> Sasha Khapyorsky wrote:
> >On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
> >  
> >>Instead of relying on bug reading I use automatic regression. I wish we 
> >>could agree on some regression that
> >>each developer will have to run before patches are committed to the trunk.
> >>On my side I would love to have an automatic way to include all the 
> >>patches posted (one at a time) run "dead or alive" check
> >>and provide feedback. Currently my automation is limited to testing the 
> >>trunk. So I will always be complaining after the patches are
> >>committed. I think this is the way most other components testing works.
> >>
> >>What kind of regression suite do you and Sasha use?
> >>    
> >
> >On my side it clearly depends from kind of changes. In general I would
> >call this "uni-testing".
> >
> >  
> >>Can we agree on minimal pre-commit testing?
> >>Can we have a branch for that sake where all patches will first have to 
> >>go into for 2 days? (it will allow for pre-trunk testing).
> >>    
> >
> >One more development branch? Will you test (or even see) this? If so I
> >can publish the "fresh" tree.
> >
> >Sasha
> >
> >_______________________________________________
> >openib-general mailing list
> >openib-general at openib.org
> >http://openib.org/mailman/listinfo/openib-general
> >
> >To unsubscribe, please visit 
> >http://openib.org/mailman/listinfo/openib-general
> >  
> 


From sashak at voltaire.com  Sat Dec  9 10:03:44 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 20:03:44 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457AC99E.8050402@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<1165617195.26559.4435.camel@hal.voltaire.com>
	<457AC99E.8050402@mellanox.co.il>
Message-ID: <20061209180344.GM10000@sashak.voltaire.com>

On 16:35 Sat 09 Dec     , Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> Hi Eitan,
> >>>
> >>> Just wanted to close the loop on the OpenSM issues of the last couple
> >>> days.
> >>>
> >>> 1. When can you supply an OpenSM verbose log for the InformInfo
> >>> subscribe problem you reported earlier today ? Failing that, I don't
> >>> know how to reproduce this.
> >>>   
> >>>       
> >> Attached
> >>     
> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure.
> 
> >>> 4. I encourage you to look at and comment on the OpenSM patches rather
> >>> than waiting for them to be in the tree.
> >>>   
> >>>       
> >> I am sure you did not mean to, but now I have to admit my limited skills 
> >> in catching bugs by reading patches :-( .
> >>     
> >
> > Not just read, but they are there to try out as well.
> >   
> I will need an automatic flow for that sake. I can not keep up with the 
> amount of patches manually.
> But I do not know how to automatically convert the mails into patches 
> into a tree.

As stated in my other post, with git it is simple: git-am applies an
mbox just fine.

Sasha

> > You could try out the patches and do the same thing before they are
> > committed.
> >
> >   
> I have automation based on the committed tree that pull it (git trem) , 
> compile and run regression.
> Actually this is how all other code is handled too.
> 
> 


From sashak at voltaire.com  Sat Dec  9 10:11:37 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 20:11:37 +0200
Subject: [openib-general] userspace git conversion status/cut over
In-Reply-To: <20061206082242.GI26787@mellanox.co.il>
References: <1164897683.11808.129709.camel@hal.voltaire.com>
	<456F0AE3.4060209@ichips.intel.com>
	<20061130191717.GJ18978@sashak.voltaire.com>
	<20061206082242.GI26787@mellanox.co.il>
Message-ID: <20061209181137.GO10000@sashak.voltaire.com>

On 10:22 Wed 06 Dec     , Michael S. Tsirkin wrote:
> > Other issue. There is /pub/scm/linux-2.6.18/.git tree, looks it was used
> > for git installation testing or so.
> > 
> > Does somebody use it? Could this be (re)moved?
> 
> No one seemed to care, and 2.6.19 is out anyway :)
> Let's kill it then.

Ok, moved this out.

Sasha


From halr at voltaire.com  Sat Dec  9 10:20:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Dec 2006 13:20:18 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061209180101.GL10000@sashak.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
Message-ID: <1165688413.26559.55471.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote:
> Hi Eitan,
> 
> On 16:26 Sat 09 Dec     , Eitan Zahavi wrote:
> > 
> > Without another devel branch I will not be able to test patches before 
> > the make it into the trunk.
> > 
> > I do not know how to make an automatic mail extraction into patches into 
> > tree such that I can have automatic patch check.
> 
> You can just pipe emails with patches to git-am (manually after review
> or automatically via procmail), so this will be committed in the local
> tree/branch as you want.
> 
> > I am not a great fan of a new branch too.
> > 
> > So we need to agree that regression runs resulting with bug reporting 
> > post commit to the trunk is our mode of work.
> 
> It is ok for me. At least as start point, if we will have automatic
> nightly regression tests for the trunk it is just fine. If this will
> work, and after collecting some experience we may think about
> "quarantine" branch/tree and the regression testing expansion.
> 
> > I do not have a big issue with this (but it is more work for Hal).
> 
> Hal, what do you say?

What is the nightly regression and who will run it?

It seems to me that applying the patches could be automated, or a manual
procedure could be put in place, so I am not keen on maintaining a
pre-trunk branch. I would do so if I were convinced that it can't be done
easily by the methods I mentioned, that the regression would be run
nightly on a continuing basis, and that reports based on the runs would
be issued to interested parties.

-- Hal

> Sasha
> 
> > 
> > Eitan
> > 
> > Sasha Khapyorsky wrote:
> > >On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
> > >  
> > >>Instead on relying on bug reading I use automatic regression. I wish we 
> > >>could agree on some regression that
> > >>each developer will have to run before patches are committed to the trunk.
> > >>On my side I would love to have an automatic way to include all the 
> > >>patches posted (one at a time) run "dead or alive" check
> > >>and provide feedback. Currently my automation is limited to testing the 
> > >>trunk. So I will always be complaining after the patches are
> > >>committed. I think this is the way most other components testing works.
> > >>
> > >>What kind of regression suite do you and Sasha use?
> > >>    
> > >
> > >On my side it clearly depends from kind of changes. In general I would
> > >call this "uni-testing".
> > >
> > >  
> > >>Can we agree on minimal pre-commit testing?
> > >>Can we have a branch for that sake where all patches will first have to 
> > >>go into for 2 days? (it will allow for pre-trunk testing).
> > >>    
> > >
> > >One more development branch? Will you test (or even see) this? If so I
> > >can publish the "fresh" tree.
> > >
> > >Sasha
> > >
> > 


From sashak at voltaire.com  Sat Dec  9 11:11:48 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 21:11:48 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165688413.26559.55471.camel@hal.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
Message-ID: <20061209191148.GP10000@sashak.voltaire.com>

On 13:20 Sat 09 Dec     , Hal Rosenstock wrote:
> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote:
> > Hi Eitan,
> > 
> > On 16:26 Sat 09 Dec     , Eitan Zahavi wrote:
> > > 
> > > Without another devel branch I will not be able to test patches before 
> > > the make it into the trunk.
> > > 
> > > I do not know how to make an automatic mail extraction into patches into 
> > > tree such that I can have automatic patch check.
> > 
> > You can just pipe emails with patches to git-am (manually after review
> > or automatically via procmail), so this will be committed in the local
> > tree/branch as you want.
> > 
> > > I am not a great fan of a new branch too.
> > > 
> > > So we need to agree that regression runs resulting with bug reporting 
> > > post commit to the trunk is our mode of work.
> > 
> > It is ok for me. At least as start point, if we will have automatic
> > nightly regression tests for the trunk it is just fine. If this will
> > work, and after collecting some experience we may think about
> > "quarantine" branch/tree and the regression testing expansion.
> > 
> > > I do not have a big issue with this (but it is more work for Hal).
> > 
> > Hal, what do you say?
> 
> What is the nightly regression and who will run it ?

Good question. I guess Eitan has an automated regression test suite which
is able to pull the _committed_ tree and run a test series. Eitan, right?

> 
> It seems to me that the patches could be automated or a manual procedure
> can be put in place so I am not keen on maintaining a pre-trunk branch
> but would if I am convinced it can't be done easily by the methods I
> mentioned, that the regression would be run nightly on a continuing
> basis, and that reports would be issued based on the runs (to interested
> parties).

Ok.

I think we could start by testing the trunk while the issue with
pre-trunk patches is still open. A systematic regression report would be
a good thing. All this should be a good start, and if I understand
correctly it can be launched immediately. Then we can deal with the
pre-trunk stuff.

Eitan, how hard is it for you to prepare a procmail rule which will
automatically apply the patches from emails to a local pre-trunk
tree? Or do you think it is insufficient?

Sasha

> 
> -- Hal
> 
> > Sasha
> > 
> > > 
> > > Eitan
> > > 
> > > Sasha Khapyorsky wrote:
> > > >On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
> > > >  
> > > >>Instead on relying on bug reading I use automatic regression. I wish we 
> > > >>could agree on some regression that
> > > >>each developer will have to run before patches are committed to the trunk.
> > > >>On my side I would love to have an automatic way to include all the 
> > > >>patches posted (one at a time) run "dead or alive" check
> > > >>and provide feedback. Currently my automation is limited to testing the 
> > > >>trunk. So I will always be complaining after the patches are
> > > >>committed. I think this is the way most other components testing works.
> > > >>
> > > >>What kind of regression suite do you and Sasha use?
> > > >>    
> > > >
> > > >On my side it clearly depends from kind of changes. In general I would
> > > >call this "uni-testing".
> > > >
> > > >  
> > > >>Can we agree on minimal pre-commit testing?
> > > >>Can we have a branch for that sake where all patches will first have to 
> > > >>go into for 2 days? (it will allow for pre-trunk testing).
> > > >>    
> > > >
> > > >One more development branch? Will you test (or even see) this? If so I
> > > >can publish the "fresh" tree.
> > > >
> > > >Sasha
> > > >
> > > 
> 


From mst at mellanox.co.il  Sat Dec  9 11:34:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 9 Dec 2006 21:34:43 +0200
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <adapsau6t1p.fsf@cisco.com>
References: <045401c71b02$d8d17a40$0281a8c0@ebpc> <adapsau6t1p.fsf@cisco.com>
Message-ID: <20061209193443.GB6891@mellanox.co.il>

>  > How about an OpenFabrics API version #define?
> 
> No other kernel subsystem has one, so I don't think it's realistic to
> expect one for IB.

include/net/ieee80211.h has one. It does not seem to work too well though.

-- 
MST


From eitan at mellanox.co.il  Sat Dec  9 11:36:44 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 09 Dec 2006 21:36:44 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061209191148.GP10000@sashak.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
	<20061209191148.GP10000@sashak.voltaire.com>
Message-ID: <457B104C.3090802@mellanox.co.il>

Sasha Khapyorsky wrote:
> On 13:20 Sat 09 Dec     , Hal Rosenstock wrote:
>   
>> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote:
>>     
>>> Hi Eitan,
>>>
>>> On 16:26 Sat 09 Dec     , Eitan Zahavi wrote:
>>>       
>>>> Without another devel branch I will not be able to test patches before 
>>>> the make it into the trunk.
>>>>
>>>> I do not know how to make an automatic mail extraction into patches into 
>>>> tree such that I can have automatic patch check.
>>>>         
>>> You can just pipe emails with patches to git-am (manually after review
>>> or automatically via procmail), so this will be committed in the local
>>> tree/branch as you want.
>>>
>>>       
>>>> I am not a great fan of a new branch too.
>>>>
>>>> So we need to agree that regression runs resulting with bug reporting 
>>>> post commit to the trunk is our mode of work.
>>>>         
>>> It is ok for me. At least as start point, if we will have automatic
>>> nightly regression tests for the trunk it is just fine. If this will
>>> work, and after collecting some experience we may think about
>>> "quarantine" branch/tree and the regression testing expansion.
>>>
>>>       
>>>> I do not have a big issue with this (but it is more work for Hal).
>>>>         
>>> Hal, what do you say?
>>>       
>> What is the nightly regression and who will run it ?
>>     
>
> Good question. I guess Eitan has automated regression test suite which
> is able to pull _committed_ tree and run test series. Eitan, right?
>   
Yes, that is what we have: both simulated fabrics and the ULPs regression,
which uses OpenSM from the trunk (running a set of tests on smaller fabrics).
>   
>> It seems to me that the patches could be automated or a manual procedure
>> can be put in place so I am not keen on maintaining a pre-trunk branch
>> but would if I am convinced it can't be done easily by the methods I
>> mentioned, that the regression would be run nightly on a continuing
>> basis, and that reports would be issued based on the runs (to interested
>> parties).
>>     
>
> Ok.
>
> I think we could start testing with trunk if we still have the issue
> with pre-trunk patches. Systematic regression report would be good
> thing. All this should be good start, and if I understand correctly this
> can be launched immediately. Then we can deal with pre-trunk stuff.
>
> Eitan, how is it hard for you to prepare procmail's rule which will
> automatically apply the patches from emails to the local pre-trunk
> tree? Or do you think it is insufficient?
>   
I am not sure I can do the procmail thing myself. I am not familiar with 
it and lack the time to learn.
I can ask around. But why do we need to define a testing method that is 
different from the rest of the OFA tree?
> Sasha
>
>   
>> -- Hal
>>
>>     
>>> Sasha
>>>
>>>       
>>>> Eitan
>>>>
>>>> Sasha Khapyorsky wrote:
>>>>         
>>>>> On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
>>>>>  
>>>>>           
>>>>>> Instead on relying on bug reading I use automatic regression. I wish we 
>>>>>> could agree on some regression that
>>>>>> each developer will have to run before patches are committed to the trunk.
>>>>>> On my side I would love to have an automatic way to include all the 
>>>>>> patches posted (one at a time) run "dead or alive" check
>>>>>> and provide feedback. Currently my automation is limited to testing the 
>>>>>> trunk. So I will always be complaining after the patches are
>>>>>> committed. I think this is the way most other components testing works.
>>>>>>
>>>>>> What kind of regression suite do you and Sasha use?
>>>>>>    
>>>>>>             
>>>>> On my side it clearly depends from kind of changes. In general I would
>>>>> call this "uni-testing".
>>>>>
>>>>>  
>>>>>           
>>>>>> Can we agree on minimal pre-commit testing?
>>>>>> Can we have a branch for that sake where all patches will first have to 
>>>>>> go into for 2 days? (it will allow for pre-trunk testing).
>>>>>>    
>>>>>>             
>>>>> One more development branch? Will you test (or even see) this? If so I
>>>>> can publish the "fresh" tree.
>>>>>
>>>>> Sasha
>>>>>


From mst at mellanox.co.il  Sat Dec  9 12:08:37 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 9 Dec 2006 22:08:37 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061209191148.GP10000@sashak.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
	<20061209191148.GP10000@sashak.voltaire.com>
Message-ID: <20061209200837.GF6891@mellanox.co.il>

> Eitan, how is it hard for you to prepare procmail's rule which will
> automatically apply the patches from emails to the local pre-trunk
> tree? Or do you think it is insufficient?

This sounds like a fragile process. It seems much easier to just
have an unstable branch with untested patches. No?

-- 
MST


From eitan at mellanox.co.il  Sat Dec  9 12:09:57 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 09 Dec 2006 22:09:57 +0200
Subject: [openib-general] [PATCH] osm: trivial osm_log mismatch on vendor
	mlx
Message-ID: <457B1815.7000404@mellanox.co.il>

Hi Hal

This patch fixes two osm_log format issues in the mlx vendor code: a %p
conversion that had no matching argument, and a %u conversion that
should be %lu.

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

---
 osm/libvendor/osm_vendor_mlx_dispatcher.c |    3 ++-
 osm/libvendor/osm_vendor_mlx_txn.c        |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/osm/libvendor/osm_vendor_mlx_dispatcher.c b/osm/libvendor/osm_vendor_mlx_dispatcher.c
index e8b47dd..7e3bd78 100644
--- a/osm/libvendor/osm_vendor_mlx_dispatcher.c
+++ b/osm/libvendor/osm_vendor_mlx_dispatcher.c
@@ -134,7 +134,8 @@ osmv_dispatch_mad(IN osm_bind_handle_t
   {

     osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
-            "The bind handle %p is being closed. The MAD will not be dispatched.\n");
+            "The bind handle %p is being closed. "
+            "The MAD will not be dispatched.\n", p_bo);

     ret = IB_INTERRUPTED;
     goto dispatch_mad_done;
diff --git a/osm/libvendor/osm_vendor_mlx_txn.c b/osm/libvendor/osm_vendor_mlx_txn.c
index 1fd262f..234e33b 100644
--- a/osm/libvendor/osm_vendor_mlx_txn.c
+++ b/osm/libvendor/osm_vendor_mlx_txn.c
@@ -631,7 +631,7 @@ __osmv_txn_timeout_cb(IN uint64_t key,

         osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
                 "__osmv_txn_timeout_cb: "
-                "Retry request timout in : %u [msec].\n",
+                "Retry request timout in : %lu [msec].\n",
                 next_timeout_ms);
       }
     }
--
1.4.4.1.GIT
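
Both hunks in the patch above repair a mismatch between a printf-style
format string and its arguments. The following is a minimal sketch of the
same two cases, not OpenSM code: log_to_buf() is a hypothetical stand-in
for osm_log() that writes into a buffer, the helper names are invented for
illustration, and next_timeout_ms is assumed to be an unsigned long, as the
%lu in the patch suggests. The format attribute is a GCC/Clang extension
that lets the compiler flag such mismatches at build time.

```c
#include <stdarg.h>
#include <stdio.h>

/* Ask GCC/Clang to check format strings against their arguments. */
#if defined(__GNUC__)
#define PRINTF_LIKE(fmt_idx, arg_idx) \
	__attribute__((format(printf, fmt_idx, arg_idx)))
#else
#define PRINTF_LIKE(fmt_idx, arg_idx)
#endif

/* Hypothetical stand-in for osm_log(): formats into a caller-supplied
 * buffer so the result can be inspected. Not part of OpenSM. */
static int log_to_buf(char *buf, size_t len, const char *fmt, ...)
	PRINTF_LIKE(3, 4);

static int log_to_buf(char *buf, size_t len, const char *fmt, ...)
{
	va_list args;
	int n;

	va_start(args, fmt);
	n = vsnprintf(buf, len, fmt, args);
	va_end(args);
	return n;
}

/* An unsigned long argument must be printed with %lu, not %u. */
static int log_retry_timeout(char *buf, size_t len,
			     unsigned long next_timeout_ms)
{
	return log_to_buf(buf, len,
			  "Retry request timeout in : %lu [msec].\n",
			  next_timeout_ms);
}

/* Every %p conversion needs a matching pointer argument. */
static int log_handle_closed(char *buf, size_t len, void *p_bo)
{
	return log_to_buf(buf, len,
			  "The bind handle %p is being closed. "
			  "The MAD will not be dispatched.\n", p_bo);
}
```

With the attribute in place, dropping the p_bo argument or switching %lu
back to %u in these helpers should produce a -Wformat warning on compilers
that support it, instead of a silent mismatch.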


From sashak at voltaire.com  Sat Dec  9 13:07:24 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 23:07:24 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061209200837.GF6891@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
	<20061209191148.GP10000@sashak.voltaire.com>
	<20061209200837.GF6891@mellanox.co.il>
Message-ID: <20061209210724.GQ10000@sashak.voltaire.com>

On 22:08 Sat 09 Dec     , Michael S. Tsirkin wrote:
> > Eitan, how is it hard for you to prepare procmail's rule which will
> > automatically apply the patches from emails to the local pre-trunk
> > tree? Or do you think it is insufficient?
> 
> This sounds like a fragile process. It seems much easier to just
> have an unstable branch with untested patches. No?

I think the two are almost equivalent in how they would be generated.
The difference is that an "unstable" branch has to be published and will
require some maintenance.

Sasha


From sashak at voltaire.com  Sat Dec  9 13:44:44 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 9 Dec 2006 23:44:44 +0200
Subject: [openib-general] [PATCH] osm.mcfdbs - ignore "empty" MLID or
 switch]
In-Reply-To: <20061206132643.GR26787@mellanox.co.il>
References: <457698BE.10907@mellanox.co.il> <4576C33C.7050204@mellanox.co.il>
	<20061206132643.GR26787@mellanox.co.il>
Message-ID: <20061209214444.GS10000@sashak.voltaire.com>

On 15:26 Wed 06 Dec     , Michael S. Tsirkin wrote:
> > 
> > Actually switches that do not have any MCG entry will not be included
> > in the dump file.
> > 
> > Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> > 
> > --- osm/opensm/osm_mcast_mgr.c    2006-12-06 12:39:13.018015000 +0200
> > +++ osm/opensm/osm_mcast_mgr.c    2006-12-06 12:06:29.602097000 +0200
> 
> All, to make integrating patches easier,
> please try to actually use git diff to generate patches,

Or just use 'git-format-patch', which generates an mbox ready for email
submission.

Sasha

> and put patches in following format:
> 
> Subject: [PATCH anytext] short log
> 
> From: <> <-------- optional author line if not same as person posting
> Short explanation for commit log.
> 
> Signed-off-by: <>
> 
> ---
> 
> arbitrarily long explanation
> 
> patch
> 
> 
> -- 
> MST
> 


From halr at voltaire.com  Sat Dec  9 14:05:06 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Dec 2006 17:05:06 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457B104C.3090802@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
	<20061209191148.GP10000@sashak.voltaire.com>
	<457B104C.3090802@mellanox.co.il>
Message-ID: <1165701888.26559.65048.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 14:36, Eitan Zahavi wrote:
> Sasha Khapyorsky wrote:
> > On 13:20 Sat 09 Dec     , Hal Rosenstock wrote:
> >   
> >> On Sat, 2006-12-09 at 13:01, Sasha Khapyorsky wrote:
> >>     
> >>> Hi Eitan,
> >>>
> >>> On 16:26 Sat 09 Dec     , Eitan Zahavi wrote:
> >>>       
> >>>> Without another devel branch I will not be able to test patches before 
> >>>> the make it into the trunk.
> >>>>
> >>>> I do not know how to make an automatic mail extraction into patches into 
> >>>> tree such that I can have automatic patch check.
> >>>>         
> >>> You can just pipe emails with patches to git-am (manually after review
> >>> or automatically via procmail), so this will be committed in the local
> >>> tree/branch as you want.
> >>>
> >>>       
> >>>> I am not a great fan of a new branch too.
> >>>>
> >>>> So we need to agree that regression runs resulting with bug reporting 
> >>>> post commit to the trunk is our mode of work.
> >>>>         
> >>> It is ok for me. At least as start point, if we will have automatic
> >>> nightly regression tests for the trunk it is just fine. If this will
> >>> work, and after collecting some experience we may think about
> >>> "quarantine" branch/tree and the regression testing expansion.
> >>>
> >>>       
> >>>> I do not have a big issue with this (but it is more work for Hal).
> >>>>         
> >>> Hal, what do you say?
> >>>       
> >> What is the nightly regression and who will run it ?
> >>     
> >
> > Good question. I guess Eitan has automated regression test suite which
> > is able to pull _committed_ tree and run test series. Eitan, right?
> >   
> Yes that is what we have.
> Both simulated fabrics as well as the ULPs regression which uses OpenSM 
> from the trunk (running a set of tests on smaller fabrics).
> >   
> >> It seems to me that the patches could be automated or a manual procedure
> >> can be put in place so I am not keen on maintaining a pre-trunk branch
> >> but would if I am convinced it can't be done easily by the methods I
> >> mentioned, that the regression would be run nightly on a continuing
> >> basis, and that reports would be issued based on the runs (to interested
> >> parties).
> >>     
> >
> > Ok.
> >
> > I think we could start testing with trunk if we still have the issue
> > with pre-trunk patches. Systematic regression report would be good
> > thing. All this should be good start, and if I understand correctly this
> > can be launched immediately. Then we can deal with pre-trunk stuff.
> >
> > Eitan, how is it hard for you to prepare procmail's rule which will
> > automatically apply the patches from emails to the local pre-trunk
> > tree? Or do you think it is insufficient?
> >   
> I am not sure I can do the procmail thing myself. I am not familiar with 
> it and lack the time to learn.
> I can ask around. But I question why we need to define a different 
> testing method from the rest of the OFA tree?

The request for an extra branch for this is different from the rest of
the OFA tree.

-- Hal

> > Sasha
> >
> >   
> >> -- Hal
> >>
> >>     
> >>> Sasha
> >>>
> >>>       
> >>>> Eitan
> >>>>
> >>>> Sasha Khapyorsky wrote:
> >>>>         
> >>>>> On 18:42 Fri 08 Dec     , Eitan Zahavi wrote:
> >>>>>  
> >>>>>           
> >>>>>> Instead on relying on bug reading I use automatic regression. I wish we 
> >>>>>> could agree on some regression that
> >>>>>> each developer will have to run before patches are committed to the trunk.
> >>>>>> On my side I would love to have an automatic way to include all the 
> >>>>>> patches posted (one at a time) run "dead or alive" check
> >>>>>> and provide feedback. Currently my automation is limited to testing the 
> >>>>>> trunk. So I will always be complaining after the patches are
> >>>>>> committed. I think this is the way most other components testing works.
> >>>>>>
> >>>>>> What kind of regression suite do you and Sasha use?
> >>>>>>    
> >>>>>>             
> >>>>> On my side it clearly depends from kind of changes. In general I would
> >>>>> call this "uni-testing".
> >>>>>
> >>>>>  
> >>>>>           
> >>>>>> Can we agree on minimal pre-commit testing?
> >>>>>> Can we have a branch for that sake where all patches will first have to 
> >>>>>> go into for 2 days? (it will allow for pre-trunk testing).
> >>>>>>    
> >>>>>>             
> >>>>> One more development branch? Will you test (or even see) this? If so I
> >>>>> can publish the "fresh" tree.
> >>>>>
> >>>>> Sasha
> >>>>>
> 


From halr at voltaire.com  Sat Dec  9 14:08:54 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 09 Dec 2006 17:08:54 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061209200837.GF6891@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<20061208221001.GG9193@sashak.voltaire.com>
	<457AC7AF.5090202@mellanox.co.il>
	<20061209180101.GL10000@sashak.voltaire.com>
	<1165688413.26559.55471.camel@hal.voltaire.com>
	<20061209191148.GP10000@sashak.voltaire.com>
	<20061209200837.GF6891@mellanox.co.il>
Message-ID: <1165701912.26559.65050.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 15:08, Michael S. Tsirkin wrote:
> > Eitan, how is it hard for you to prepare procmail's rule which will
> > automatically apply the patches from emails to the local pre-trunk
> > tree? Or do you think it is insufficient?
> 
> This sounds like a fragile process. It seems much easier to just
> have an unstable branch with untested patches. No?

"Untested" is an exaggeration. They are tested, but not by Eitan's
regression.

-- Hal


From mst at mellanox.co.il  Sat Dec  9 22:43:46 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 08:43:46 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165701912.26559.65050.camel@hal.voltaire.com>
References: <1165701912.26559.65050.camel@hal.voltaire.com>
Message-ID: <20061210064346.GC10403@mellanox.co.il>

> > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > automatically apply the patches from emails to the local pre-trunk
> > > tree? Or do you think it is insufficient?
> > 
> > This sounds like a fragile process. It seems much easier to just
> > have an unstable branch with untested patches. No?
> 
> "Untested" is an exaggeration. They are tested, but not by Eitan's
> regression.

Sorry, I'm not trying to influence any policy decisions here;
I'm coming purely from the git angle. *If* you want Eitan to test and Ack some
patches, *and want to automate the testing part*, the simplest thing to do is to
apply them on some git branch.

-- 
MST


From ogerlitz at voltaire.com  Sat Dec  9 23:42:34 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 09:42:34 +0200
Subject: [openib-general] Assigning IP addresses to IB interfaces
In-Reply-To: <d2ad857f0612081431q6decd412o2718019aaed1ae03@mail.gmail.com>
References: <d2ad857f0612081431q6decd412o2718019aaed1ae03@mail.gmail.com>
Message-ID: <457BBA6A.3020209@voltaire.com>

Adit Ranadive wrote:
> I have installed the OpenIB gen2 driver but the IB interfaces haven't
> been assigned any IP addresses.
> Is it possible to assign them IP addresses using ifconfig and ping
> between the interfaces of two machines?

yes


From ogerlitz at voltaire.com  Sat Dec  9 23:50:37 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 09:50:37 +0200
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <045401c71b02$d8d17a40$0281a8c0@ebpc>
References: <045401c71b02$d8d17a40$0281a8c0@ebpc>
Message-ID: <457BBC4D.6050704@voltaire.com>

Eric Barton wrote:
>>  > Actually a single OFED version #define would most probably 
>>  > suit my purposes -
>>  > is that controversial?
>>
>> It might be sensible for OFED to supply that, if it's going to
>> backport drivers to old kernels.  But you should also cope with
>> non-OFED (vanilla upstream) drivers, probably by testing
>> LINUX_VERSION_CODE too I suppose.
> 
> How about an OpenFabrics API version #define?

The IB drivers provided by OFED are based on the mainline kernel ones.
Moreover, the existence of OFED is temporary: over time, distros will
pick up the IB code themselves from releases of the Linux kernel and of
the user space packages (libraries).

There is no point in adding a versioning system.

Or.


From ogerlitz at voltaire.com  Sun Dec 10 00:49:51 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 10:49:51 +0200
Subject: [openib-general] [RFC] [PATCH V2 0/3] bonding support for
 operation over IPoIB
In-Reply-To: <Pine.LNX.4.64.0611301256130.26156@zuben>
References: <Pine.LNX.4.64.0611301256130.26156@zuben>
Message-ID: <457BCA2F.1010709@voltaire.com>

Or Gerlitz wrote:
> This patch series is a second version (see below link to V1) of the suggested
> changes to the bonding driver such that it would be able to support non ARPHRD_ETHER
> netdevices for its High-Availability (active-backup) mode.
> 
> The motivation is to enable the bonding driver on its HA mode to work with the
> IP over Infiniband (IPoIB) driver. With these patches I was able to enslave
> IPoIB netdevices and run TCP, UDP, IP (UDP) Multicast and ICMP traffic with
> fail-over and fail-back working fine. My working env was the net-2.6.20 git.
> 
> Moreover, as IPoIB is also the IB ARP provider for the RDMA CM driver which
> is used by native IB ULPs whose addressing scheme is based on IP (e.g. iSER, SDP,
> Lustre, NFSoRDMA, RDS), bonding support for IPoIB devices **enables** HA for
> these ULPs. This holds because when the ULP is informed by the IB HW of the failure
> of the current IB connection, it just needs to reconnect, and the bonding
> device will now issue the IB ARP over the active IPoIB slave.

Given the importance of, and great need for, HA, I would really like to get
feedback from people testing configurations with bonded IPoIB devices
before moving forward with this.

Or.


From ogerlitz at voltaire.com  Sun Dec 10 01:08:38 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 11:08:38 +0200
Subject: [openib-general] [PATCH 0/5 v3] 2.6.20 rdma/cma: add userspace
 support
In-Reply-To: <ada1wncc2ss.fsf@cisco.com>
References: <ada1wncc2ss.fsf@cisco.com>
Message-ID: <457BCE96.3070708@voltaire.com>

Roland Dreier wrote:
>  > + 5/5 is the CMA user space support. I only did a light review of it
>  > but my understanding is that Sean used the in kernel ib_ucm
>  > design/code as the base line for this driver so there should be no
>  > special issues here.
> 
> OK, I'll have to take a close look at this.  ucm has known-broken
> object lifetime handling (probably oopsable from userspace)

Where do we stand with this patch series? I understand the 2.6.20
RC1 feature window is about to be closed in a few days.

Or.


From ogerlitz at voltaire.com  Sun Dec 10 01:21:17 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 11:21:17 +0200
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <1165517253.14800.283.camel@brick.pathscale.com>
References: <1165517253.14800.283.camel@brick.pathscale.com>
Message-ID: <457BD18D.7000403@voltaire.com>

Ralph Campbell wrote:
> This version of the patch fixes ipath_sg_dma_address() and
> updates the comments for ipath_dma.c as Or Gerlitz
> suggested.

> This patch implements the interposing DMA mapping functions to allow
> support for IOMMUs and remove the dependence on phys_to_virt() and
> bus_to_virt().

Ralph,

The patch seems ready, modulo the resolution of whether you implement the
addresses returned by the ipath ib_dma_map_xxx code as keys into a SW
IOTLB (which means you return dma_addr_t and not u64 but assign it
ipath semantics) or choose a different path to follow (i.e. assume the
problem exists only under the 32-bit / high-mem config, which ipath does
not support, do nothing, etc.) - whatever you settle with Roland.

Or.


From ogerlitz at voltaire.com  Sun Dec 10 02:19:01 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 12:19:01 +0200
Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work
 with proposed 2.6.20 kernel CMA
In-Reply-To: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
Message-ID: <457BDF15.6090608@voltaire.com>

Sean Hefty wrote:
> Updates the librdmacm to work with ABI version 3, which is the proposed
> kernel changes for inclusion in 2.6.20.

Sean,

rdma_leave_multicast does not return zero on success but rather 24, which
is the length of the leave mcast msg.

The patch is made over your patch; can you please queue this somewhere so
it will not be forgotten?

Or.

> --- librdmacm/src/cma.c 2006-12-10 12:55:03.000000000 +0200
> +++ librdmacm-multicast/src/cma.c       2006-12-10 13:15:12.000000000 +0200
> @@ -1015,6 +1015,8 @@ int rdma_leave_multicast(struct rdma_cm_
>         ret = write(id->channel->fd, msg, size);
>         if (ret != size)
>                 ret = (ret > 0) ? -ENODATA : ret;
> +       else
> +               ret = 0;
> 
>         pthread_mutex_lock(&id_priv->mut);
>         while (mc->events_completed < resp->events_reported)


From ogerlitz at voltaire.com  Sun Dec 10 03:20:26 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Sun, 10 Dec 2006 13:20:26 +0200 (IST)
Subject: [openib-general] SLES10 /sbin/ip shows wrong multicast addresses
 for IPoIB devices
Message-ID: <Pine.LNX.4.64.0612101308570.12972@zuben>

I see now that /sbin/ip from the SLES10 package iproute2-2.6.15-14.4 shows wrong
multicast hardware (L2) addresses for IPoIB devices, whereas on SLES9 it works
just fine (there the package is iproute2-2.4.7-866.8).

By strace-ing it, I can see that the utility uses /proc/net/dev_mcast as the
source for the hw mcast addresses, where these are reported fine, but then
the lower 32 bits are somehow chopped and replaced by zeros, see below.

Or.

sage:~ # /sbin/ip a s ib1
6: ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 128
    link/infiniband 00:00:04:05:fe:80:00:00:00:00:00:00:00:02:c9:02:00:20:13:f2 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
    inet 192.168.10.153/24 brd 192.168.10.255 scope global ib1
    inet6 fe80::202:c902:20:13f2/64 scope link
       valid_lft forever preferred_lft forever

sage:~ # /sbin/ip m s ib1

6:      ib1
        link  00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:00
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:01:00:00:00:00
        link  00:ff:ff:ff:ff:12:60:1b:00:00:00:00:00:00:00:00:00:00:00:00
        link  00:ff:ff:ff:ff:12:40:1b:00:00:00:00:00:00:00:00:00:00:00:00
        inet  224.5.5.5
        inet  224.0.0.1
        inet6 ff02::1:ff20:13f2
        inet6 ff02::1

sage:~ # strace /sbin/ip m s ib1 2>&1 | grep open

open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib64/libresolv.so.2", O_RDONLY) = 3
open("/lib64/libc.so.6", O_RDONLY)      = 3
open("/proc/net/dev_mcast", O_RDONLY)   = 4
open("/proc/net/igmp", O_RDONLY)        = 4
open("/proc/net/igmp6", O_RDONLY)       = 4

sage:~ # cat /proc/net/dev_mcast | grep ib1

6    ib1             1     0     00ffffffff12401b000000000000000000050505
6    ib1             1     0     00ffffffff12601b0000000000000001ff2013f2
6    ib1             1     0     00ffffffff12601b000000000000000000000001
6    ib1             1     0     00ffffffff12401b000000000000000000000001


From monis at voltaire.com  Sun Dec 10 03:40:17 2006
From: monis at voltaire.com (Moni Shoua)
Date: Sun, 10 Dec 2006 13:40:17 +0200
Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module parameters
Message-ID: <457BF221.8080701@voltaire.com>

Hi,
This patch was sent a while ago and I'd like to repost it now.

thanks
MoniS


From: Leonid Arsh <leonida at voltaire.com>

Adds module parameters that enable setting some of the HCA
profile values.
Signed-off-by: Leonid Arsh <leonida at voltaire.com>
Signed-off-by: Moni Shoua <monis at voltaire.com>
---
 mthca_main.c |  115 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 104 insertions(+), 11 deletions(-)
diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 47ea021..deb0289 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -82,21 +82,110 @@ MODULE_PARM_DESC(tune_pci, "increase PCI
 
 struct mutex mthca_device_mutex;
 
+#define MTHCA_DEFAULT_NUM_QP            (1 << 16)
+#define MTHCA_DEFAULT_RDB_PER_QP        (1 << 2)
+#define MTHCA_DEFAULT_NUM_CQ            (1 << 16)
+#define MTHCA_DEFAULT_NUM_MCG           (1 << 13)
+#define MTHCA_DEFAULT_NUM_MPT           (1 << 17)
+#define MTHCA_DEFAULT_NUM_MTT           (1 << 20)
+#define MTHCA_DEFAULT_NUM_UDAV          (1 << 15)
+#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18)
+#define MTHCA_DEFAULT_NUM_UARC_SIZE     (1 << 18)
+
+static struct mthca_profile default_profile = {
+	.num_qp             = MTHCA_DEFAULT_NUM_QP,
+	.rdb_per_qp         = MTHCA_DEFAULT_RDB_PER_QP,
+	.num_cq             = MTHCA_DEFAULT_NUM_CQ,
+	.num_mcg            = MTHCA_DEFAULT_NUM_MCG,
+	.num_mpt            = MTHCA_DEFAULT_NUM_MPT,
+	.num_mtt            = MTHCA_DEFAULT_NUM_MTT,
+	.num_udav           = MTHCA_DEFAULT_NUM_UDAV,          /* Tavor only */
+	.fmr_reserved_mtts  = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */
+	.uarc_size          = MTHCA_DEFAULT_NUM_UARC_SIZE,     /* Arbel only */
+};
+
+module_param_named(num_qp, default_profile.num_qp, int, 0444);
+MODULE_PARM_DESC(num_qp, "maximum number of available QPs per HCA");
+
+module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444);
+MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP");
+
+module_param_named(num_cq, default_profile.num_cq, int, 0444);
+MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA");
+
+module_param_named(num_mcg, default_profile.num_mcg, int, 0444);
+MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA");
+
+module_param_named(num_mpt, default_profile.num_mpt, int, 0444);
+MODULE_PARM_DESC(num_mpt,
+		 "maximum number of memory protection table entries per HCA");
+
+module_param_named(num_mtt, default_profile.num_mtt, int, 0444);
+MODULE_PARM_DESC(num_mtt,
+		 "maximum number of memory translation table segments per HCA");
+/* Tavor only */
+module_param_named(num_udav, default_profile.num_udav, int, 0444);
+MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA");
+
+/* Tavor only */
+module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444);
+MODULE_PARM_DESC(fmr_reserved_mtts,
+		 "number of memory translation table segments reserved for FMR");
+
 static const char mthca_version[] __devinitdata =
 	DRV_NAME ": Mellanox InfiniBand HCA driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
-static struct mthca_profile default_profile = {
-	.num_qp		   = 1 << 16,
-	.rdb_per_qp	   = 4,
-	.num_cq		   = 1 << 16,
-	.num_mcg	   = 1 << 13,
-	.num_mpt	   = 1 << 17,
-	.num_mtt	   = 1 << 20,
-	.num_udav	   = 1 << 15,	/* Tavor only */
-	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
-	.uarc_size	   = 1 << 18,	/* Arbel only */
-};
+
+static int __devinit mthca_check_profile_value(int* pval, int pval_default){
+	/* value must be positive and power of 2 */
+	int old_pval = *pval;
+
+	if (old_pval <= 0)
+		*pval = pval_default;
+	else
+		*pval = roundup_pow_of_two(old_pval);
+
+	return old_pval-*pval;
+}
+
+#define mthca_check_profile_and_warn(name, var, defval) \
+	if (mthca_check_profile_value(&var, defval)) \
+		mthca_warn(mdev, "invalid %s passed. changed to %d.\n", #name, var); 
+
+static int __devinit mthca_validate_profile(struct mthca_dev *mdev,
+                                            struct mthca_profile *profile)
+{
+
+	mthca_check_profile_and_warn(num_qp, default_profile.num_qp,
+						 MTHCA_DEFAULT_NUM_QP);
+	mthca_check_profile_and_warn(rdb_per_qp, default_profile.rdb_per_qp,
+						 MTHCA_DEFAULT_RDB_PER_QP);
+	mthca_check_profile_and_warn(num_cq, default_profile.num_cq,
+						 MTHCA_DEFAULT_NUM_CQ);
+	mthca_check_profile_and_warn(num_mcg, default_profile.num_mcg,
+						 MTHCA_DEFAULT_NUM_MCG);
+	mthca_check_profile_and_warn(num_mpt, default_profile.num_mpt,
+						 MTHCA_DEFAULT_NUM_MPT);
+	mthca_check_profile_and_warn(num_mtt, default_profile.num_mtt,
+						 MTHCA_DEFAULT_NUM_MTT);
+
+	if (!mthca_is_memfree(mdev)) {
+		mthca_check_profile_and_warn(num_udav, default_profile.num_udav,
+							 MTHCA_DEFAULT_NUM_UDAV);
+		mthca_check_profile_and_warn(fmr_reserved_mtts, default_profile.fmr_reserved_mtts,
+							 MTHCA_DEFAULT_NUM_RESERVED_MTTS);
+
+		if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) {
+			mthca_err(mdev, "Invalid fmr_reserved_mtts parameter "
+					  "value (%d). Must be lower than num_mtt (%d)\n",
+					  default_profile.fmr_reserved_mtts,
+					  default_profile.num_mtt ); 
+			return -EINVAL;
+		}
+	}
+	return 0;
+}
 
 static int __devinit mthca_tune_pci(struct mthca_dev *mdev)
 {
@@ -1084,6 +1173,10 @@ static int __mthca_init_one(struct pci_d
 	if (err)
 		goto err_cmd;
 
+	err = mthca_validate_profile(mdev, &default_profile);
+	if (err)
+		goto err_cmd;
+
 	err = mthca_init_hca(mdev);
 	if (err)
 		goto err_cmd;


From mst at mellanox.co.il  Sun Dec 10 05:41:37 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 15:41:37 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061205161944.GD30209@mellanox.co.il>
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
Message-ID: <20061210134137.GL29174@mellanox.co.il>

The following patch adds experimental support for IPoIB connected mode.
The idea is to increase performance by increasing the MTU
from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD.
With this code, I'm able to get 800MByte/sec or more with netperf
without options on a Mellanox 4x back-to-back DDR system.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

Changes from the previous revision:
- Use scatter on RX side instead of allocating a linear 64K skb
- User now must explicitly enable connected mode through
  sysfs for each interface (I looked at using ethtool, and didn't find
  an appropriate option for that).
  A warning is printed when it's enabled.
- Print a warning about multicast breakage when MTU > 2044
- Move more code to within #ifdef CONFIG_INFINIBAND_IPOIB_CM to avoid
  affecting code footprint when disabled at compile time

Please review, and consider for merging.

I labeled CM support as experimental, and set it to disabled by default,
although it's been very stable for me, mostly because there are still some
things to be addressed before it's as usable as IPoIB UD. I am very interested
in getting this code in shape for merging as early as possible, as opposed to
maintaining it out of tree until it's fully mature, and I tried to split the
CM code in a separate file to make this feasible.

Let me know whether this was a good idea, or whether more needs to be done
in this direction.

Note that the connected mode support adds very little overhead when not activated
at run time, and zero data-path overhead when not activated at compile time.
Here's a short description of what the patch does:

a. The code's here:
git://staging.openfabrics.org/~mst/linux-2.6/.git ipoib_cm_branch
This is based on 2.6.19, so
~>git diff v2.6.19..ipoib_cm_branch
will show what I have done so far.

b. How to activate:
Server:
#modprobe ib_ipoib
#echo connected > /sys/class/net/ib0/mode
#/sbin/ifconfig ib0 mtu 65520
#./netperf-2.4.2/src/netserver

Client:
#modprobe ib_ipoib
#echo connected > /sys/class/net/ib0/mode
#/sbin/ifconfig ib0 mtu 65520
#./netperf-2.4.2/src/netperf -H 11.4.3.68 -f M
	TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.3.68 (11.4.3.68)
	port 0 AF_INET : demo
	Recv   Send    Send
	Socket Socket  Message  Elapsed
	Size   Size    Size     Time     Throughput
	bytes  bytes   bytes    secs.    MBytes/sec

	87380  16384  16384    10.01     891.21

c. TODO list
1. Use timer to clean up stale RX connections
2. (Optional) Send side S/G support
3. (Optional) Make CM use same CQ IPoIB uses for UD

d. Limitations
With MTU > 2044, UDP multicast and UDP connections to IPoIB UD mode
currently don't work since we get packets that are too large to
send over a UD QP.
As a work around, one can now create separate interfaces
for use with CM and UD mode.

e. Some notes on code
1. SRQ is used for scalability to large cluster sizes
2. Only RC connections are used (UC does not support SRQ now)
3. Retry count is set to 0 since spec draft warns against retries
4. Each connection is used for data transfers in only 1 direction,
   so each connection is either active(TX) or passive (RX).
   2 sides that want to communicate create 2 connections.
5. Each active (TX) connection has a separate CQ for send completions -
   this keeps the code simple without CQ resize and other tricks

I'm looking at ways to limit the path mtu for these connections, to make it work.

diff --git a/drivers/infiniband/ulp/ipoib/Kconfig b/drivers/infiniband/ulp/ipoib/Kconfig
index c75322d..083c729 100644
--- a/drivers/infiniband/ulp/ipoib/Kconfig
+++ b/drivers/infiniband/ulp/ipoib/Kconfig
@@ -8,6 +8,20 @@ config INFINIBAND_IPOIB
 
 	  See Documentation/infiniband/ipoib.txt for more information
 
+config INFINIBAND_IPOIB_CM
+	bool "IP-over-InfiniBand Connected Mode support"
+	depends on INFINIBAND_IPOIB && EXPERIMENTAL
+	default n
+	---help---
+	  This option enables experimental support for IPoIB connected mode.
+	  After enabling this option, you need to switch to connected mode through
+	  /sys/class/net/ibXXX/mode to actually create connections, and then increase
+	  the interface MTU with e.g. ifconfig ib0 mtu 65520.
+
+	  WARNING: Enabling connected mode currently breaks multicast and UD mode
+	  connectivity from this interface unless you limit mtu
+	  for these destinations to 2044.
+
 config INFINIBAND_IPOIB_DEBUG
 	bool "IP-over-InfiniBand debugging" if EMBEDDED
 	depends on INFINIBAND_IPOIB
diff --git a/drivers/infiniband/ulp/ipoib/Makefile b/drivers/infiniband/ulp/ipoib/Makefile
index 8935e74..98ee38e 100644
--- a/drivers/infiniband/ulp/ipoib/Makefile
+++ b/drivers/infiniband/ulp/ipoib/Makefile
@@ -5,5 +5,6 @@ ib_ipoib-y					:= ipoib_main.o \
 						   ipoib_multicast.o \
 						   ipoib_verbs.o \
 						   ipoib_vlan.o
+ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_CM)		+= ipoib_cm.o
 ib_ipoib-$(CONFIG_INFINIBAND_IPOIB_DEBUG)	+= ipoib_fs.o
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 0b8a79d..e410d2b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -62,6 +62,10 @@ enum {
 
 	IPOIB_ENCAP_LEN 	  = 4,
 
+	IPOIB_CM_MTU              = 0x10000 - 0x10, /* padding to align header to 16 */
+	IPOIB_CM_BUF_SIZE         = IPOIB_CM_MTU  + IPOIB_ENCAP_LEN,
+	IPOIB_CM_HEAD_SIZE 	  = IPOIB_CM_BUF_SIZE % PAGE_SIZE,
+	IPOIB_CM_RX_SG            = ALIGN(IPOIB_CM_BUF_SIZE, PAGE_SIZE) / PAGE_SIZE,
 	IPOIB_RX_RING_SIZE 	  = 128,
 	IPOIB_TX_RING_SIZE 	  = 64,
 	IPOIB_MAX_QUEUE_SIZE	  = 8192,
@@ -81,6 +85,8 @@ enum {
 	IPOIB_MCAST_RUN 	  = 6,
 	IPOIB_STOP_REAPER         = 7,
 	IPOIB_MCAST_STARTED       = 8,
+	IPOIB_FLAG_NETIF_STOPPED  = 9,
+	IPOIB_FLAG_ADMIN_CM 	  = 10,
 
 	IPOIB_MAX_BACKOFF_SECONDS = 16,
 
@@ -113,6 +119,58 @@ struct ipoib_tx_buf {
 	DECLARE_PCI_UNMAP_ADDR(mapping)
 };
 
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+struct ib_cm_id;
+
+struct ipoib_cm_data {
+	__be32 qpn; /* High byte MUST be ignored on receive */
+	__be32 mtu;
+};
+
+struct ipoib_cm_rx {
+	struct ib_cm_id     *id;
+	struct ib_qp        *qp;
+	struct list_head     list;
+	struct net_device   *dev;
+};
+
+struct ipoib_cm_tx {
+	struct ib_cm_id     *id;
+	struct ib_cq        *cq;
+	struct ib_qp        *qp;
+	struct list_head     list;
+	struct net_device   *dev;
+	struct ipoib_neigh  *neigh;
+	struct ipoib_path   *path;
+	struct ipoib_tx_buf *tx_ring;
+	unsigned             tx_head;
+	unsigned             tx_tail;
+	unsigned long        flags;
+	u32                  mtu;
+	struct ib_wc         ibwc[IPOIB_NUM_WC];
+};
+
+struct ipoib_cm_rx_buf {
+	struct sk_buff *skb;
+	dma_addr_t mapping[IPOIB_CM_RX_SG];
+};
+
+struct ipoib_cm_dev_priv {
+	struct ib_cq  	       *cq;
+	struct ib_srq  	       *srq;
+	struct ipoib_cm_rx_buf *srq_ring;
+	struct ib_cm_id        *id;
+	struct list_head        passive_ids;
+	struct work_struct      start_task;
+	struct work_struct      reap_task;
+	struct list_head        start_list;
+	struct list_head        reap_list;
+	struct ib_wc            ibwc[IPOIB_NUM_WC];
+	struct ib_sge           rx_sge[IPOIB_CM_RX_SG];
+	struct ib_recv_wr       rx_wr;
+};
+
+#endif
 /*
  * Device private locking: tx_lock protects members used in TX fast
  * path (and we use LLTX so upper layers don't do extra locking).
@@ -179,6 +237,10 @@ struct ipoib_dev_priv {
 	struct list_head child_intfs;
 	struct list_head list;
 
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+	struct ipoib_cm_dev_priv cm;
+#endif
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 	struct list_head fs_list;
 	struct dentry *mcg_dentry;
@@ -212,6 +274,9 @@ struct ipoib_path {
 
 struct ipoib_neigh {
 	struct ipoib_ah    *ah;
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+	struct ipoib_cm_tx *cm;
+#endif
 	union ib_gid        dgid;
 	struct sk_buff_head queue;
 
@@ -315,6 +380,131 @@ int ipoib_vlan_delete(struct net_device *pdev, unsigned short pkey);
 void ipoib_pkey_poll(void *dev);
 int ipoib_pkey_dev_delay_open(struct net_device *dev);
 
+#ifdef CONFIG_INFINIBAND_IPOIB_CM
+
+#define IPOIB_FLAGS_RC          0x80
+#define IPOIB_FLAGS_UC          0x40
+
+#define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC | IPOIB_FLAGS_UC))
+
+static inline int ipoib_cm_admin_enabled(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	return IPOIB_CM_SUPPORTED(dev->dev_addr) &&
+	       	test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
+}
+
+static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	return IPOIB_CM_SUPPORTED(n->ha) &&
+	       	test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
+}
+
+static inline int ipoib_cm_up(struct ipoib_neigh *neigh)
+
+{
+	return test_bit(IPOIB_FLAG_OPER_UP, &neigh->cm->flags);
+}
+
+static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh)
+{
+	return neigh->cm;
+}
+
+static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *tx)
+{
+	neigh->cm = tx;
+}
+
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx);
+int ipoib_cm_dev_open(struct net_device *dev);
+void ipoib_cm_dev_stop(struct net_device *dev);
+int ipoib_cm_dev_init(struct net_device *dev);
+int ipoib_cm_add_mode_attr(struct net_device *dev);
+void ipoib_cm_dev_cleanup(struct net_device *dev);
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				    struct ipoib_neigh *neigh);
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
+#else
+
+struct ipoib_cm_tx;
+
+static inline int ipoib_cm_admin_enabled(struct net_device *dev)
+{
+	return 0;
+}
+static inline int ipoib_cm_enabled(struct net_device *dev, struct neighbour *n)
+
+{
+	return 0;
+}
+
+static inline int ipoib_cm_up(struct ipoib_neigh *neigh)
+
+{
+	return 0;
+}
+
+static inline struct ipoib_cm_tx *ipoib_cm_get(struct ipoib_neigh *neigh)
+{
+	return NULL;
+}
+
+static inline void ipoib_cm_set(struct ipoib_neigh *neigh, struct ipoib_cm_tx *tx)
+{
+}
+
+static inline
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
+{
+	return;
+}
+
+static inline
+int ipoib_cm_dev_open(struct net_device *dev)
+{
+	return 0;
+}
+
+static inline
+void ipoib_cm_dev_stop(struct net_device *dev)
+{
+	return; 
+}
+
+static inline
+int ipoib_cm_dev_init(struct net_device *dev)
+{
+	return 0;
+}
+
+static inline
+void ipoib_cm_dev_cleanup(struct net_device *dev)
+{
+	return;
+}
+
+static inline
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				    struct ipoib_neigh *neigh)
+{
+	return NULL;
+}
+
+static inline
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
+{
+	return;
+}
+
+static inline
+int ipoib_cm_add_mode_attr(struct net_device *dev)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
 void ipoib_create_debug_files(struct net_device *dev);
 void ipoib_delete_debug_files(struct net_device *dev);
@@ -392,4 +582,6 @@ extern int ipoib_debug_level;
 
 #define IPOIB_GID_ARG(gid)	IPOIB_GID_RAW_ARG((gid).raw)
 
+#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
+
 #endif /* _IPOIB_H */
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
new file mode 100644
index 0000000..52dcc10
--- /dev/null
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -0,0 +1,1153 @@
+/*
+ * Copyright (c) 2006 Mellanox Technologies. All rights reserved
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Id$
+ */
+
+#include <rdma/ib_cm.h>
+#include <rdma/ib_cache.h>
+
+#ifdef CONFIG_INFINIBAND_IPOIB_DEBUG_DATA
+static int data_debug_level;
+
+module_param_named(cm_data_debug_level, data_debug_level, int, 0644);
+MODULE_PARM_DESC(cm_data_debug_level,
+		 "Enable data path debug tracing for connected mode if > 0");
+#endif
+
+#include "ipoib.h"
+
+#define IPOIB_CM_IETF_ID 0x1000000000000000ULL
+
+#define	IPOIB_OP_SRQ	(1ul << 30)
+
+struct ipoib_cm_id {
+	struct ib_cm_id *id;
+	int flags;
+	u32 remote_qpn;
+	u32 remote_mtu;
+};
+
+int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+
+static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv,
+				  dma_addr_t mapping[IPOIB_CM_RX_SG])
+{
+	int i;
+
+	dma_unmap_single(priv->ca->dma_device, mapping[0],
+			 IPOIB_CM_HEAD_SIZE, DMA_FROM_DEVICE);
+
+	for (i = 0; i < IPOIB_CM_RX_SG - 1; ++i) {
+		dma_unmap_single(priv->ca->dma_device, mapping[i + 1],
+				 PAGE_SIZE, DMA_FROM_DEVICE);
+	}
+}
+
+static int ipoib_cm_post_receive(struct net_device *dev, int id)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_recv_wr *bad_wr;
+	int i, ret;
+
+	priv->cm.rx_wr.wr_id = id | IPOIB_OP_SRQ;
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
+
+	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
+	if (unlikely(ret)) {
+		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
+		ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[id].mapping);
+		dev_kfree_skb_any(priv->cm.srq_ring[id].skb);
+		priv->cm.srq_ring[id].skb = NULL;
+	}
+
+	return ret;
+}
+
+static int ipoib_cm_alloc_rx_skb(struct net_device *dev, int id,
+				 dma_addr_t mapping[IPOIB_CM_RX_SG])
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct sk_buff *skb;
+	int i;
+
+	skb = dev_alloc_skb(IPOIB_CM_HEAD_SIZE + 12);
+	if (unlikely(!skb))
+		return -ENOMEM;
+
+	/*
+	 * IPoIB adds a 4 byte header. So we need 12 more bytes to align the
+	 * IP header to a multiple of 16.
+	 */
+	skb_reserve(skb, 12);
+
+	mapping[0] = dma_map_single(priv->ca->dma_device, skb->data, IPOIB_CM_HEAD_SIZE,
+				    DMA_FROM_DEVICE);
+	if (unlikely(dma_mapping_error(mapping[0]))) {
+		dev_kfree_skb_any(skb);
+		return -EIO;
+	}
+
+	for (i = 0; i < IPOIB_CM_RX_SG - 1; i++) {
+		struct page *page = alloc_page(GFP_ATOMIC);
+
+		if (!page)
+			goto partial_error;
+		skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE);
+
+		mapping[i + 1] = dma_map_page(priv->ca->dma_device,
+					      skb_shinfo(skb)->frags[i].page,
+					      0, PAGE_SIZE, DMA_FROM_DEVICE);
+		if (unlikely(dma_mapping_error(mapping[i + 1])))
+			goto partial_error;
+	}
+
+	priv->cm.srq_ring[id].skb = skb;
+	return 0;
+
+partial_error:
+
+	dma_unmap_single(priv->ca->dma_device,
+			 mapping[0],
+			 IPOIB_CM_HEAD_SIZE,
+			 DMA_FROM_DEVICE);
+
+	for (; i > 0; --i) {
+		dma_unmap_page(priv->ca->dma_device,
+			       mapping[i],
+			       PAGE_SIZE,
+			       DMA_FROM_DEVICE);
+	}
+	kfree_skb(skb);
+	return -ENOMEM;
+}
+
+static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr attr = {
+		.send_cq = priv->cm.cq, /* does not matter, we never send anything */
+		.recv_cq = priv->cm.cq,
+		.srq = priv->cm.srq,
+		.cap.max_send_wr = 1, /* FIXME: 0 Seems not to work */
+		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type = IB_QPT_RC,
+	};
+	return ib_create_qp(priv->pd, &attr);
+}
+
+static int ipoib_cm_modify_rx_rts(struct net_device *dev,
+				  struct ib_cm_id *cm_id, struct ib_qp *qp)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for INIT: %d\n", ret);
+		return ret;
+	}
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to INIT: %d\n", ret);
+		return ret;
+	}
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret);
+		return ret;
+	}
+	qp_attr.rq_psn = 0 /* FIXME */;
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static int ipoib_cm_send_rep(struct net_device *dev, struct ib_cm_id *cm_id,
+			     struct ib_qp *qp, struct ib_cm_req_event_param *req)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_data data = {};
+	struct ib_cm_rep_param rep = {};
+
+	data.qpn = cpu_to_be32(priv->qp->qp_num);
+	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+
+	rep.private_data = &data;
+	rep.private_data_len = sizeof data;
+	rep.flow_control = 0;
+	rep.rnr_retry_count = req->rnr_retry_count;
+	rep.target_ack_delay = 20; /* FIXME */
+	rep.srq = 1;
+	rep.qp_num = qp->qp_num;
+	rep.starting_psn = 0 /* FIXME */;
+	return ib_send_cm_rep(cm_id, &rep);
+}
+
+static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct net_device *dev = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_rx *p;
+	unsigned long flags;
+	int ret;
+
+	ipoib_dbg(priv, "REQ arrived\n");
+	p = kzalloc(sizeof *p, GFP_KERNEL);
+	if (!p)
+		return -ENOMEM;
+	p->dev = dev;
+	p->id = cm_id;
+	p->qp = ipoib_cm_create_rx_qp(dev);
+	if (IS_ERR(p->qp)) {
+		ret = PTR_ERR(p->qp);
+		goto err_qp;
+	}
+
+	ret = ipoib_cm_modify_rx_rts(dev, cm_id, p->qp);
+	if (ret)
+		goto err_modify;
+
+	ret = ipoib_cm_send_rep(dev, cm_id, p->qp, &event->param.req_rcvd);
+	if (ret) {
+		ipoib_warn(priv, "failed to send REP: %d\n", ret);
+		goto err_rep;
+	}
+
+	cm_id->context = p;
+	spin_lock_irqsave(&priv->lock, flags);
+	list_add(&p->list, &priv->cm.passive_ids);
+	spin_unlock_irqrestore(&priv->lock, flags);
+	return 0;
+
+err_rep:
+err_modify:
+	ib_destroy_qp(p->qp);
+err_qp:
+	kfree(p);
+	return ret;
+}
+
+int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_rx *p;
+	struct ipoib_dev_priv *priv;
+	unsigned long flags;
+	int ret;
+
+	switch (event->event) {
+	case IB_CM_REQ_RECEIVED:
+		return ipoib_cm_req_handler(cm_id, event);
+	case IB_CM_DREQ_RECEIVED:
+		p = cm_id->context;
+		ib_send_cm_drep(cm_id, NULL, 0);
+		/* Fall through */
+	case IB_CM_REJ_RECEIVED:
+		p = cm_id->context;
+		priv = netdev_priv(p->dev);
+		spin_lock_irqsave(&priv->lock, flags);
+		if (list_empty(&p->list))
+			ret = 0; /* Connection is going away already. */
+		else {
+			list_del(&p->list);
+			ret = -ECONNRESET;
+		}
+		spin_unlock_irqrestore(&priv->lock, flags);
+		if (ret) {
+			ib_destroy_qp(p->qp);
+			kfree(p);
+			return ret;
+		}
+		return 0;
+	default:
+		return 0;
+	}
+}
+/* Adjust length of skb with fragments to match received data */
+static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
+			  unsigned int length)
+{
+	int i, num_frags;
+	unsigned int size;
+
+	/* put header into skb */
+	size = min(length, hdr_space);
+	skb->tail += size;
+	skb->len += size;
+	length -= size;
+
+	num_frags = skb_shinfo(skb)->nr_frags;
+	for (i = 0; i < num_frags; i++) {
+		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		if (length == 0) {
+			/* don't need this page */
+			__free_page(frag->page);
+			--skb_shinfo(skb)->nr_frags;
+		} else {
+			size = min(length, (unsigned) PAGE_SIZE);
+
+			frag->size = size;
+			skb->data_len += size;
+			skb->truesize += size;
+			skb->len += size;
+			length -= size;
+		}
+	}
+}
+
+static void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_SRQ;
+	struct sk_buff *skb;
+	dma_addr_t mapping[IPOIB_CM_RX_SG];
+
+	ipoib_dbg_data(priv, "cm recv completion: id %d, op %d, status: %d\n",
+		       wr_id, wc->opcode, wc->status);
+
+	if (unlikely(wr_id >= ipoib_recvq_size)) {
+		ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
+			   wr_id, ipoib_recvq_size);
+		return;
+	}
+
+	skb  = priv->cm.srq_ring[wr_id].skb;
+
+	if (unlikely(wc->status != IB_WC_SUCCESS)) {
+		ipoib_dbg(priv, "cm recv error "
+			   "(status=%d, wrid=%d vend_err %x)\n",
+			   wc->status, wr_id, wc->vendor_err);
+		++priv->stats.rx_dropped;
+		goto repost;
+	}
+
+	if (unlikely(ipoib_cm_alloc_rx_skb(dev, wr_id, mapping))) {
+		/*
+		 * If we can't allocate a new RX buffer, dump
+		 * this packet and reuse the old buffer.
+		 */
+		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
+		++priv->stats.rx_dropped;
+		goto repost;
+	}
+
+	ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[wr_id].mapping);
+	memcpy(priv->cm.srq_ring[wr_id].mapping, mapping, sizeof mapping);
+
+	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
+		       wc->byte_len, wc->slid);
+
+	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len);
+
+	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
+	skb->mac.raw = skb->data;
+	skb_pull(skb, IPOIB_ENCAP_LEN);
+
+	dev->last_rx = jiffies;
+	++priv->stats.rx_packets;
+	priv->stats.rx_bytes += skb->len;
+
+	skb->dev = dev;
+	/* XXX get correct PACKET_ type here */
+	skb->pkt_type = PACKET_HOST;
+	netif_rx_ni(skb);
+
+repost:
+	if (unlikely(ipoib_cm_post_receive(dev, wr_id)))
+		ipoib_warn(priv, "ipoib_cm_post_receive failed "
+			   "for buf %d\n", wr_id);
+}
+
+void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr)
+{
+	struct net_device *dev = (struct net_device *) dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int n, i;
+
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	do {
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, priv->cm.ibwc);
+		for (i = 0; i < n; ++i)
+			ipoib_cm_handle_rx_wc(dev, priv->cm.ibwc + i);
+	} while (n == IPOIB_NUM_WC);
+}
+
+static inline int post_send(struct ipoib_dev_priv *priv,
+			    struct ipoib_cm_tx *tx,
+			    unsigned int wr_id,
+			    dma_addr_t addr, int len)
+{
+	struct ib_send_wr *bad_wr;
+
+	priv->tx_sge.addr             = addr;
+	priv->tx_sge.length           = len;
+
+	priv->tx_wr.wr_id 	      = wr_id;
+
+	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+}
+
+void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_tx_buf *tx_req;
+	dma_addr_t addr;
+
+	if (unlikely(skb->len > tx->mtu)) {
+		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
+			   skb->len, tx->mtu);
+		++priv->stats.tx_dropped;
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
+	ipoib_dbg_data(priv, "sending packet %p, head %d length=%d connection=%p\n",
+		       skb, tx->tx_head, skb->len, tx);
+
+	/*
+	 * We put the skb into the tx_ring _before_ we call post_send()
+	 * because it's entirely possible that the completion handler will
+	 * run before we execute anything after the post_send().  That
+	 * means we have to make sure everything is properly recorded and
+	 * our state is consistent before we call post_send().
+	 */
+	tx_req = &tx->tx_ring[tx->tx_head & (ipoib_sendq_size - 1)];
+	tx_req->skb = skb;
+	addr = dma_map_single(priv->ca->dma_device, skb->data, skb->len,
+			      DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(addr))) {
+		++priv->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return;
+	}
+	pci_unmap_addr_set(tx_req, mapping, addr);
+
+	if (unlikely(post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1),
+			        addr, skb->len))) {
+		ipoib_warn(priv, "post_send failed\n");
+		++priv->stats.tx_errors;
+		dma_unmap_single(priv->ca->dma_device, addr, skb->len,
+				 DMA_TO_DEVICE);
+		dev_kfree_skb_any(skb);
+	} else {
+		dev->trans_start = jiffies;
+		++tx->tx_head;
+
+		if (tx->tx_head - tx->tx_tail == ipoib_sendq_size) {
+			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
+			netif_stop_queue(dev);
+			set_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags);
+		}
+	}
+}
+
+static void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ipoib_cm_tx *tx,
+				  struct ib_wc *wc)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned int wr_id = wc->wr_id;
+	struct ipoib_tx_buf *tx_req;
+	unsigned long flags;
+
+	ipoib_dbg_data(priv, "cm send completion: id %d, op %d, status: %d\n",
+		       wr_id, wc->opcode, wc->status);
+
+	if (unlikely(wr_id >= ipoib_sendq_size)) {
+		ipoib_warn(priv, "cm send completion event with wrid %d (> %d)\n",
+			   wr_id, ipoib_sendq_size);
+		return;
+	}
+
+	tx_req = &tx->tx_ring[wr_id];
+
+	dma_unmap_single(priv->ca->dma_device,
+			 pci_unmap_addr(tx_req, mapping),
+			 tx_req->skb->len,
+			 DMA_TO_DEVICE);
+
+	/* FIXME: is this right? Shouldn't we only increment on success? */
+	++priv->stats.tx_packets;
+	priv->stats.tx_bytes += tx_req->skb->len;
+
+	dev_kfree_skb_any(tx_req->skb);
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	++tx->tx_tail;
+	if (tx->tx_head - tx->tx_tail <= ipoib_sendq_size >> 1 &&
+	    test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &tx->flags)) {
+		netif_wake_queue(dev);
+	}
+
+	if (wc->status != IB_WC_SUCCESS &&
+	    wc->status != IB_WC_WR_FLUSH_ERR) {
+		struct ipoib_neigh *neigh;
+
+		ipoib_dbg(priv, "failed cm send event "
+			   "(status=%d, wrid=%d vend_err %x)\n",
+			   wc->status, wr_id, wc->vendor_err);
+
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+
+		if (neigh) {
+			neigh->cm = NULL;
+			list_del(&neigh->list);
+			if (neigh->ah)
+				ipoib_put_ah(neigh->ah);
+			ipoib_neigh_free(neigh);
+
+			tx->neigh = NULL;
+		}
+		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+			list_move(&tx->list, &priv->cm.reap_list);
+			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		}
+
+		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
+
+		spin_unlock(&priv->lock);
+	}
+
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
+{
+	struct ipoib_cm_tx *tx = tx_ptr;
+	int n, i;
+
+	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
+	do {
+		n = ib_poll_cq(cq, IPOIB_NUM_WC, tx->ibwc);
+		for (i = 0; i < n; ++i)
+			ipoib_cm_handle_tx_wc(tx->dev, tx, tx->ibwc + i);
+	} while (n == IPOIB_NUM_WC);
+}
+
+int ipoib_cm_dev_open(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int ret;
+
+	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
+		return 0;
+
+	priv->cm.cq = ib_create_cq(priv->ca, ipoib_cm_rx_completion, NULL, dev,
+				   ipoib_recvq_size + 1);
+	if (IS_ERR(priv->cm.cq)) {
+		printk(KERN_WARNING "%s: failed to create CQ\n", priv->ca->name);
+		return PTR_ERR(priv->cm.cq);
+	}
+
+	ib_req_notify_cq(priv->cm.cq, IB_CQ_NEXT_COMP);
+
+	priv->cm.id = ib_create_cm_id(priv->ca, ipoib_cm_rx_handler, dev);
+	if (IS_ERR(priv->cm.id)) {
+		printk(KERN_WARNING "%s: failed to create CM ID\n", priv->ca->name);
+		ib_destroy_cq(priv->cm.cq);
+		return PTR_ERR(priv->cm.id);
+	}
+
+	ret = ib_cm_listen(priv->cm.id, cpu_to_be64(IPOIB_CM_IETF_ID | priv->qp->qp_num),
+			   0, NULL);
+	if (ret) {
+		printk(KERN_WARNING "%s: failed to listen on ID 0x%llx\n", priv->ca->name,
+		       IPOIB_CM_IETF_ID | priv->qp->qp_num);
+		ib_destroy_cm_id(priv->cm.id);
+		ib_destroy_cq(priv->cm.cq);
+		return ret;
+	}
+	return 0;
+}
+
+void ipoib_cm_dev_stop(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_rx *p;
+	unsigned long flags;
+
+	if (!IPOIB_CM_SUPPORTED(dev->dev_addr))
+		return;
+
+	ib_destroy_cm_id(priv->cm.id);
+	spin_lock_irqsave(&priv->lock, flags);
+	while (!list_empty(&priv->cm.passive_ids)) {
+		p = list_entry(priv->cm.passive_ids.next, typeof(*p), list);
+		list_del_init(&p->list);
+		spin_unlock_irqrestore(&priv->lock, flags);
+		ib_destroy_cm_id(p->id);
+		ib_destroy_qp(p->qp);
+		kfree(p);
+		spin_lock_irqsave(&priv->lock, flags);
+	}
+	spin_unlock_irqrestore(&priv->lock, flags);
+	ib_destroy_cq(priv->cm.cq);
+}
+
+static int ipoib_cm_rep_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_tx *p = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	struct ipoib_cm_data *data = event->private_data;
+	struct sk_buff_head skqueue;
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+	struct sk_buff *skb;
+	unsigned long flags;
+
+	p->mtu = be32_to_cpu(data->mtu);
+
+	if (p->mtu < priv->dev->mtu + IPOIB_ENCAP_LEN) {
+		ipoib_warn(priv, "Rejecting connection: mtu %d < device mtu %d + 4\n",
+			   p->mtu, priv->dev->mtu);
+		return -EINVAL;
+	}
+
+	qp_attr.qp_state = IB_QPS_RTR;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTR: %d\n", ret);
+		return ret;
+	}
+
+	qp_attr.rq_psn = 0 /* FIXME */;
+	ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTR: %d\n", ret);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_RTS;
+	ret = ib_cm_init_qp_attr(cm_id, &qp_attr, &qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to init QP attr for RTS: %d\n", ret);
+		return ret;
+	}
+	ret = ib_modify_qp(p->qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify QP to RTS: %d\n", ret);
+		return ret;
+	}
+
+	skb_queue_head_init(&skqueue);
+
+	spin_lock_irqsave(&priv->lock, flags);
+	set_bit(IPOIB_FLAG_OPER_UP, &p->flags);
+	if (p->neigh)
+		while ((skb = __skb_dequeue(&p->neigh->queue)))
+			__skb_queue_tail(&skqueue, skb);
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	while ((skb = __skb_dequeue(&skqueue))) {
+		skb->dev = p->dev;
+		if (dev_queue_xmit(skb))
+			ipoib_warn(priv, "dev_queue_xmit failed "
+				   "to requeue packet\n");
+	}
+
+	ret = ib_send_cm_rtu(cm_id, NULL, 0);
+	if (ret) {
+		ipoib_warn(priv, "failed to send RTU: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ib_cq *cq)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr attr = {};
+	attr.recv_cq = priv->cm.cq;
+	attr.srq = priv->cm.srq;
+	attr.cap.max_send_wr = ipoib_sendq_size;
+	attr.cap.max_send_sge = 1;
+	attr.sq_sig_type = IB_SIGNAL_ALL_WR;
+	attr.qp_type = IB_QPT_RC;
+	attr.send_cq = cq;
+	return ib_create_qp(priv->pd, &attr);
+}
+
+static int ipoib_cm_send_req(struct net_device *dev,
+			     struct ib_cm_id *id, struct ib_qp *qp,
+			     u32 qpn,
+			     struct ib_sa_path_rec *pathrec)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_data data = {};
+	struct ib_cm_req_param req = {};
+
+	data.qpn = cpu_to_be32(priv->qp->qp_num);
+	data.mtu = cpu_to_be32(IPOIB_CM_BUF_SIZE);
+
+	req.primary_path 	      = pathrec;
+	req.alternate_path 	      = NULL;
+	req.service_id                = cpu_to_be64(IPOIB_CM_IETF_ID | qpn);
+	req.qp_num 		      = qp->qp_num;
+	req.qp_type 		      = qp->qp_type;
+	req.private_data 	      = &data;
+	req.private_data_len 	      = sizeof data;
+	req.flow_control 	      = 0;
+
+	req.starting_psn              = 0; /* FIXME */
+
+	/*
+	 * Pick some arbitrary defaults here; we could make these
+	 * module parameters if anyone cared about setting them.
+	 */
+	req.responder_resources	      = 4;
+	req.remote_cm_response_timeout = 20;
+	req.local_cm_response_timeout  = 20;
+	req.retry_count 	      = 0; /* RFC draft warns against retries */
+	req.rnr_retry_count 	      = 0; /* RFC draft warns against retries */
+	req.max_cm_retries 	      = 15;
+	req.srq 	              = 1;
+	return ib_send_cm_req(id, &req);
+}
+
+static int ipoib_cm_modify_tx_init(struct net_device *dev,
+				  struct ib_cm_id *cm_id, struct ib_qp *qp)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int qp_attr_mask, ret;
+	ret = ib_find_cached_pkey(priv->ca, priv->port, priv->pkey, &qp_attr.pkey_index);
+	if (ret) {
+		ipoib_warn(priv, "pkey 0x%x not in cache: %d\n", priv->pkey, ret);
+		return ret;
+	}
+
+	qp_attr.qp_state = IB_QPS_INIT;
+	qp_attr.qp_access_flags = IB_ACCESS_LOCAL_WRITE;
+	qp_attr.port_num = priv->port;
+	qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT;
+
+	ret = ib_modify_qp(qp, &qp_attr, qp_attr_mask);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify tx QP to INIT: %d\n", ret);
+		return ret;
+	}
+	return 0;
+}
+
+int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	int ret;
+
+	ipoib_dbg(priv, "Request connection %p for gid " IPOIB_GID_FMT " qpn 0x%x\n",
+		  p, IPOIB_GID_ARG(pathrec->dgid), qpn);
+
+	p->tx_ring = kzalloc(ipoib_sendq_size * sizeof *p->tx_ring,
+				GFP_KERNEL);
+	if (!p->tx_ring) {
+		ipoib_warn(priv, "failed to allocate tx ring\n");
+		ret = -ENOMEM;
+		goto err_tx;
+	}
+
+	p->cq = ib_create_cq(priv->ca, ipoib_cm_tx_completion, NULL, p,
+			     ipoib_sendq_size + 1);
+	if (IS_ERR(p->cq)) {
+		ret = PTR_ERR(p->cq);
+		ipoib_warn(priv, "failed to allocate tx cq: %d\n", ret);
+		goto err_cq;
+	}
+
+	ret = ib_req_notify_cq(p->cq, IB_CQ_NEXT_COMP);
+	if (ret) {
+		ipoib_warn(priv, "failed to request completion notification: %d\n", ret);
+		goto err_req_notify;
+	}
+
+	p->qp = ipoib_cm_create_tx_qp(p->dev, p->cq);
+	if (IS_ERR(p->qp)) {
+		ret = PTR_ERR(p->qp);
+		ipoib_warn(priv, "failed to allocate tx qp: %d\n", ret);
+		goto err_qp;
+	}
+
+	p->id = ib_create_cm_id(priv->ca, ipoib_cm_tx_handler, p);
+	if (IS_ERR(p->id)) {
+		ret = PTR_ERR(p->id);
+		ipoib_warn(priv, "failed to create tx cm id: %d\n", ret);
+		goto err_id;
+	}
+
+	ret = ipoib_cm_modify_tx_init(p->dev, p->id,  p->qp);
+	if (ret) {
+		ipoib_warn(priv, "failed to modify tx qp to INIT: %d\n", ret);
+		goto err_modify;
+	}
+
+	ret = ipoib_cm_send_req(p->dev, p->id, p->qp, qpn, pathrec);
+	if (ret) {
+		ipoib_warn(priv, "failed to send cm req: %d\n", ret);
+		goto err_send_cm;
+	}
+	return 0;
+
+err_send_cm:
+err_modify:
+	ib_destroy_cm_id(p->id);
+err_id:
+	p->id = NULL;
+	ib_destroy_qp(p->qp);
+err_req_notify:
+err_qp:
+	p->qp = NULL;
+	ib_destroy_cq(p->cq);
+err_cq:
+	p->cq = NULL;
+err_tx:
+	return ret;
+}
+
+void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
+	struct ipoib_tx_buf *tx_req;
+
+	ipoib_dbg(priv, "Destroy active connection %p. head 0x%x tail 0x%x\n",
+		  p, p->tx_head, p->tx_tail);
+
+	if (p->id)
+		ib_destroy_cm_id(p->id);
+
+	if (p->qp)
+		ib_destroy_qp(p->qp);
+
+	if (p->cq)
+		ib_destroy_cq(p->cq);
+
+	if (p->tx_ring) {
+		while ((int) p->tx_tail - (int) p->tx_head < 0) {
+			tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
+			dma_unmap_single(priv->ca->dma_device,
+					 pci_unmap_addr(tx_req, mapping),
+					 tx_req->skb->len,
+					 DMA_TO_DEVICE);
+			dev_kfree_skb_any(tx_req->skb);
+			++p->tx_tail;
+		}
+
+		kfree(p->tx_ring);
+	}
+
+	kfree(p);
+}
+
+int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+{
+	struct ipoib_cm_tx *tx = cm_id->context;
+	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
+	struct ipoib_neigh *neigh;
+	unsigned long flags;
+	int ret;
+
+	switch (event->event) {
+	case IB_CM_DREQ_RECEIVED:
+		ipoib_dbg(priv, "DREQ received.\n");
+		ib_send_cm_drep(cm_id, NULL, 0);
+		break;
+	case IB_CM_REP_RECEIVED:
+		ipoib_dbg(priv, "REP received.\n");
+		ret = ipoib_cm_rep_handler(cm_id, event);
+		if (ret)
+			ib_send_cm_rej(cm_id, IB_CM_REJ_CONSUMER_DEFINED,
+				       NULL, 0, NULL, 0);
+		break;
+	case IB_CM_REQ_ERROR:
+	case IB_CM_REJ_RECEIVED:
+	case IB_CM_TIMEWAIT_EXIT:
+		ipoib_dbg(priv, "CM error %d.\n", event->event);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+		neigh = tx->neigh;
+
+		if (neigh) {
+			neigh->cm = NULL;
+			list_del(&neigh->list);
+			if (neigh->ah)
+				ipoib_put_ah(neigh->ah);
+			ipoib_neigh_free(neigh);
+
+			tx->neigh = NULL;
+		}
+
+		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+			list_move(&tx->list, &priv->cm.reap_list);
+			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		}
+
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+				       struct ipoib_neigh *neigh)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_tx *tx;
+
+	tx = kzalloc(sizeof *tx, GFP_ATOMIC);
+	if (!tx)
+		return NULL;
+
+	neigh->cm = tx;
+	tx->neigh = neigh;
+	tx->path = path;
+	tx->dev = dev;
+	list_add(&tx->list, &priv->cm.start_list);
+	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
+	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	return tx;
+}
+
+void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
+	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
+		list_move(&tx->list, &priv->cm.reap_list);
+		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		ipoib_dbg(priv, "Reap connection for gid " IPOIB_GID_FMT "\n",
+			  IPOIB_GID_ARG(tx->neigh->dgid));
+		tx->neigh = NULL;
+	}
+}
+
+void ipoib_cm_tx_start(void *dev_ptr)
+{
+	struct net_device *dev = dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_neigh *neigh;
+	struct ipoib_cm_tx *p;
+	unsigned long flags;
+	int ret;
+
+	struct ib_sa_path_rec pathrec;
+	u32 qpn;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
+	while (!list_empty(&priv->cm.start_list)) {
+		p = list_entry(priv->cm.start_list.next, typeof(*p), list);
+		list_del_init(&p->list);
+		neigh = p->neigh;
+		qpn = IPOIB_QPN(neigh->neighbour->ha);
+		memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		ret = ipoib_cm_tx_init(p, qpn, &pathrec);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+		if (ret) {
+			neigh = p->neigh;
+			if (neigh) {
+				neigh->cm = NULL;
+				list_del(&neigh->list);
+				if (neigh->ah)
+					ipoib_put_ah(neigh->ah);
+				ipoib_neigh_free(neigh);
+			}
+			list_del(&p->list);
+			kfree(p);
+		}
+	}
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+void ipoib_cm_tx_reap(void *dev_ptr)
+{
+	struct net_device *dev = dev_ptr;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cm_tx *p;
+	unsigned long flags;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	spin_lock(&priv->lock);
+	while (!list_empty(&priv->cm.reap_list)) {
+		p = list_entry(priv->cm.reap_list.next, typeof(*p), list);
+		list_del(&p->list);
+		spin_unlock(&priv->lock);
+		spin_unlock_irqrestore(&priv->tx_lock, flags);
+		ipoib_cm_tx_destroy(p);
+		spin_lock_irqsave(&priv->tx_lock, flags);
+		spin_lock(&priv->lock);
+	}
+	spin_unlock(&priv->lock);
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+}
+
+static ssize_t show_mode(struct class_device *cdev, char *buf)
+{
+	struct net_device *dev = container_of(cdev, struct net_device, class_dev);
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (test_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags))
+		return sprintf(buf, "connected\n");
+	else
+		return sprintf(buf, "datagram\n");
+}
+
+static ssize_t set_mode(struct class_device *cdev,
+			const char *buf, size_t count)
+{
+	struct net_device *dev = container_of(cdev, struct net_device, class_dev);
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	/* flush paths if we switch modes so that connections are restarted */
+	if (IPOIB_CM_SUPPORTED(dev->dev_addr) && !strcmp(buf, "connected\n")) {
+		set_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
+		ipoib_warn(priv, "enabling connected mode breaks multicast!\n");
+		ipoib_flush_paths(dev);
+		return count;
+	}
+
+	if (!strcmp(buf, "datagram\n")) {
+		clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
+		ipoib_flush_paths(dev);
+		return count;
+	}
+
+	return -EINVAL;
+}
+
+static CLASS_DEVICE_ATTR(mode, S_IWUSR | S_IRUGO, show_mode, set_mode);
+
+int ipoib_cm_add_mode_attr(struct net_device *dev)
+{
+	return class_device_create_file(&dev->class_dev, &class_device_attr_mode);
+}
+
+int ipoib_cm_dev_init(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_srq_init_attr srq_init_attr = {
+		.attr = {
+			.max_wr  = ipoib_recvq_size,
+			.max_sge = IPOIB_CM_RX_SG
+		}
+	};
+	int ret, i;
+
+	INIT_LIST_HEAD(&priv->cm.passive_ids);
+	INIT_LIST_HEAD(&priv->cm.reap_list);
+	INIT_LIST_HEAD(&priv->cm.start_list);
+	INIT_WORK(&priv->cm.start_task, ipoib_cm_tx_start, dev);
+	INIT_WORK(&priv->cm.reap_task, ipoib_cm_tx_reap, dev);
+
+	priv->cm.srq = ib_create_srq(priv->pd, &srq_init_attr);
+	if (IS_ERR(priv->cm.srq)) {
+		ret = PTR_ERR(priv->cm.srq);
+		priv->cm.srq = NULL;
+		return ret;
+	}
+
+	priv->cm.srq_ring = kzalloc(ipoib_recvq_size * sizeof *priv->cm.srq_ring,
+				    GFP_KERNEL);
+	if (!priv->cm.srq_ring) {
+		printk(KERN_WARNING "%s: failed to allocate CM ring (%d entries)\n",
+		       priv->ca->name, ipoib_recvq_size);
+		ipoib_cm_dev_cleanup(dev);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].lkey	= priv->mr->lkey;
+
+	priv->cm.rx_sge[0].length = IPOIB_CM_HEAD_SIZE;
+	for (i = 1; i < IPOIB_CM_RX_SG; ++i)
+		priv->cm.rx_sge[i].length = PAGE_SIZE;
+	priv->cm.rx_wr.next = NULL;
+	priv->cm.rx_wr.sg_list = priv->cm.rx_sge;
+	priv->cm.rx_wr.num_sge = IPOIB_CM_RX_SG;
+
+	for (i = 0; i < ipoib_recvq_size; ++i) {
+		if (ipoib_cm_alloc_rx_skb(dev, i, priv->cm.srq_ring[i].mapping)) {
+			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			return -ENOMEM;
+		}
+		if (ipoib_cm_post_receive(dev, i)) {
+			ipoib_warn(priv, "ipoib_cm_post_receive failed for buf %d\n", i);
+			ipoib_cm_dev_cleanup(dev);
+			return -EIO;
+		}
+	}
+
+	priv->dev->dev_addr[0] = IPOIB_FLAGS_RC;
+	return 0;
+}
+
+void ipoib_cm_dev_cleanup(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, ret;
+
+	if (!priv->cm.srq)
+		return;
+
+	ipoib_dbg(priv, "Cleanup ipoib connected mode.\n");
+
+	ret = ib_destroy_srq(priv->cm.srq);
+	if (ret)
+		ipoib_warn(priv, "ib_destroy_srq failed: %d\n", ret);
+
+	priv->cm.srq = NULL;
+	if (!priv->cm.srq_ring)
+		return;
+	for (i = 0; i < ipoib_recvq_size; ++i)
+		if (priv->cm.srq_ring[i].skb) {
+			ipoib_cm_dma_unmap_rx(priv, priv->cm.srq_ring[i].mapping);
+			dev_kfree_skb_any(priv->cm.srq_ring[i].skb);
+			priv->cm.srq_ring[i].skb = NULL;
+		}
+	kfree(priv->cm.srq_ring);
+	priv->cm.srq_ring = NULL;
+}
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 8bf5e9e..2372cfc 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -273,10 +273,10 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	spin_lock_irqsave(&priv->tx_lock, flags);
 	++priv->tx_tail;
-	if (netif_queue_stopped(dev) &&
-	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags) &&
-	    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
+	if (priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1 &&
+	    test_and_clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags)) {
 		netif_wake_queue(dev);
+	}
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 	if (wc->status != IB_WC_SUCCESS &&
@@ -378,6 +378,7 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		if (priv->tx_head - priv->tx_tail == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
 			netif_stop_queue(dev);
+			set_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
 		}
 	}
 }
@@ -429,6 +430,13 @@ int ipoib_ib_dev_open(struct net_device *dev)
 		return -1;
 	}
 
+	ret = ipoib_cm_dev_open(dev);
+	if (ret) {
+		ipoib_warn(priv, "ipoib_cm_dev_open returned %d\n", ret);
+		ipoib_ib_dev_stop(dev);
+		return -1;
+	}
+
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
 	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task, HZ);
 
@@ -514,6 +522,8 @@ int ipoib_ib_dev_stop(struct net_device *dev)
 
 	clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags);
 
+	ipoib_cm_dev_stop(dev);
+
 	/*
 	 * Move our QP to the error state and then reinitialize in
 	 * when all work requests have completed or have been flushed.
@@ -603,6 +613,8 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		return -ENODEV;
 	}
 
+	ipoib_cm_dev_init(dev);
+
 	if (dev->flags & IFF_UP) {
 		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
@@ -659,6 +671,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev, 1);
 	ipoib_mcast_dev_flush(dev);
 
+	ipoib_cm_dev_cleanup(dev);
 	ipoib_transport_dev_cleanup(dev);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 85522da..5319ac1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -49,8 +49,6 @@
 
 #include <net/dst.h>
 
-#define IPOIB_QPN(ha) (be32_to_cpup((__be32 *) ha) & 0xffffff)
-
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("IP-over-InfiniBand net driver");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -145,6 +143,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
+	clear_bit(IPOIB_FLAG_NETIF_STOPPED, &priv->flags);
+
 	/*
 	 * Now flush workqueue to make sure a scheduled task doesn't
 	 * bring our internal state back up.
@@ -178,8 +178,17 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN)
+	/* dev->mtu > 2K ==> connected mode */
+	if (ipoib_cm_admin_enabled(dev) && new_mtu <= IPOIB_CM_MTU) {
+		if (new_mtu > priv->mcast_mtu)
+			ipoib_warn(priv, "mtu > %d breaks multicast!\n", priv->mcast_mtu);
+		dev->mtu = new_mtu;
+		return 0;
+	}
+
+	if (new_mtu > IPOIB_PACKET_SIZE - IPOIB_ENCAP_LEN) {
 		return -EINVAL;
+	}
 
 	priv->admin_mtu = new_mtu;
 
@@ -414,6 +423,20 @@ static void path_rec_completion(int status,
 			memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 			       sizeof(union ib_gid));
 
+			if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+				if (!ipoib_cm_get(neigh))
+					ipoib_cm_set(neigh, ipoib_cm_create_tx(dev,
+									       path,
+									       neigh));
+				if (!ipoib_cm_get(neigh)) {
+					list_del(&neigh->list);
+					if (neigh->ah)
+						ipoib_put_ah(neigh->ah);
+					ipoib_neigh_free(neigh);
+					continue;
+				}
+			}
+
 			while ((skb = __skb_dequeue(&neigh->queue)))
 				__skb_queue_tail(&skqueue, skb);
 		}
@@ -522,7 +545,25 @@ static void neigh_add_path(struct sk_buff *skb, struct net_device *dev)
 		memcpy(&neigh->dgid.raw, &path->pathrec.dgid.raw,
 		       sizeof(union ib_gid));
 
-		ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
+		if (ipoib_cm_enabled(dev, neigh->neighbour)) {
+			if (!ipoib_cm_get(neigh))
+				ipoib_cm_set(neigh, ipoib_cm_create_tx(dev, path, neigh));
+			if (!ipoib_cm_get(neigh)) {
+				list_del(&neigh->list);
+				if (neigh->ah)
+					ipoib_put_ah(neigh->ah);
+				ipoib_neigh_free(neigh);
+				goto err_drop;
+			}
+			if (skb_queue_len(&neigh->queue) < IPOIB_MAX_PATH_REC_QUEUE)
+				__skb_queue_tail(&neigh->queue, skb);
+			else {
+				ipoib_warn(priv, "queue length limit %d. Packet drop.\n",
+					   skb_queue_len(&neigh->queue));
+				goto err_drop;
+			}
+		} else
+			ipoib_send(dev, skb, path->ah, IPOIB_QPN(skb->dst->neighbour->ha));
 	} else {
 		neigh->ah  = NULL;
 		__skb_queue_tail(&neigh->queue, skb);
@@ -539,6 +580,7 @@ err_list:
 
 err_path:
 	ipoib_neigh_free(neigh);
+err_drop:
 	++priv->stats.tx_dropped;
 	dev_kfree_skb_any(skb);
 
@@ -641,7 +683,12 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 		neigh = *to_ipoib_neigh(skb->dst->neighbour);
 
-		if (likely(neigh->ah)) {
+		if (ipoib_cm_get(neigh)) {
+			if (ipoib_cm_up(neigh)) {
+				ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
+				goto out;
+			}
+		} else if (neigh->ah) {
 			if (unlikely(memcmp(&neigh->dgid.raw,
 					    skb->dst->neighbour->ha + 4,
 					    sizeof(union ib_gid)))) {
@@ -805,6 +852,7 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 
 	neigh->neighbour = neighbour;
 	*to_ipoib_neigh(neighbour) = neigh;
+	ipoib_cm_set(neigh, NULL);
 
 	return neigh;
 }
@@ -812,6 +860,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neighbour)
 void ipoib_neigh_free(struct ipoib_neigh *neigh)
 {
 	*to_ipoib_neigh(neigh->neighbour) = NULL;
+	if (ipoib_cm_get(neigh))
+		ipoib_cm_destroy_tx(ipoib_cm_get(neigh));
 	kfree(neigh);
 }
 
@@ -1075,6 +1125,8 @@ static struct net_device *ipoib_add_port(const char *format,
 
 	ipoib_create_debug_files(priv->dev);
 
+	if (ipoib_cm_add_mode_attr(priv->dev))
+		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
 	if (class_device_create_file(&priv->dev->class_dev,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 3faa182..ea387b3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -594,7 +594,9 @@ void ipoib_mcast_join_task(void *dev_ptr)
 
 	priv->mcast_mtu = ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu) -
 		IPOIB_ENCAP_LEN;
-	dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
+
+	if (!ipoib_cm_admin_enabled(dev))
+		dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index f887780..d9fd82d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -115,6 +115,8 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 
 	ipoib_create_debug_files(priv->dev);
 
+	if (ipoib_cm_add_mode_attr(priv->dev))
+		goto sysfs_failed;
 	if (ipoib_add_pkey_attr(priv->dev))
 		goto sysfs_failed;
 
-- 
MST


From halr at voltaire.com  Sun Dec 10 08:10:42 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Dec 2006 11:10:42 -0500
Subject: [openib-general] [PATCH] osm: trivial osm_log missmatch on
	vendor mlx
In-Reply-To: <457B1815.7000404@mellanox.co.il>
References: <457B1815.7000404@mellanox.co.il>
Message-ID: <1165767007.26559.111479.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 15:09, Eitan Zahavi wrote:
> Hi Hal
> 
> This patch fixes some osm_log issues on the mlx vendor.
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> 
> ---
>  osm/libvendor/osm_vendor_mlx_dispatcher.c |    3 ++-
>  osm/libvendor/osm_vendor_mlx_txn.c        |    2 +-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/osm/libvendor/osm_vendor_mlx_dispatcher.c 
> b/osm/libvendor/osm_vendor_mlx_dispatcher.c
> index e8b47dd..7e3bd78 100644
> --- a/osm/libvendor/osm_vendor_mlx_dispatcher.c
> +++ b/osm/libvendor/osm_vendor_mlx_dispatcher.c
> @@ -134,7 +134,8 @@ osmv_dispatch_mad(IN osm_bind_handle_t
>    {
> 
>      osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
> -            "The bind handle %p is being closed. The MAD will not be 
> dispatched.\n");

This line is wrapped.

> +            "The bind handle %p is being closed. "
> +            "The MAD will not be dispatched.\n", p_bo);
> 
>      ret = IB_INTERRUPTED;
>      goto dispatch_mad_done;
> diff --git a/osm/libvendor/osm_vendor_mlx_txn.c 
> b/osm/libvendor/osm_vendor_mlx_txn.c
> index 1fd262f..234e33b 100644
> --- a/osm/libvendor/osm_vendor_mlx_txn.c
> +++ b/osm/libvendor/osm_vendor_mlx_txn.c
> @@ -631,7 +631,7 @@ __osmv_txn_timeout_cb(IN uint64_t key,
> 
>          osm_log(p_bo->p_vendor->p_log, OSM_LOG_DEBUG,
>                  "__osmv_txn_timeout_cb: "
> -                "Retry request timout in : %u [msec].\n",
> +                "Retry request timout in : %lu [msec].\n",
>                  next_timeout_ms);
>        }
>      }
> --
> 1.4.4.1.GIT

Thanks. Applied with osm_vendor_mlx_dispatcher.c done by hand.
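
For reference, both hunks above fix printf-style mismatches: a %p with no
matching argument, and %u used where the fix suggests the value is an
unsigned long. Below is a minimal sketch of how a compiler can flag this
class of bug, assuming GCC or Clang. example_log() and
example_format_timeout() are hypothetical names for illustration and are
not the real osm_log() declaration.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper macro: enables printf-style checking where supported. */
#if defined(__GNUC__)
#define EXAMPLE_PRINTF_LIKE(fmt, args) __attribute__((format(printf, fmt, args)))
#else
#define EXAMPLE_PRINTF_LIKE(fmt, args)
#endif

/* Hypothetical logger: the format string is parameter 3, varargs start at 4. */
static int example_log(char *out, size_t len, const char *fmt, ...)
	EXAMPLE_PRINTF_LIKE(3, 4);

static int example_log(char *out, size_t len, const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(out, len, fmt, ap);
	va_end(ap);
	return n;
}

/*
 * With the attribute in place, -Wformat warns if "%u" is used here for an
 * unsigned long, or if a conversion has no matching argument.
 */
int example_format_timeout(char *out, size_t len, unsigned long next_timeout_ms)
{
	return example_log(out, len, "Retry request timeout in : %lu [msec].",
			   next_timeout_ms);
}
```

Building with -Wall (which includes -Wformat) is enough to get the warning
once a logging function carries the attribute.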

-- Hal


From halr at voltaire.com  Sun Dec 10 08:20:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 10 Dec 2006 11:20:55 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210064346.GC10403@mellanox.co.il>
References: <1165701912.26559.65050.camel@hal.voltaire.com>
	<20061210064346.GC10403@mellanox.co.il>
Message-ID: <1165767642.26559.111898.camel@hal.voltaire.com>

On Sun, 2006-12-10 at 01:43, Michael S. Tsirkin wrote:
> > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > automatically apply the patches from emails to the local pre-trunk
> > > > tree? Or do you think it is insufficient?
> > >
> > > This sounds like a fragile process. It seems much easier to just
> > > have an unstable branch with untested patches. No?
> >
> > Untested is an overexaggeration. They are tested but not by Eitan's
> > regression.
> 
> Sorry, I'm not trying to influence any policy decisions here, I'm coming
> purely from git angle. *If* you want Eitan to test and Ack some patches,
> *and want to automate the testing part*, the simplest thing to do is
> to apply them on some git branch.

Couldn't he also back off the head on the "trunk" if that doesn't work
too ? That (which version) could be taken as input to the automatic
regression with less overhead than another branch to have to track or
figuring out how to apply patches automagically.

-- Hal

> --
> MST
> 
> 


From rdreier at cisco.com  Sun Dec 10 09:34:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 09:34:18 -0800
Subject: [openib-general] version #defines for the kernel
References: <045401c71b02$d8d17a40$0281a8c0@ebpc>
	<adapsau6t1p.fsf@cisco.com> <20061209193443.GB6891@mellanox.co.il>
Message-ID: <adafybn4put.fsf@cisco.com>

 > include/net/ieee80211.h has one. It does not seem to work too well though.

Do you mean

#define IEEE80211_VERSION "git-1.1.13"

The only thing it seems useful for is printing out -- you certainly
can't compare a string like that in any sane way using the C
preprocessor.
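
To illustrate the point: #if can only evaluate integer constant
expressions, so a version has to be packed into a number before it can be
compared at compile time. The sketch below uses the same packing as the
kernel's KERNEL_VERSION(a,b,c) macro. The EXAMPLE_* names are hypothetical
and do not exist in any header.

```c
/* Hypothetical macros; the packing mirrors the kernel's KERNEL_VERSION(). */
#define EXAMPLE_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* A numeric counterpart of a string such as "git-1.1.13". */
#define EXAMPLE_VERSION_CODE EXAMPLE_VERSION(1, 1, 13)

/* This works because both sides are integer constant expressions. */
#if EXAMPLE_VERSION_CODE >= EXAMPLE_VERSION(1, 1, 0)
#define EXAMPLE_HAS_NEW_API 1
#else
#define EXAMPLE_HAS_NEW_API 0
#endif

/* Nothing comparable can be written for: #define EXAMPLE_VERSION_STR "git-1.1.13" */

int example_has_new_api(void)
{
	return EXAMPLE_HAS_NEW_API;
}
```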

 - R.


From rdreier at cisco.com  Sun Dec 10 09:39:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 09:39:18 -0800
Subject: [openib-general] version #defines for the kernel
References: <200612071104.kB7B4MTv009628@robert.bartonsoftware.com>
	<ada4ps79vh0.fsf@cisco.com> <20061208233616.GA10646@greglaptop>
Message-ID: <adaac1v4pmh.fsf@cisco.com>

 > > But you should also cope with
 > > non-OFED (vanilla upstream) drivers, probably by testing
 > > LINUX_VERSION_CODE too I suppose.
 > 
 > Although RHEL4 shows how this can break down in the future... they
 > backport kernel stuff while leaving LINUX_VERSION_CODE set to 2.6.9.

I don't think there's any sane way to handle that.  Since a backport
might only pick part of the new interface and stick with an old API
elsewhere, you can't have a single IB version number.  And I don't
want an ever-growing mass of "#define HAVE_FEATURE_BLAH" metastasizing
in the IB headers...
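
A small sketch of the trade-off being discussed, with made-up names
throughout (none of these macros exist in the kernel or in OFED). It models
a distribution kernel that still reports 2.6.9 but carries a backported
interface: a version test gives the wrong answer, while a per-feature macro
gives the right one at the cost of one more #define per feature.

```c
/* Hypothetical macros for illustration only. */
#define EXAMPLE_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* The distro kernel keeps the old version code ... */
#define EXAMPLE_DISTRO_VERSION_CODE EXAMPLE_VERSION(2, 6, 9)
/* ... but has backported one new interface, advertised per feature. */
#define EXAMPLE_HAVE_NEW_NOTIFY 1

/* Version-based detection: misses the backport. */
int example_detect_by_version(void)
{
#if EXAMPLE_DISTRO_VERSION_CODE >= EXAMPLE_VERSION(2, 6, 20)
	return 1;
#else
	return 0;
#endif
}

/* Feature-based detection: sees the backport, but needs a macro per feature. */
int example_detect_by_feature(void)
{
#ifdef EXAMPLE_HAVE_NEW_NOTIFY
	return 1;
#else
	return 0;
#endif
}
```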


From mst at mellanox.co.il  Sun Dec 10 10:29:54 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 20:29:54 +0200
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <adafybn4put.fsf@cisco.com>
References: <adafybn4put.fsf@cisco.com>
Message-ID: <20061210182954.GB1708@mellanox.co.il>

>  > include/net/ieee80211.h has one. It does not seem to work too well though.
> 
> Do you mean
> 
> #define IEEE80211_VERSION "git-1.1.13"

Yes.

> The only thing it seems useful for is printing out -- you certainly
> can't compare a string like that in any sane way using the C
> preprocessor.

Right. Intel's out of tree drivers which I looked at at some point
try to run scripts to parse this, and fail miserably.

-- 
MST


From mst at mellanox.co.il  Sun Dec 10 11:54:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 21:54:05 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165767642.26559.111898.camel@hal.voltaire.com>
References: <1165701912.26559.65050.camel@hal.voltaire.com>
	<20061210064346.GC10403@mellanox.co.il>
	<1165767642.26559.111898.camel@hal.voltaire.com>
Message-ID: <20061210195405.GE1708@mellanox.co.il>

> > > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > > automatically apply the patches from emails to the local pre-trunk
> > > > > tree? Or do you think it is insufficient?
> > > >
> > > > This sounds like a fragile process. It seems much easier to just
> > > > have an unstable branch with untested patches. No?
> > >
> > > Untested is an overexaggeration. They are tested but not by Eitan's
> > > regression.
> > 
> > Sorry, I'm not trying to influence any policy decisions here, I'm coming
> > purely from git angle. *If* you want Eitan to test and Ack some patches,
> > *and want to automate the testing part*, the simplest thing to do is
> > to apply them on some git branch.
> 
> Couldn't he also back off the head on the "trunk" if that doesn't work
> too ? That (which version) could be taken as input to the automatic
> regression with less overhead than another branch to have to track or
> figuring out how to apply patches automagically.

No, this is backwards - rewinding history in trunk branch will break git pull for anyone
who tries to base his work on that, so that's not a good idea.
Or you get a lot of little
"feature X"
"unbreak feature X"
...
"fix feature X"
commits that just make the history log messy and unreadable.

Guys, don't be so scared of branches, they don't really have
any significant overhead in git: branch (and tag) are basically
just symbolic names for commit.

There's no "maintenance" associated with it that I know of. Try it.
This is how e.g. git itself seems to be developed: there's a main branch for next release,
next branch for less stable stuff and "pu" branch for experimental stuff,
and there's a bugfix branch for last stable release.

-- 
MST


From sashak at voltaire.com  Sun Dec 10 12:52:03 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 10 Dec 2006 22:52:03 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210195405.GE1708@mellanox.co.il>
References: <1165701912.26559.65050.camel@hal.voltaire.com>
	<20061210064346.GC10403@mellanox.co.il>
	<1165767642.26559.111898.camel@hal.voltaire.com>
	<20061210195405.GE1708@mellanox.co.il>
Message-ID: <20061210205203.GA21155@sashak.voltaire.com>

On 21:54 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > > > > > Eitan, how is it hard for you to prepare procmail's rule which will
> > > > > > automatically apply the patches from emails to the local pre-trunk
> > > > > > tree? Or do you think it is insufficient?
> > > > >
> > > > > This sounds like a fragile process. It seems much easier to just
> > > > > have an unstable branch with untested patches. No?
> > > >
> > > > Untested is an overexaggeration. They are tested but not by Eitan's
> > > > regression.
> > > 
> > > Sorry, I'm not trying to influence any policy decisions here, I'm coming
> > > purely from git angle. *If* you want Eitan to test and Ack some patches,
> > > *and want to automate the testing part*, the simplest thing to do is
> > > to apply them on some git branch.
> > 
> > Couldn't he also back off the head on the "trunk" if that doesn't work
> > too ? That (which version) could be taken as input to the automatic
> > regression with less overhead than another branch to have to track or
> > figuring out how to apply patches automagically.
> 
> No, this is backwards - rewinding history in trunk branch will break git pull for anyone
> who tries to base his work on that, so that's not a good idea.

I think Hal was talking about rewinding a local tree; there is nothing
wrong with that.

In general, non-linear history changes in public repositories are not
"impossible": this should basically work, but it may require additional
merging effort from pullers.

I also think that it is better to not do it, at least not now.

> Or you get a lot of little
> "feature X"
> "unbreak feature X"
> ...
> "fix feature X"
> commits that just make the history log messy and unreadable.
> 
> Guys, don't be so scared of branches, they don't really have
> any significant overhead in git: branch (and tag) are basically
> just symbolic names for commit.

Right, a branch in git is cheap, and if one needs a branch in his tree he
can just create it there; it is not necessary to ask the origin tree's
maintainer to create the branch for him.

Sasha

> 
> There's no "maintenance" associated with it that I know of. Try it.
> This is how e.g. git itself seems to be developed: there's a main branch for next release,
> next branch for less stable stuff and "pu" branch for experimental stuff,
> and there's a bugfix branch for last stable release.
> 
> -- 
> MST
> 


From mst at mellanox.co.il  Sun Dec 10 13:05:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:05:43 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210205203.GA21155@sashak.voltaire.com>
References: <20061210205203.GA21155@sashak.voltaire.com>
Message-ID: <20061210210543.GB9205@mellanox.co.il>

> > Guys, don't be so scared of branches, they don't really have
> > any significant overhead in git: branch (and tag) are basically
> > just symbolic names for commit.
> 
> Right, branch in git is cheap, and if one needs branch in his tree he can
> just create this branch in his tree, it is not necessary to ask origin's
> tree maintainer to create this branch for him.

I agree.
If, on the other hand, the tree maintainer wants someone else to test a set of
patches automatically, the simplest way is for *said maintainer* to create a
known branch or tag with that patch set, and have test scripts pick that up.

It's really simple - if you want people to help, make it easy on them.

-- 
MST


From adit.262 at gmail.com  Sun Dec 10 13:26:18 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Sun, 10 Dec 2006 16:26:18 -0500
Subject: [openib-general] Assigning IP addresses to IB interfaces
In-Reply-To: <457BBA6A.3020209@voltaire.com>
References: <d2ad857f0612081431q6decd412o2718019aaed1ae03@mail.gmail.com>
	<457BBA6A.3020209@voltaire.com>
Message-ID: <d2ad857f0612101326n71f9aa2dk9cf82afc0cefb1d4@mail.gmail.com>

I tried assigning IP addresses to the IB interfaces: "ifconfig ib1 10.0.0.1"
on one machine and "ifconfig ib1 10.0.0.2" on the other.
A "ping 10.0.0.2 -I ib1" from the first says destination host unreachable.

Is there anything specific to be done for being able to ping between
the 2 interfaces?

Thanks,
Adit

On 12/10/06, Or Gerlitz <ogerlitz at voltaire.com> wrote:
> Adit Ranadive wrote:
> > I have installed the OpenIB gen2 driver but the IB interfaces havent
> > been assigned any IP addresses..
> > Is it possible to assign them ip addresses using ifconfig and ping
> > between the interfaces of two machines?
>
> yes
>
>


-- 
Adit Ranadive
Freshman,
Georgia Institute of Technology,
Atlanta, GA


From mst at mellanox.co.il  Sun Dec 10 13:39:20 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:39:20 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061129214302.GF18978@sashak.voltaire.com>
References: <1164829955.28427.69.camel@stevo-desktop>
	<20061129203916.GL16763@sashak.voltaire.com>
	<1164835084.28427.83.camel@stevo-desktop>
	<20061129214302.GF18978@sashak.voltaire.com>
Message-ID: <20061210213920.GF9205@mellanox.co.il>

> Sean, you can do
> 
>   chmod 755 hooks/post-update
> 
> This hook runs git-update-server-info after each push.

It seems we really want this as default.
Sasha, could you please
chmod 755 /usr/share/git-core/templates/hooks/pre-commit
so that this will be the default for all new users?

-- 
MST


From mst at mellanox.co.il  Sun Dec 10 13:40:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:40:35 +0200
Subject: [openib-general] Assigning IP addresses to IB interfaces
In-Reply-To: <d2ad857f0612101326n71f9aa2dk9cf82afc0cefb1d4@mail.gmail.com>
References: <d2ad857f0612081431q6decd412o2718019aaed1ae03@mail.gmail.com>
	<457BBA6A.3020209@voltaire.com>
	<d2ad857f0612101326n71f9aa2dk9cf82afc0cefb1d4@mail.gmail.com>
Message-ID: <20061210214035.GG9205@mellanox.co.il>

Any chance that SM isn't running on the fabric?
Did the ports come up?

Quoting r. Adit Ranadive <adit.262 at gmail.com>:
Subject: Re: Assigning IP addresses to IB interfaces

I tried assigning IP addresses to the IB interfaces: "ifconfig ib1 10.0.0.1"
on one machine and "ifconfig ib1 10.0.0.2" on the other.
A "ping 10.0.0.2 -I ib1" from the first says destination host unreachable.

Is there anything specific to be done for being able to ping between
the 2 interfaces?

Thanks,
Adit

On 12/10/06, Or Gerlitz <ogerlitz at voltaire.com> wrote:
> Adit Ranadive wrote:
> > I have installed the OpenIB gen2 driver but the IB interfaces havent
> > been assigned any IP addresses..
> > Is it possible to assign them ip addresses using ifconfig and ping
> > between the interfaces of two machines?
>
> yes
>
>


-- 
Adit Ranadive
Freshman,
Georgia Institute of Technology,
Atlanta, GA

-- 
MST


From sashak at voltaire.com  Sun Dec 10 13:50:33 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 10 Dec 2006 23:50:33 +0200
Subject: [openib-general] userspace git trees
Message-ID: <20061210215033.GC21155@sashak.voltaire.com>

Hi,

Recently I found this OFA 'Userspace Git Trees' downloading howto:

https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories

and thought that we could make it simpler for end users to choose the
"right" git tree just by adding one more series of symbolic links under
/pub/scm. These links would point to the maintainers' "official" trees, and
we would have only one such link per project.

So a typical download howto for end users would look like:

  git clone git://staging.openfabrics.org/dapl
  git clone git://staging.openfabrics.org/ibutils
  git clone git://staging.openfabrics.org/imgen
  ...

instead of

  git clone git://staging.openfabrics.org/~ardavis/dapl
  git clone git://staging.openfabrics.org/~eitan/ibutils
  git clone git://staging.openfabrics.org/~mst/imgen
  ...

as it is now.


To illustrate this I've already added a couple of such symbolic links
under /pub/scm, and they are now visible via gitweb:

  http://staging.openfabrics.org/git

Comments, objections?


(I did this just to show how it looks and probably missed some
projects. And of course I will remove those links if this idea is
rejected.)

Sasha


From mst at mellanox.co.il  Sun Dec 10 13:59:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 10 Dec 2006 23:59:57 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210215033.GC21155@sashak.voltaire.com>
References: <20061210215033.GC21155@sashak.voltaire.com>
Message-ID: <20061210215956.GI9205@mellanox.co.il>

> Recently I found this OFA 'Userspace Git Trees' downloading howto:
> 
> https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> 
> and thought that we could make it simpler for end-user to choose the
> "right" git tree just by adding one more series of symbolic links under
> /pub/scm. This links will point to the maintainer's "official" trees, and
> we will have only one such link per project.
> 
> So typical downloading howto for end-users will looks like:
> 
>   git clone git://staging.openfabrics.org/dapl
>   git clone git://staging.openfabrics.org/ibutils
>   git clone git://staging.openfabrics.org/imgen
>   ...
> 
> instead of
> 
>   git clone git://staging.openfabrics.org/~ardavis/dapl
>   git clone git://staging.openfabrics.org/~eitan/ibutils
>   git clone git://staging.openfabrics.org/~mst/imgen
>   ...
> 
> as it is now.

NACK, please remove this. These soft links are messy, and
the fact that one needs root just to add a tree shows just how the approach
is broken.

If you have some temporary tree, just mention this in description,
and gitweb will show this. And won't the problem basically go away
if you move ~sashak temporary trees out of ~/scm? It seems we don't
have a lot of duplicates besides that.

<rant>
But in the long run, no development git tree is or should be the *official*
one - otherwise we get back to the mess we had with svn, with people
pushing for inclusion in the "official" tree just to get visibility.
The result? The "official" tree then becomes also the least stable.

What we need is official *releases*. Not official development trees.
And end users should either stick to releases or know what they are doing
and select the tree they actually *want*.
</rant>

-- 
MST


From swise at opengridcomputing.com  Sun Dec 10 14:04:33 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:04:33 -0600
Subject: [openib-general] [PATCH] - ucma updates for miscdev changes
Message-ID: <1165788273.25243.8.camel@linux-q667.site>

Sean, 

As part of merging up to linus's tree as of 12/8/2006, I had to change
ucma.c to support changes in the miscdevice stuff.  Below is a patch for
this.  In addition to this change, I had to merge your ucma patches to
get them to apply.  Nothing functional changed, I don't think, but some
of the changes in your tree are already in linus's tree, so those
patches were ignored.  And one didn't apply cleanly and I had to fix it
manually.    

You can see these changes including the patch below as a single patch in
git://staging.openfabrics.org/~swise/cxgb3.git commit number:
d1ac2e74680d61a5e87165e1c6b4cec44533f2bd.


Signed-off-by: Steve Wise <swise at opengridcomputing.com>


-----


--- rdma-dev/drivers/infiniband/core/ucma.c	2006-12-08 11:03:31.000000000 -0600
+++ cxgb3.git/drivers/infiniband/core/ucma.c	2006-12-09 09:41:03.000000000 -0600
@@ -836,11 +836,12 @@ static struct miscdevice ucma_misc = {
 	.fops	= &ucma_fops,
 };
 
-static ssize_t show_abi_version(struct class_device *class_dev, char *buf)
+static ssize_t show_abi_version(struct device *class_dev, 
+				struct device_attribute *attr, char *buf)
 {
 	return sprintf(buf, "%d\n", RDMA_USER_CM_ABI_VERSION);
 }
-static CLASS_DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
+static DEVICE_ATTR(abi_version, S_IRUGO, show_abi_version, NULL);
 
 static int __init ucma_init(void)
 {
@@ -850,8 +851,7 @@ static int __init ucma_init(void)
 	if (ret)
 		return ret;
 
-	ret = class_device_create_file(ucma_misc.class,
-				       &class_device_attr_abi_version);
+	ret = device_create_file(ucma_misc.this_device, &dev_attr_abi_version);
 	if (ret) {
 		printk(KERN_ERR "rdma_ucm: couldn't create abi_version attr\n");
 		goto err;
@@ -864,8 +864,7 @@ err:
 
 static void __exit ucma_cleanup(void)
 {
-	class_device_remove_file(ucma_misc.class, 
-				 &class_device_attr_abi_version);
+	device_remove_file(ucma_misc.this_device, &dev_attr_abi_version);
 	misc_deregister(&ucma_misc);
 	idr_destroy(&ctx_idr);
 }


From sashak at voltaire.com  Sun Dec 10 14:18:05 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 00:18:05 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061210213920.GF9205@mellanox.co.il>
References: <1164829955.28427.69.camel@stevo-desktop>
	<20061129203916.GL16763@sashak.voltaire.com>
	<1164835084.28427.83.camel@stevo-desktop>
	<20061129214302.GF18978@sashak.voltaire.com>
	<20061210213920.GF9205@mellanox.co.il>
Message-ID: <20061210221805.GD21155@sashak.voltaire.com>

On 23:39 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > Sean, you can do
> > 
> >   chmod 755 hooks/post-update
> > 
> > This hook runs git-update-server-info after each push.
> 
> It seems we really want this as default.
> Sasha, could you please
> chmod 755 /usr/share/git-core/templates/hooks/pre-commit
> so that this will be the default for all new users?

I would prefer not to do this. Having all hooks "off" is a reasonable
default IMO, and it should be the tree maintainer's decision whether to
enable a specific hook.

If somebody needs help with setup, we can help, or we could write sort
of 'howto' if there are common problems. But I think we cannot take
"ownership" there.

Sasha


From sashak at voltaire.com  Sun Dec 10 14:33:29 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 00:33:29 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <20061210210543.GB9205@mellanox.co.il>
References: <20061210205203.GA21155@sashak.voltaire.com>
	<20061210210543.GB9205@mellanox.co.il>
Message-ID: <20061210223329.GE21155@sashak.voltaire.com>

On 23:05 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > > Guys, don't be so scared of branches, they don't really have
> > > any significant overhead in git: branch (and tag) are basically
> > > just symbolic names for commit.
> > 
> > Right, branch in git is cheap, and if one needs branch in his tree he can
> > just create this branch in his tree, it is not necessary to ask origin's
> > tree maintainer to create this branch for him.
> 
> I agree.
> If, on the other hand, the tree maintainer wants someone else to test a set of
> patches automatically, the simplest way is for *said maintainer* to create a
> known branch or tag with that patch set, and have test scripts pick that up.
>
> It's really simple - if you want people to help, make it easy on them.

I agree with the last sentence, but it is not the "git angle" :)

Sasha


From swise at opengridcomputing.com  Sun Dec 10 14:32:44 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:32:44 -0600
Subject: [openib-general] [PATCH  v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
Message-ID: <20061210223244.27166.36192.stgit@dell3.ogc.int>


Roland, 

I believe all comments so far have been incorporated.

Version 3 changes:

- BugFix: Don't use mutex inside of the mmap function.
- BugFix: Move QP to TERMINATE when TERMINATE AE is processed
- Support the new work queue design
- Merged up to linus's tree as of 12/8/2006
- Misc nits

Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead
  of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
Ethernet driver which is also under review now for 2.6.20. See

http://www.mail-archive.com/netdev at vger.kernel.org/msg27801.html

for the latest posting of the T3 Ethernet driver.

This patch series is against Linus's tree as of 12/8/2006 and can also
be pulled from:

	http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v3.tar.bz2

The Chelsio T3 Ethernet driver patch can be pulled from:

	http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

	git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.


From swise at opengridcomputing.com  Sun Dec 10 14:33:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:33:15 -0600
Subject: [openib-general] [PATCH  v3 01/13] Linux RDMA Core Changes
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223314.27166.28952.stgit@dell3.ogc.int>


Support provider-specific data in ib_uverbs_cmd_req_notify_cq().
The Chelsio iwarp provider library needs to pass information to the
kernel verb for re-arming the CQ.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/core/uverbs_cmd.c      |    9 +++++++--
 drivers/infiniband/hw/amso1100/c2.h       |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c    |    3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c    |    3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c    |    4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c    |    6 ++++--
 drivers/infiniband/hw/mthca/mthca_dev.h   |    4 ++--
 include/rdma/ib_verbs.h                   |    5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 				int out_len)
 {
 	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_udata		      udata;
 	struct ib_cq                  *cq;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 	if (!cq)
 		return -EINVAL;
 
-	ib_req_notify_cq(cq, cmd.solicited_only ?
-			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+	INIT_UDATA(&udata, buf + sizeof cmd, 0,
+		   in_len - sizeof cmd, 0); 
+
+	cq->device->req_notify_cq(cq, cmd.solicited_only ?
+				  IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+				  &udata);
 
 	put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..9a76869 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+	      struct ib_udata *udata)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: user data
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 8039f6e..0d39960 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..15cbd49 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -722,7 +722,8 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
 {
 	__be32 doorbell[2];
 
@@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index fe5cecf..6b9ccf6 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8eacc35..e3e1a2c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -941,7 +941,8 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify cq_notify,
+						    struct ib_udata *udata);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, cq_notify, NULL);
 }
 
 /**
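
[Note, not part of the posted patch] The change above widens the
req_notify_cq device method with an optional udata argument while the
ib_req_notify_cq() wrapper keeps its two-argument form and passes NULL.
Below is a minimal userspace sketch of that pattern, assuming simplified
stand-in types; struct udata, struct device_ops, demo_arm_cq and the
rptr field are illustrative names, not the real kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum cq_notify { CQ_SOLICITED, CQ_NEXT_COMP };
struct udata { unsigned int rptr; };
struct cq;

struct device_ops {
	/* The extra udata argument is NULL for in-kernel callers. */
	int (*req_notify_cq)(struct cq *cq, enum cq_notify notify,
			     struct udata *udata);
};

struct cq {
	const struct device_ops *ops;
	unsigned int rptr;
};

/* A driver method that uses the optional argument only when present. */
static int demo_arm_cq(struct cq *cq, enum cq_notify notify,
		       struct udata *udata)
{
	(void)notify;
	if (udata)
		cq->rptr = udata->rptr;
	return 0;
}

static const struct device_ops demo_ops = {
	.req_notify_cq = demo_arm_cq,
};

/* Wrapper for existing callers: signature unchanged, udata is NULL. */
static inline int req_notify_cq(struct cq *cq, enum cq_notify notify)
{
	return cq->ops->req_notify_cq(cq, notify, NULL);
}
```

Existing callers of the wrapper need no source changes, and a driver
that ignores udata behaves exactly as before.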


From swise at opengridcomputing.com  Sun Dec 10 14:33:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:33:45 -0600
Subject: [openib-general] [PATCH v3 02/13] Device Discovery and ULLD Linkage
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223345.27166.26908.stgit@dell3.ogc.int>


Code to discover all the T3 devices and register them 
with the T3 RDMA Core and the Linux RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch.c |  189 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +++++++++++++++++++++++++++++++++
 2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 0000000..acbe449
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+#include "iwch_user.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+
+#define DRV_VERSION "1.1"
+
+MODULE_AUTHOR("Boyd Faulkner, Steve Wise");
+MODULE_DESCRIPTION("Chelsio T3 RDMA Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+
+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+
+static void open_rnic_dev(struct t3cdev *);
+static void close_rnic_dev(struct t3cdev *);
+
+struct cxgb3_client t3c_client = {
+	.name = "iw_cxgb3",
+	.add = open_rnic_dev,
+	.remove = close_rnic_dev,
+	.handlers = t3c_handlers,
+	.redirect = iwch_ep_redirect
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_mutex);
+
+static void rnic_init(struct iwch_dev *rnicp)
+{
+	PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
+	idr_init(&rnicp->cqidr);
+	idr_init(&rnicp->qpidr);
+	idr_init(&rnicp->mmidr);
+	spin_lock_init(&rnicp->lock);
+
+	rnicp->attr.vendor_id = 0x168;
+	rnicp->attr.vendor_part_id = 7;
+	rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
+	rnicp->attr.max_wrs = (1UL << 24) - 1;
+	rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
+	rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
+	rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
+	rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
+	rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev);
+	rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
+	rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
+	rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;	/* 4KB-128MB */
+	rnicp->attr.can_resize_wq = 0;
+	rnicp->attr.max_rdma_reads_per_qp = 8;
+	rnicp->attr.max_rdma_read_resources =
+	    rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
+	rnicp->attr.max_rdma_read_qp_depth = 8;	/* IRD */
+	rnicp->attr.max_rdma_read_depth =
+	    rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
+	rnicp->attr.rq_overflow_handled = 0;
+	rnicp->attr.can_modify_ird = 0;
+	rnicp->attr.can_modify_ord = 0;
+	rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
+	rnicp->attr.stag0_value = 1;
+	rnicp->attr.zbva_support = 1;
+	rnicp->attr.local_invalidate_fence = 1;
+	rnicp->attr.cq_overflow_detection = 1;
+	return;
+}
+
+static void open_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *rnicp;
+	static int vers_printed;
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	if (!vers_printed++) 
+		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+		       DRV_VERSION);
+	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
+	if (!rnicp) {
+		printk(KERN_ERR MOD "Cannot allocate ib device\n");
+		return;
+	}
+	rnicp->rdev.ulp = rnicp;
+	rnicp->rdev.t3cdev_p = tdev;
+
+	if (cxio_rdev_open(&rnicp->rdev)) {
+		printk(KERN_ERR MOD "Unable to open CXIO rdev\n");
+		ib_dealloc_device(&rnicp->ibdev);
+		return;
+	}
+
+	rnic_init(rnicp);
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&rnicp->entry, &dev_list);
+	mutex_unlock(&dev_mutex);
+
+	if (iwch_register_device(rnicp)) {
+		printk(KERN_ERR MOD "Unable to register device\n");
+		close_rnic_dev(tdev);
+		return;	/* close_rnic_dev() freed rnicp */
+	}
+	printk(KERN_INFO MOD "Initialized device %s\n",
+	       pci_name(rnicp->rdev.rnic_info.pdev));
+}
+
+static void close_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *dev, *tmp;
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	mutex_lock(&dev_mutex);
+	list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
+		if (dev->rdev.t3cdev_p == tdev) {
+			list_del(&dev->entry);
+			iwch_unregister_device(dev);
+			cxio_rdev_close(&dev->rdev);
+			idr_destroy(&dev->cqidr);
+			idr_destroy(&dev->qpidr);
+			idr_destroy(&dev->mmidr);
+			ib_dealloc_device(&dev->ibdev);
+			break;
+		}
+	}
+	mutex_unlock(&dev_mutex);
+}
+
+extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb);
+
+static int __init iwch_init_module(void)
+{
+	int err;
+
+	err = cxio_hal_init();
+	if (err) 
+		return err;
+	err = iwch_cm_init();
+	if (err) 
+		return err;
+	cxio_register_ev_cb(iwch_ev_dispatch);
+	cxgb3_register_client(&t3c_client);
+	return 0;
+}
+
+static void __exit iwch_exit_module(void)
+{
+	cxgb3_unregister_client(&t3c_client);
+	cxio_unregister_ev_cb(iwch_ev_dispatch);
+	iwch_cm_term();
+	cxio_hal_exit();
+}
+
+module_init(iwch_init_module);
+module_exit(iwch_exit_module);
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
new file mode 100644
index 0000000..752b6ad
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -0,0 +1,175 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_H__
+#define __IWCH_H__
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/idr.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+
+struct iwch_pd;
+struct iwch_cq;
+struct iwch_qp;
+struct iwch_mr;
+
+struct iwch_rnic_attributes {
+	u32 vendor_id;
+	u32 vendor_part_id;
+	u32 max_qps;
+	u32 max_wrs;				/* Max for any SQ/RQ */
+	u32 max_sge_per_wr;
+	u32 max_sge_per_rdma_write_wr;	/* for RDMA Write WR */
+	u32 max_cqs;
+	u32 max_cqes_per_cq;
+	u32 max_mem_regs;
+	u32 max_phys_buf_entries;		/* for phys buf list */
+	u32 max_pds;
+
+	/*
+	 * The memory page sizes supported by this RNIC.
+	 * Bit position i in the bitmap indicates a page of
+	 * size 4KB << i.  Phys block list mode unsupported.
+	 */
+	u32 mem_pgsizes_bitmask;
+	u8 can_resize_wq;
+
+	/*
+	 * The maximum number of RDMA Reads that can be outstanding 
+	 * per QP with this RNIC as the target. 
+	 */
+	u32 max_rdma_reads_per_qp;
+
+	/*
+	 * The maximum number of resources used for RDMA Reads
+	 * by this RNIC with this RNIC as the target. 
+	 */
+	u32 max_rdma_read_resources;
+
+	/*
+	 * The max depth per QP for initiation of RDMA Read
+	 * by this RNIC.  
+	 */
+	u32 max_rdma_read_qp_depth;
+
+	/*
+	 * The maximum depth for initiation of RDMA Read 
+	 * operations by this RNIC on all QPs 
+	 */
+	u32 max_rdma_read_depth;
+	u8 rq_overflow_handled;
+	u32 can_modify_ird;
+	u32 can_modify_ord;
+	u32 max_mem_windows;
+	u32 stag0_value;
+	u8 zbva_support;
+	u8 local_invalidate_fence;
+	u32 cq_overflow_detection;
+};
+
+struct iwch_dev {
+	struct ib_device ibdev;
+	struct cxio_rdev rdev;
+	u32 device_cap_flags;
+	struct iwch_rnic_attributes attr;
+	struct idr cqidr;
+	struct idr qpidr;
+	struct idr mmidr;
+	spinlock_t lock;
+	struct list_head entry;
+};
+
+static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct iwch_dev, ibdev);
+}
+
+static inline int t3b_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3B);
+}
+
+static inline int t3a_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3A);
+}
+
+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid)
+{
+	return idr_find(&rhp->cqidr, cqid);
+}
+
+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid)
+{
+	return idr_find(&rhp->qpidr, qpid);
+}
+
+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid)
+{
+	return idr_find(&rhp->mmidr, mmid);
+}
+
+static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, 
+				void *handle, u32 id)
+{
+	int ret;
+	u32 newid;
+
+	do {
+		if (!idr_pre_get(idr, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		spin_lock_irq(&rhp->lock);
+		ret = idr_get_new_above(idr, handle, id, &newid);
+		BUG_ON(newid != id);
+		spin_unlock_irq(&rhp->lock);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
+{
+	spin_lock_irq(&rhp->lock);
+	idr_remove(idr, id);
+	spin_unlock_irq(&rhp->lock);
+}
+
+extern struct cxgb3_client t3c_client;
+extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+#endif
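
[Note, not part of the posted patch] The discovery flow in iwch.c is a
client with add/remove callbacks that the lower-level driver invokes per
device, plus a private list of opened devices. Below is a minimal
userspace sketch of that shape, assuming simplified stand-in types;
struct lowdev, struct client, register_client and count_devs are
hypothetical names, and the dev_mutex locking used by the patch is
omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real cxgb3 and RDMA core types differ. */
struct lowdev { int id; };

struct client {
	const char *name;
	void (*add)(struct lowdev *);
	void (*remove)(struct lowdev *);
};

struct rnic {
	struct lowdev *ldev;
	struct rnic *next;
};

static struct rnic *dev_list;	/* devices this client has opened */

static void open_dev(struct lowdev *ldev)
{
	struct rnic *r = calloc(1, sizeof(*r));

	if (!r)
		return;
	r->ldev = ldev;
	r->next = dev_list;
	dev_list = r;
}

static void close_dev(struct lowdev *ldev)
{
	struct rnic **pp;

	for (pp = &dev_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->ldev == ldev) {
			struct rnic *r = *pp;

			*pp = r->next;	/* unlink before freeing */
			free(r);
			return;
		}
	}
}

static struct client demo_client = {
	.name = "demo",
	.add = open_dev,
	.remove = close_dev,
};

/* Stand-in for the low-level driver announcing devices to a client. */
static void register_client(struct client *c, struct lowdev *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		c->add(&devs[i]);
}

/* Number of devices currently tracked by the client. */
static size_t count_devs(void)
{
	size_t n = 0;
	struct rnic *r;

	for (r = dev_list; r; r = r->next)
		n++;
	return n;
}
```

Unlinking the entry before freeing it mirrors the list_del() followed by
ib_dealloc_device() ordering in close_rnic_dev().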


From swise at opengridcomputing.com  Sun Dec 10 14:34:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:34:15 -0600
Subject: [openib-general] [PATCH v3 03/13] Provider Methods and Data
	Structures
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223415.27166.42003.stgit@dell3.ogc.int>


Provider methods to support the Linux RDMA verbs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  363 ++++++++
 drivers/infiniband/hw/cxgb3/iwch_user.h     |   68 ++
 3 files changed, 1602 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
new file mode 100644
index 0000000..e9721b1
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ethtool.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include <cxio_hal.h>
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+#include "iwch_user.h"
+
+static int iwch_modify_port(struct ib_device *ibdev,
+			    u8 port, int port_modify_mask,
+			    struct ib_port_modify *props)
+{
+	return -ENOSYS;
+}
+
+static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
+				    struct ib_ah_attr *ah_attr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static int iwch_ah_destroy(struct ib_ah *ah)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_process_mad(struct ib_device *ibdev,
+			    int mad_flags,
+			    u8 port_num,
+			    struct ib_wc *in_wc,
+			    struct ib_grh *in_grh,
+			    struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	return -ENOSYS;
+}
+
+static int iwch_dealloc_ucontext(struct ib_ucontext *context)
+{
+	struct iwch_dev *rhp = to_iwch_dev(context->device);
+	struct iwch_ucontext *ucontext = to_iwch_ucontext(context);
+	PDBG("%s context %p\n", __FUNCTION__, context);
+	cxio_release_ucontext(&rhp->rdev, &ucontext->uctx);
+	kfree(ucontext);
+	return 0;
+}
+
+static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev,
+					struct ib_udata *udata)
+{
+	struct iwch_ucontext *context;
+	struct iwch_dev *rhp = to_iwch_dev(ibdev);
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+	cxio_init_ucontext(&rhp->rdev, &context->uctx);
+	INIT_LIST_HEAD(&context->mmaps);
+	spin_lock_init(&context->mmap_lock);
+	return &context->ibucontext;
+}
+
+static int iwch_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct iwch_cq *chp;
+
+	PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq);
+	chp = to_iwch_cq(ib_cq);
+
+	remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid);
+	atomic_dec(&chp->refcnt);
+	wait_event(chp->wait, !atomic_read(&chp->refcnt));
+
+	cxio_destroy_cq(&chp->rhp->rdev, &chp->cq);
+	kfree(chp);
+	return 0;
+}
+
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+			     struct ib_ucontext *context,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	struct iwch_create_cq_resp uresp;
+
+	PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries);
+	rhp = to_iwch_dev(ibdev);
+	chp = kzalloc(sizeof(*chp), GFP_KERNEL);
+	if (!chp)
+		return ERR_PTR(-ENOMEM);
+
+	if (t3a_device(rhp)) {
+
+		/*
+		 * T3A: Add some fluff to handle extra CQEs inserted 
+		 * for various errors.
+		 * Additional CQE possibilities:
+		 *      TERMINATE,
+		 *      incoming RDMA WRITE Failures
+		 *      incoming RDMA READ REQUEST FAILUREs
+		 * NOTE: We cannot ensure the CQ won't overflow.
+		 */
+		entries += 16; 
+	}
+	entries = roundup_pow_of_two(entries);
+	chp->cq.size_log2 = ilog2(entries);
+
+	if (cxio_create_cq(&rhp->rdev, &chp->cq)) {
+		kfree(chp);
+		return ERR_PTR(-ENOMEM);
+	}
+	chp->rhp = rhp;
+	chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1;
+	spin_lock_init(&chp->lock);
+	atomic_set(&chp->refcnt, 1);
+	init_waitqueue_head(&chp->wait);
+	insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid);
+
+	if (context) {
+		struct iwch_mm_entry *mm;
+
+		mm = kmalloc(sizeof *mm, GFP_KERNEL);
+		if (!mm) {
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-ENOMEM);
+		}
+		uresp.cqid = chp->cq.cqid;
+		uresp.size_log2 = chp->cq.size_log2;
+		uresp.physaddr = virt_to_phys(chp->cq.queue);
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm);
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-EFAULT);
+		}
+		mm->addr = uresp.physaddr;
+		mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * 
+					     sizeof (struct t3_cqe));
+		insert_mmap(to_iwch_ucontext(context), mm);
+	}
+	PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n",
+	     chp->cq.cqid, chp, (1 << chp->cq.size_log2), 
+	     (u64)chp->cq.dma_addr);
+	return &chp->ibcq;
+}
+
+static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
+{
+	struct iwch_cq *chp = to_iwch_cq(cq);
+	struct t3_cq oldcq, newcq;
+	int ret;
+
+	PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe);
+
+	/* We don't downsize... */
+	if (cqe <= cq->cqe)
+		return 0;
+
+	/* create new t3_cq with new size */
+	cqe = roundup_pow_of_two(cqe+1);
+	newcq.size_log2 = ilog2(cqe);
+
+	/* Don't allow resize to less than the current wce count */
+	if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) {
+		return -ENOMEM;
+	}
+
+	/* Quiesce all QPs using this CQ */
+	ret = iwch_quiesce_qps(chp);
+	if (ret) {
+		return ret;
+	}
+
+	ret = cxio_create_cq(&chp->rhp->rdev, &newcq);
+	if (ret) {
+		/* chp is still the live CQ here; it must not be freed */
+		return ret;
+	}
+	
+	/* copy CQEs */
+	memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * 
+				        sizeof(struct t3_cqe));
+
+	/* old iwch_cq gets new t3_cq but keeps old cqid */
+	oldcq = chp->cq;
+	chp->cq = newcq;
+	chp->cq.cqid = oldcq.cqid;
+
+	/* resize new t3_cq to update the HW context */
+	ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq);
+	if (ret) {
+		chp->cq = oldcq;
+		return ret;
+	}
+	chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1;
+
+	/* destroy old t3_cq */
+	oldcq.cqid = newcq.cqid;
+	ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq);
+	if (ret) {
+		printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", 
+			__FUNCTION__, ret);
+	}
+	
+	/* add user hooks here */
+
+	/* resume qps */
+	ret = iwch_resume_qps(chp);
+	return ret;
+}
+
+static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	enum t3_cq_opcode cq_op;
+	int err;
+	unsigned long flag;
+	struct iwch_req_notify_cq ucmd;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+	if (notify == IB_CQ_SOLICITED)
+		cq_op = CQ_ARM_SE;
+	else
+		cq_op = CQ_ARM_AN;
+	if (udata && t3b_device(rhp)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+			return -EFAULT;
+		spin_lock_irqsave(&chp->lock, flag);
+		chp->cq.rptr = ucmd.rptr;
+	} else
+		spin_lock_irqsave(&chp->lock, flag);
+	PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr);
+	err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0);
+	spin_unlock_irqrestore(&chp->lock, flag);
+	if (err) 
+		printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, 
+		       chp->cq.cqid);
+	return err;
+}
+
+static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	int len = vma->vm_end - vma->vm_start;
+	u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT;
+	struct cxio_rdev *rdev_p;
+	int ret = 0;
+	struct iwch_mm_entry *mm;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, 
+	     pgaddr, len);
+
+	if (vma->vm_start & (PAGE_SIZE-1)) {
+		return -EINVAL;
+	}
+
+	rdev_p = &(to_iwch_dev(context->device)->rdev);
+	ucontext = to_iwch_ucontext(context);
+
+	mm = remove_mmap(ucontext, pgaddr, len);
+	if (!mm)
+		return -EINVAL;
+	kfree(mm);
+
+	if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && 
+	    (pgaddr < (rdev_p->rnic_info.udbell_physbase + 
+		       rdev_p->rnic_info.udbell_len))) {
+
+		/*
+		 * Map T3 DB register.
+		 */
+		if (vma->vm_flags & VM_READ) {
+			return -EPERM;
+		}
+
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+		vma->vm_flags &= ~VM_MAYREAD;
+		ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	} else {
+
+		/*
+		 * Map WQ or CQ contig dma memory...
+		 */
+		ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	}
+	
+	return ret;
+}
+
+static int iwch_deallocate_pd(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid);
+	cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid);
+	kfree(php);
+	return 0;
+}
+
+static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev,
+			       struct ib_ucontext *context,
+			       struct ib_udata *udata)
+{
+	struct iwch_pd *php;
+	u32 pdid;
+	struct iwch_dev *rhp;
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	rhp = to_iwch_dev(ibdev);
+	pdid = cxio_hal_get_pdid(rhp->rdev.rscp);
+	if (!pdid)
+		return ERR_PTR(-EINVAL);
+	php = kzalloc(sizeof(*php), GFP_KERNEL);
+	if (!php) {
+		cxio_hal_put_pdid(rhp->rdev.rscp, pdid);
+		return ERR_PTR(-ENOMEM);
+	}
+	php->pdid = pdid;
+	php->rhp = rhp;
+	if (context) {
+		if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) {
+			iwch_deallocate_pd(&php->ibpd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+	PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php);
+	return &php->ibpd;
+}
+ 
+static int iwch_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mr *mhp;
+	u32 mmid;
+
+	PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr);
+	/* There can be no memory windows */
+	if (atomic_read(&ib_mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(ib_mr);
+	rhp = mhp->rhp;
+	mmid = mhp->attr.stag >> 8;
+	cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, 
+		       mhp->attr.pbl_addr);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	if (mhp->kva)
+		kfree((void *) (unsigned long) mhp->kva);
+	PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd,
+					struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					int acc,
+					u64 *iova_start)
+{
+	__be64 *page_list;
+	int shift;
+	u64 total_size;
+	int npages;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	int ret;
+		
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+
+	acc = iwch_convert_access(acc);
+
+	
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start,
+			 	   &total_size, &npages, &shift, &page_list);
+	if (ret) 
+		goto err;
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+
+	/* NOTE: TPT perms are backwards from BIND WR perms! */
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+
+	mhp->attr.va_fbo = *iova_start;
+	mhp->attr.page_size = shift - 12;
+
+	mhp->attr.len = (u32) total_size;
+	mhp->attr.pbl_size = npages;
+	ret = iwch_register_mem(rhp, php, mhp, shift, page_list);
+	kfree(page_list);
+	if (ret) {
+		goto err;
+	}
+	return &mhp->ibmr;
+err:
+	kfree(mhp);
+	return ERR_PTR(ret);
+	
+}
+
+static int iwch_reregister_phys_mem(struct ib_mr *mr, 
+				     int mr_rereg_mask,
+				     struct ib_pd *pd,
+                                     struct ib_phys_buf *buffer_list,
+                                     int num_phys_buf,
+                                     int acc, u64 * iova_start)
+{
+
+	struct iwch_mr mh, *mhp;
+	struct iwch_pd *php;
+	struct iwch_dev *rhp;
+	int new_acc;
+	__be64 *page_list = NULL;
+	int shift = 0;
+	u64 total_size;
+	int npages;
+	int ret;
+
+	PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd);
+
+	/* There can be no memory windows */
+	if (atomic_read(&mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(mr);
+	rhp = mhp->rhp;
+	php = to_iwch_pd(mr->pd);
+
+	/* make sure we are on the same adapter */
+	if (rhp != php->rhp)
+		return -EINVAL;
+
+	new_acc = mhp->attr.perms;
+
+	memcpy(&mh, mhp, sizeof *mhp);
+
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		php = to_iwch_pd(pd);
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mh.attr.perms = iwch_convert_access(acc);
+	if (mr_rereg_mask & IB_MR_REREG_TRANS)
+		ret = build_phys_page_list(buffer_list, num_phys_buf, 
+					   iova_start,
+					   &total_size, &npages, 
+					   &shift, &page_list);
+
+	ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages);
+	kfree(page_list);
+	if (ret) {
+		return ret;
+	}
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		mhp->attr.pdid = php->pdid;
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mhp->attr.perms = acc;
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		mhp->attr.zbva = 0;
+		mhp->attr.va_fbo = *iova_start;
+		mhp->attr.page_size = shift - 12;
+		mhp->attr.len = (u32) total_size;
+		mhp->attr.pbl_size = npages;
+	}
+
+	return 0;	
+}
+
+
+struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				      int acc, struct ib_udata *udata)
+{
+	__be64 *pages;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	struct iwch_reg_user_mr_resp uresp;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	shift = ffs(region->page_size) - 1;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	acc = iwch_convert_access(acc);
+
+	i = n = 0;
+
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] = cpu_to_be64(sg_dma_address(
+					&chunk->page_list[j]) +
+					region->page_size * k);
+			}
+		}
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+	mhp->attr.va_fbo = region->virt_base;
+	mhp->attr.page_size = shift - 12;
+	mhp->attr.len = (u32) region->length;
+	mhp->attr.pbl_size = i;
+	err = iwch_register_mem(rhp, php, mhp, shift, pages);
+	kfree(pages);
+	if (err)
+		goto err;
+
+	if (udata && t3b_device(rhp)) {
+		uresp.pbl_addr = (mhp->attr.pbl_addr -
+                                 rhp->rdev.rnic_info.pbl_base) >> 3;
+		PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, 
+		     uresp.pbl_addr);
+			
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			iwch_dereg_mr(&mhp->ibmr);
+			err = -EFAULT;
+			goto err;
+		}
+	}
+
+	return &mhp->ibmr;
+
+err:
+	kfree(mhp);
+	return ERR_PTR(err);
+}
+
+struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva;
+	struct ib_mr *ibmr;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+
+	/*
+	 * T3 only supports 32 bits of size.
+	 */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	kva = 0;
+	ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva);
+	return ibmr;
+}
+
+struct ib_mw *iwch_alloc_mw(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mw *mhp;
+	u32 mmid;
+	u32 stag = 0;
+	int ret;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+	ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid);
+	if (ret) {
+		kfree(mhp);
+		return ERR_PTR(ret);
+	}
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.type = TPT_MW;
+	mhp->attr.stag = stag;
+	mmid = (stag) >> 8;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag);
+	return &(mhp->ibmw);
+}
+
+int iwch_dealloc_mw(struct ib_mw *mw)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	u32 mmid;
+
+	mhp = to_iwch_mw(mw);
+	rhp = mhp->rhp;
+	mmid = (mw->rkey) >> 8;
+	cxio_deallocate_window(&rhp->rdev, mhp->attr.stag);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static int iwch_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_qp_attributes attrs;
+	struct iwch_ucontext *ucontext;
+
+	qhp = to_iwch_qp(ib_qp);
+	rhp = qhp->rhp;
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
+	}
+	wait_event(qhp->wait, !qhp->ep);
+
+	remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid);
+
+	atomic_dec(&qhp->refcnt);
+	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
+
+	ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) 
+				  : NULL;
+	cxio_destroy_qp(&rhp->rdev, &qhp->wq, 
+			ucontext ? &ucontext->uctx : &rhp->rdev.uctx);
+
+	PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, 
+	     ib_qp, qhp->wq.qpid, qhp);
+	kfree(qhp);
+	return 0;
+}
+
+static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
+			     struct ib_qp_init_attr *attrs,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_pd *php;
+	struct iwch_cq *schp;
+	struct iwch_cq *rchp;
+	struct iwch_create_qp_resp uresp;
+	int wqsize, sqsize, rqsize;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	if (attrs->qp_type != IB_QPT_RC) 
+		return ERR_PTR(-EINVAL);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
+	rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid);
+	if (!schp || !rchp)
+		return ERR_PTR(-EINVAL);
+
+	/* The RQT size must be # of entries + 1 rounded up to a power of two */
+	rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr);
+	if (rqsize == attrs->cap.max_recv_wr)
+		rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1);
+
+	/* T3 doesn't support RQT depth < 16 */
+	if (rqsize < 16)
+		rqsize = 16;
+
+	if (rqsize > T3_MAX_RQ_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	/* 
+	 * NOTE: The SQ and total WQ sizes don't need to be
+	 * a power of two.  However, all the code assumes 
+	 * they are. EG: Q_FREECNT() and friends.
+	 */
+	sqsize = roundup_pow_of_two(attrs->cap.max_send_wr);
+	wqsize = roundup_pow_of_two(rqsize + sqsize);
+	PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, 
+	     wqsize, sqsize, rqsize);
+	qhp = kzalloc(sizeof(*qhp), GFP_KERNEL);
+	if (!qhp)
+		return ERR_PTR(-ENOMEM);
+	qhp->wq.size_log2 = ilog2(wqsize);
+	qhp->wq.rq_size_log2 = ilog2(rqsize);
+	qhp->wq.sq_size_log2 = ilog2(sqsize);
+	ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL;
+	if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq,
+			   ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) {
+		kfree(qhp);
+		return ERR_PTR(-ENOMEM);
+	}
+	attrs->cap.max_recv_wr = rqsize - 1;
+	attrs->cap.max_send_wr = sqsize;
+	qhp->rhp = rhp;
+	qhp->attr.pd = php->pdid;
+	qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid;
+	qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid;
+	qhp->attr.sq_num_entries = attrs->cap.max_send_wr;
+	qhp->attr.rq_num_entries = attrs->cap.max_recv_wr;
+	qhp->attr.sq_max_sges = attrs->cap.max_send_sge;
+	qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge;
+	qhp->attr.rq_max_sges = attrs->cap.max_recv_sge;
+	qhp->attr.state = IWCH_QP_STATE_IDLE;
+	qhp->attr.next_state = IWCH_QP_STATE_IDLE;
+
+	/* 
+	 * XXX - These don't get passed in from the openib user
+ 	 * at create time.  The CM sets them via a QP modify.
+	 * Need to fix...  I think the CM should 
+	 */
+	qhp->attr.enable_rdma_read = 1;
+	qhp->attr.enable_rdma_write = 1;
+	qhp->attr.enable_bind = 1;
+	qhp->attr.max_ord = 1;
+	qhp->attr.max_ird = 1;
+
+	spin_lock_init(&qhp->lock);
+	init_waitqueue_head(&qhp->wait);
+	atomic_set(&qhp->refcnt, 1);
+	insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid);
+
+	if (udata) {
+
+		struct iwch_mm_entry *mm1, *mm2;
+
+		mm1 = kmalloc(sizeof *mm1, GFP_KERNEL);
+		if (!mm1) {
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		mm2 = kmalloc(sizeof *mm2, GFP_KERNEL);
+		if (!mm2) {
+			kfree(mm1);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		uresp.qpid = qhp->wq.qpid;
+		uresp.size_log2 = qhp->wq.size_log2;
+		uresp.sq_size_log2 = qhp->wq.sq_size_log2;
+		uresp.rq_size_log2 = qhp->wq.rq_size_log2;
+		uresp.physaddr = virt_to_phys(qhp->wq.queue);
+		uresp.doorbell = qhp->wq.udb;
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm1);
+			kfree(mm2);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-EFAULT);
+		}
+		mm1->addr = uresp.physaddr;
+		mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr));
+		insert_mmap(ucontext, mm1);
+		mm2->addr = uresp.doorbell & PAGE_MASK;
+		mm2->len = PAGE_SIZE;
+		insert_mmap(ucontext, mm2);
+	}
+	qhp->ibqp.qp_num = qhp->wq.qpid;
+	init_timer(&(qhp->timer));
+	PDBG("%s sq_num_entries %d, rq_num_entries %d "
+	     "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n",
+	     __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries,
+	     qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2);
+	return (&qhp->ibqp);
+}
+
+static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	enum iwch_qp_attr_mask mask = 0;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp);
+
+	/* iwarp does not support the RTR state */
+	if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR))
+		attr_mask &= ~IB_QP_STATE;
+
+	/* Make sure we still have something left to do */
+	if (!attr_mask)
+		return 0;
+
+	memset(&attrs, 0, sizeof attrs);
+	qhp = to_iwch_qp(ibqp);
+	rhp = qhp->rhp;
+
+	attrs.next_state = iwch_convert_state(attr->qp_state);
+	attrs.enable_rdma_read = (attr->qp_access_flags & 
+			       IB_ACCESS_REMOTE_READ) ?  1 : 0;
+	attrs.enable_rdma_write = (attr->qp_access_flags & 
+				IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
+	attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0;
+
+
+	mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0;
+	mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? 
+			(IWCH_QP_ATTR_ENABLE_RDMA_READ |
+			 IWCH_QP_ATTR_ENABLE_RDMA_WRITE | 
+			 IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0;
+
+	return iwch_modify_qp(rhp, qhp, mask, &attrs, 0);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	atomic_inc(&(to_iwch_qp(qp)->refcnt));
+}
+
+void iwch_qp_rem_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt)))
+                wake_up(&(to_iwch_qp(qp)->wait));
+}
+
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn)
+{
+	PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn);
+	return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn);
+}
+
+
+static int iwch_query_pkey(struct ib_device *ibdev,
+			   u8 port, u16 index, u16 * pkey)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	*pkey = 0;
+	return 0;
+}
+
+static int iwch_query_gid(struct ib_device *ibdev, u8 port,
+			  int index, union ib_gid *gid)
+{
+	struct iwch_dev *dev;
+
+	PDBG("%s ibdev %p, port %d, index %d, gid %p\n",
+	       __FUNCTION__, ibdev, port, index, gid);
+	dev = to_iwch_dev(ibdev);
+	BUG_ON(port == 0 || port > 2);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6);
+	return 0;
+}
+
+static int iwch_query_device(struct ib_device *ibdev,
+			     struct ib_device_attr *props)
+{
+
+	struct iwch_dev *dev;
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+
+	dev = to_iwch_dev(ibdev);
+	memset(props, 0, sizeof *props);
+	memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	props->device_cap_flags = dev->device_cap_flags;
+	props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor;
+	props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device;
+	props->max_mr_size = ~0ull;
+	props->max_qp = dev->attr.max_qps;
+	props->max_qp_wr = dev->attr.max_wrs;
+	props->max_sge = dev->attr.max_sge_per_wr;
+	props->max_sge_rd = 1;
+	props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp;
+	props->max_cq = dev->attr.max_cqs;
+	props->max_cqe = dev->attr.max_cqes_per_cq;
+	props->max_mr = dev->attr.max_mem_regs;
+	props->max_pd = dev->attr.max_pds;
+	props->local_ca_ack_delay = 0;
+
+	return 0;
+}
+
+static int iwch_query_port(struct ib_device *ibdev,
+			   u8 port, struct ib_port_attr *props)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_SNMP_TUNNEL_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_DEVICE_MGMT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 2;
+	props->active_speed = 2;
+	props->max_msg_sz = -1;
+
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.fw_version);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.driver);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev, 
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor,
+		                       dev->rdev.rnic_info.pdev->device);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *iwch_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+int iwch_register_device(struct iwch_dev *dev)
+{
+	int ret;
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->device_cap_flags =
+	    (IB_DEVICE_ZERO_STAG |
+	     IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW);
+
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
+	dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
+	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.query_device = iwch_query_device;
+	dev->ibdev.query_port = iwch_query_port;
+	dev->ibdev.modify_port = iwch_modify_port;
+	dev->ibdev.query_pkey = iwch_query_pkey;
+	dev->ibdev.query_gid = iwch_query_gid;
+	dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext;
+	dev->ibdev.mmap = iwch_mmap;
+	dev->ibdev.alloc_pd = iwch_allocate_pd;
+	dev->ibdev.dealloc_pd = iwch_deallocate_pd;
+	dev->ibdev.create_ah = iwch_ah_create;
+	dev->ibdev.destroy_ah = iwch_ah_destroy;
+	dev->ibdev.create_qp = iwch_create_qp;
+	dev->ibdev.modify_qp = iwch_ib_modify_qp;
+	dev->ibdev.destroy_qp = iwch_destroy_qp;
+	dev->ibdev.create_cq = iwch_create_cq;
+	dev->ibdev.destroy_cq = iwch_destroy_cq;
+	dev->ibdev.resize_cq = iwch_resize_cq;
+	dev->ibdev.poll_cq = iwch_poll_cq;
+	dev->ibdev.get_dma_mr = iwch_get_dma_mr;
+	dev->ibdev.reg_phys_mr = iwch_register_phys_mem;
+	dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem;
+	dev->ibdev.reg_user_mr = iwch_reg_user_mr;
+	dev->ibdev.dereg_mr = iwch_dereg_mr;
+	dev->ibdev.alloc_mw = iwch_alloc_mw;
+	dev->ibdev.bind_mw = iwch_bind_mw;
+	dev->ibdev.dealloc_mw = iwch_dealloc_mw;
+
+	dev->ibdev.attach_mcast = iwch_multicast_attach;
+	dev->ibdev.detach_mcast = iwch_multicast_detach;
+	dev->ibdev.process_mad = iwch_process_mad;
+
+	dev->ibdev.req_notify_cq = iwch_arm_cq;
+	dev->ibdev.post_send = iwch_post_send;
+	dev->ibdev.post_recv = iwch_post_receive;
+
+
+	dev->ibdev.iwcm = kmalloc(sizeof(struct iw_cm_verbs), GFP_KERNEL);
+	if (!dev->ibdev.iwcm)
+		return -ENOMEM;
+	dev->ibdev.iwcm->connect = iwch_connect;
+	dev->ibdev.iwcm->accept = iwch_accept_cr;
+	dev->ibdev.iwcm->reject = iwch_reject_cr;
+	dev->ibdev.iwcm->create_listen = iwch_create_listen;
+	dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen;
+	dev->ibdev.iwcm->add_ref = iwch_qp_add_ref;
+	dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref;
+	dev->ibdev.iwcm->get_qp = iwch_get_qp;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		goto bail1;
+
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       iwch_class_attributes[i]);
+		if (ret)
+			goto bail2;
+	}
+	return 0;
+bail2:
+	ib_unregister_device(&dev->ibdev);
+bail1:
+	kfree(dev->ibdev.iwcm);
+	return ret;
+}
+
+void iwch_unregister_device(struct iwch_dev *dev)
+{
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i)
+		class_device_remove_file(&dev->ibdev.class_dev,
+					 iwch_class_attributes[i]);
+	ib_unregister_device(&dev->ibdev);
+	kfree(dev->ibdev.iwcm);
+}
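
The receive-queue sizing in iwch_create_qp() above can be read as a single
rounding step. For max_recv_wr >= 1, rounding up first and then bumping exact
powers of two gives the same result as roundup_pow_of_two(max_recv_wr + 1).
A minimal user-space sketch, assuming a stand-in roundup_pow2() for the kernel
helper and a hypothetical rq_depth() wrapper (neither name is in the patch),
and leaving out the T3_MAX_RQ_SIZE upper bound because its value is not shown
in this part of the series:

```c
#include <assert.h>

/*
 * User-space stand-in for the kernel's roundup_pow_of_two().
 * Assumption: only meaningful for 1 <= n <= 2^31.
 */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical helper mirroring the rqsize computation in iwch_create_qp(). */
static unsigned int rq_depth(unsigned int max_recv_wr)
{
	unsigned int rqsize;

	assert(max_recv_wr >= 1);

	/* number of entries + 1, rounded up to a power of two */
	rqsize = roundup_pow2(max_recv_wr + 1);

	/* the patch clamps RQT depths below 16 up to 16 */
	if (rqsize < 16)
		rqsize = 16;
	return rqsize;
}
```

The driver then reports rqsize - 1 back in attrs->cap.max_recv_wr, so a
request for 16 receive WRs would yield a depth of 32 and a reported capacity
of 31.
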
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h
new file mode 100644
index 0000000..4d98886
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h
@@ -0,0 +1,363 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_PROVIDER_H__
+#define __IWCH_PROVIDER_H__
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <rdma/ib_verbs.h>
+#include <asm/types.h>
+#include "t3cdev.h"
+#include "iwch.h"
+#include "cxio_wr.h"
+#include "cxio_hal.h"
+
+struct iwch_pd {
+	struct ib_pd ibpd;
+	u32 pdid;
+	struct iwch_dev *rhp;
+};
+
+static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct iwch_pd, ibpd);
+}
+
+struct tpt_attributes {
+	u32 stag;
+	u32 state:1;
+	u32 type:2;
+	u32 rsvd:1;
+	enum tpt_mem_perm perms;
+	u32 remote_invaliate_disable:1;
+	u32 zbva:1;
+	u32 mw_bind_enable:1;
+	u32 page_size:5;
+
+	u32 pdid;
+	u32 qpid;
+	u32 pbl_addr;
+	u32 len;
+	u64 va_fbo;
+	u32 pbl_size;
+};
+
+struct iwch_mr {
+	struct ib_mr ibmr;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+typedef struct iwch_mw iwch_mw_handle;
+
+static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct iwch_mr, ibmr);
+}
+
+struct iwch_mw {
+	struct ib_mw ibmw;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw)
+{
+	return container_of(ibmw, struct iwch_mw, ibmw);
+}
+
+struct iwch_cq {
+	struct ib_cq ibcq;
+	struct iwch_dev *rhp;
+	struct t3_cq cq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+};
+
+static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct iwch_cq, ibcq);
+}
+
+enum IWCH_QP_FLAGS {
+	QP_QUIESCED = 0x01
+};
+
+struct iwch_mpa_attributes {
+	u8 recv_marker_enabled;
+	u8 xmit_marker_enabled;	/* iWARP: enable inbound Read Resp. */
+	u8 crc_enabled;
+	u8 version;	/* 0 or 1 */
+};
+
+struct iwch_qp_attributes {
+	u32 scq;
+	u32 rcq;
+	u32 sq_num_entries;
+	u32 rq_num_entries;
+	u32 sq_max_sges;
+	u32 sq_max_sges_rdma_write;
+	u32 rq_max_sges;
+	u32 state;
+	u8 enable_rdma_read;
+	u8 enable_rdma_write;	/* enable inbound Read Resp. */
+	u8 enable_bind;
+	u8 enable_mmid0_fastreg;	/* Enable STAG0 + Fast-register */
+	u32 max_ord;
+	u32 max_ird;
+	u32 pd;	/* IN */
+	/*
+	 * Next QP state. If the current state is specified, only the
+	 * QP attributes will be modified.
+	 */
+	u32 next_state;
+	char terminate_buffer[52];
+	u32 terminate_msg_len;
+	u8 is_terminate_local;
+	struct iwch_mpa_attributes mpa_attr;	/* IN-OUT */
+	struct iwch_ep *llp_stream_handle;
+	char *stream_msg_buf;	/* Last stream msg. before Idle -> RTS */
+	u32 stream_msg_buf_len;	/* Only on Idle -> RTS */
+};
+
+struct iwch_qp {
+	struct ib_qp ibqp;
+	struct iwch_dev *rhp;
+	struct iwch_ep *ep;
+	struct iwch_qp_attributes attr;
+	struct t3_wq wq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+	enum IWCH_QP_FLAGS flags;
+	struct timer_list timer;
+};
+
+static inline int qp_quiesced(struct iwch_qp *qhp)
+{
+	return (qhp->flags & QP_QUIESCED);
+}
+
+static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct iwch_qp, ibqp);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp);
+void iwch_qp_rem_ref(struct ib_qp *qp);
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn);
+
+struct iwch_ucontext {
+	struct ib_ucontext ibucontext;
+	struct cxio_ucontext uctx;
+	spinlock_t mmap_lock;
+	struct list_head mmaps;
+};
+
+static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c)
+{
+	return container_of(c, struct iwch_ucontext, ibucontext);
+}
+
+struct iwch_mm_entry {
+	struct list_head entry;
+	u64 addr;
+	unsigned len;
+};
+
+static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext,
+						u64 addr, unsigned len)
+{
+	struct list_head *pos, *nxt;
+	struct iwch_mm_entry *mm;
+
+	spin_lock_irq(&ucontext->mmap_lock);
+	list_for_each_safe(pos, nxt, &ucontext->mmaps) {
+		
+		mm = list_entry(pos, struct iwch_mm_entry, entry);
+		if (mm->addr == addr && mm->len == len) {
+			list_del_init(&mm->entry);
+			spin_unlock_irq(&ucontext->mmap_lock);
+			PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, 
+			     mm->len);
+			return mm;
+		}
+	}
+	spin_unlock_irq(&ucontext->mmap_lock);
+	return NULL;
+}
+
+static inline void insert_mmap(struct iwch_ucontext *ucontext, 
+			       struct iwch_mm_entry *mm)
+{
+	spin_lock_irq(&ucontext->mmap_lock);
+	PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len);
+	list_add_tail(&mm->entry, &ucontext->mmaps);
+	spin_unlock_irq(&ucontext->mmap_lock);
+}
+
+enum iwch_qp_attr_mask {
+	IWCH_QP_ATTR_NEXT_STATE = 1 << 0,
+	IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7,
+	IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8,
+	IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9,
+	IWCH_QP_ATTR_MAX_ORD = 1 << 11,
+	IWCH_QP_ATTR_MAX_IRD = 1 << 12,
+	IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22,
+	IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23,
+	IWCH_QP_ATTR_MPA_ATTR = 1 << 24,
+	IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25,
+	IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ |
+				     IWCH_QP_ATTR_ENABLE_RDMA_WRITE |
+				     IWCH_QP_ATTR_MAX_ORD |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+				     IWCH_QP_ATTR_STREAM_MSG_BUFFER |
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE)
+};
+
+int iwch_modify_qp(struct iwch_dev *rhp,
+				struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal);
+
+enum iwch_qp_state {
+	IWCH_QP_STATE_IDLE,
+	IWCH_QP_STATE_RTS,
+	IWCH_QP_STATE_ERROR,
+	IWCH_QP_STATE_TERMINATE,
+	IWCH_QP_STATE_CLOSING,
+	IWCH_QP_STATE_TOT
+};
+
+static inline int iwch_convert_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+	case IB_QPS_INIT:
+		return IWCH_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return IWCH_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return IWCH_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return IWCH_QP_STATE_TERMINATE;
+	case IB_QPS_ERR:
+		return IWCH_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+enum iwch_mem_perms {
+	IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0,
+	IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1,
+	IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2,
+	IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3,
+	IWCH_MEM_ACCESS_ATOMICS = 1 << 4,
+	IWCH_MEM_ACCESS_BINDING = 1 << 5,
+	IWCH_MEM_ACCESS_LOCAL =
+	    (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE),
+	IWCH_MEM_ACCESS_REMOTE =
+	    (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ)
+	    /* cannot go beyond 1 << 31 */
+} __attribute__ ((packed));
+
+static inline u32 iwch_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0)
+	    | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) |
+	    (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) |
+	    IWCH_MEM_ACCESS_LOCAL_READ;
+}
+
+enum iwch_mmid_state {
+	IWCH_STAG_STATE_VALID,
+	IWCH_STAG_STATE_INVALID
+};
+
+enum iwch_qp_query_flags {
+	IWCH_QP_QUERY_CONTEXT_NONE = 0x0,	/* No ctx; Only attrs */
+	IWCH_QP_QUERY_CONTEXT_GET = 0x1,	/* Get ctx + attrs */
+	IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2,	/* Not Supported */
+
+	/* 
+	 * Quiesce QP context; Consumer 
+	 * will NOT replay outstanding WR
+	 */
+	IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4,
+	IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8,
+	IWCH_QP_QUERY_TEST_USERWRITE = 0x32	/* Test special */
+};
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr);
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr);
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind);
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list);
+
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif
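
For readers following the permission handling: iwch_reg_user_mr() in
iwch_provider.c takes the IWCH_MEM_ACCESS_* bits defined in this header and
stores them in attr.perms with the low four bits in reverse order. The patch
does not say why; presumably the TPT entry expects that ordering. A small
user-space sketch of just that step, assuming a hypothetical helper name
reverse_perm_bits() and local copies of the enum values:

```c
#include <assert.h>

/* Values copied from enum iwch_mem_perms; the MEM_* names are local to this sketch. */
enum {
	MEM_LOCAL_READ   = 1 << 0,
	MEM_LOCAL_WRITE  = 1 << 1,
	MEM_REMOTE_READ  = 1 << 2,
	MEM_REMOTE_WRITE = 1 << 3
};

/*
 * Reverse the low four bits of acc, using the same shifts as
 * iwch_reg_user_mr(). Bits above bit 3 are discarded.
 */
static unsigned int reverse_perm_bits(unsigned int acc)
{
	unsigned int perms;

	perms  = (acc & 0x1) << 3;	/* bit 0 -> bit 3 */
	perms |= (acc & 0x2) << 1;	/* bit 1 -> bit 2 */
	perms |= (acc & 0x4) >> 1;	/* bit 2 -> bit 1 */
	perms |= (acc & 0x8) >> 3;	/* bit 3 -> bit 0 */
	return perms;
}
```

Note that IWCH_MEM_ACCESS_ATOMICS and IWCH_MEM_ACCESS_BINDING (bits 4 and 5)
do not survive this step, since only the low nibble is copied.
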
diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h
new file mode 100644
index 0000000..4e4b9c9
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_user.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_USER_H__
+#define __IWCH_USER_H__
+
+#define IWCH_UVERBS_ABI_VERSION	1
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct iwch_create_cq_resp {
+	__u64 physaddr;		
+	__u32 cqid;
+	__u32 size_log2;
+};
+
+struct iwch_create_qp_resp {
+	__u64 physaddr;
+	__u64 doorbell;	
+	__u32 qpid;
+	__u32 size_log2;
+	__u32 sq_size_log2;
+	__u32 rq_size_log2;
+};
+
+struct iwch_reg_user_mr_resp {
+	__u32 pbl_addr;
+};
+
+struct iwch_req_notify_cq {
+	__u32 rptr;
+};
+#endif
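
The layout rule stated at the top of iwch_user.h can be checked from user
space. A minimal sketch, assuming uint64_t/uint32_t as stand-ins for
__u64/__u32 and hypothetical names create_qp_resp and qp_resp_layout_ok();
it mirrors struct iwch_create_qp_resp and verifies that the fields sit at
the offsets a padding-free layout would give:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of struct iwch_create_qp_resp (names are local to this sketch). */
struct create_qp_resp {
	uint64_t physaddr;
	uint64_t doorbell;
	uint32_t qpid;
	uint32_t size_log2;
	uint32_t sq_size_log2;
	uint32_t rq_size_log2;
};

/* Returns 1 if the struct has no implicit padding, 0 otherwise. */
static int qp_resp_layout_ok(void)
{
	return sizeof(struct create_qp_resp) == 32 &&
	       offsetof(struct create_qp_resp, doorbell) == 8 &&
	       offsetof(struct create_qp_resp, qpid) == 16 &&
	       offsetof(struct create_qp_resp, size_log2) == 20 &&
	       offsetof(struct create_qp_resp, sq_size_log2) == 24 &&
	       offsetof(struct create_qp_resp, rq_size_log2) == 28;
}
```

Placing the 64-bit members first and keeping the 32-bit members in pairs is
what keeps the size a multiple of 8 without compiler-inserted padding. A
struct that put a __u32 ahead of a __u64 could be laid out differently by
32-bit and 64-bit ABIs.
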


From swise at opengridcomputing.com  Sun Dec 10 14:34:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:34:45 -0600
Subject: [openib-general] [PATCH  v3 04/13] Connection Manager
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223445.27166.65471.stgit@dell3.ogc.int>


This code implements the iWARP CM provider methods for the Chelsio driver.
The Chelsio ULLD is used to set up and tear down TCP connections, and the
T3 RDMA Core is used to move the connections in and out of RDMA mode.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c | 2059 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  223 ++++
 drivers/infiniband/hw/cxgb3/tcb.h     |  603 ++++++++++
 3 files changed, 2885 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
new file mode 100644
index 0000000..4d5df00
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -0,0 +1,2059 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/skbuff.h>
+#include <linux/timer.h>
+#include <linux/notifier.h>
+
+#include <net/neighbour.h>
+#include <net/netevent.h>
+#include <net/route.h>
+
+#include "tcb.h"
+#include "cxgb3_offload.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+
+char *states[] = {
+	"idle",
+	"listen",
+	"connecting",
+	"mpa_wait_req",
+	"mpa_req_sent",
+	"mpa_req_rcvd",
+	"mpa_rep_sent",
+	"fpdu_mode",
+	"aborting",
+	"closing",
+	"moribund",
+	"dead",
+	NULL,
+};
+
+static int ep_timeout_secs = 10;
+module_param(ep_timeout_secs, int, 0444);
+MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
+				   "in seconds (default=10)");
+
+static int mpa_rev = 1;
+module_param(mpa_rev, int, 0444);
+MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
+		 "1 is spec compliant. (default=1)");
+
+static int markers_enabled = 0;
+module_param(markers_enabled, int, 0444);
+MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
+
+static int crc_enabled = 1;
+module_param(crc_enabled, int, 0444);
+MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
+
+static int rcv_win = 512 * 1024;
+module_param(rcv_win, int, 0444);
+MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)");
+
+static int snd_win = 512 * 1024;
+module_param(snd_win, int, 0444);
+MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)");
+
+static unsigned int nocong = 1;
+module_param(nocong, uint, 0444);
+MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)");
+
+static void process_work(struct work_struct *work);
+static struct workqueue_struct *workq;
+DECLARE_WORK(skb_work, process_work);
+
+static struct sk_buff_head rxq;
+static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS];
+
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp);
+static void ep_timeout(unsigned long arg);
+static void connect_reply_upcall(struct iwch_ep *ep, int status);
+
+static void start_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	if (timer_pending(&ep->timer)) {
+		PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep);
+		del_timer_sync(&ep->timer);
+	} else
+		get_ep(&ep->com);
+	ep->timer.expires = jiffies + ep_timeout_secs * HZ;
+	ep->timer.data = (unsigned long)ep;
+	ep->timer.function = ep_timeout;
+	add_timer(&ep->timer);
+}
+
+static void stop_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	del_timer_sync(&ep->timer);
+	put_ep(&ep->com);
+}
+
+static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
+{
+	struct cpl_tid_release *req;
+
+	skb = get_skb(skb, sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return;
+	req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
+	skb->priority = CPL_PRIORITY_SETUP;
+	tdev->send(tdev, skb);
+	return;
+}
+
+int iwch_quiesce_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+int iwch_resume_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = 0;
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static void set_emss(struct iwch_ep *ep, u16 opt)
+{
+	PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt);
+	ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40;
+	if (G_TCPOPT_TSTAMP(opt))
+		ep->emss -= 12;
+	if (ep->emss < 128)
+		ep->emss = 128;
+	PDBG("emss=%d\n", ep->emss);
+}
+
+static int state_comp_exch(struct iwch_ep_common *epc,
+			   enum iwch_ep_state comp,
+			   enum iwch_ep_state exch)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	ret = (epc->state == comp);
+	if (ret)
+		epc->state = exch;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return ret;
+}
+
+static enum iwch_ep_state state_read(struct iwch_ep_common *epc)
+{
+	unsigned long flags;
+	enum iwch_ep_state state;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	state = epc->state;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return state;
+}
+
+static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], 
+		states[new]);
+	epc->state = new;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return;
+}
+
+static void *alloc_ep(int size, gfp_t gfp)
+{
+	struct iwch_ep_common *epc;
+
+	epc = kmalloc(size, gfp);
+	if (epc) {
+		memset(epc, 0, size);
+		kref_init(&epc->kref);
+		spin_lock_init(&epc->lock);
+		init_waitqueue_head(&epc->waitq);
+	}
+	PDBG("%s alloc ep %p\n", __FUNCTION__, epc);
+	return (void *) epc;
+}
+
+void __free_ep(struct kref *kref) 
+{
+	struct iwch_ep_common *epc;
+	epc = container_of(kref, struct iwch_ep_common, kref);
+	PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]);
+	kfree(epc);
+}
+
+static void release_ep_resources(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	state_set(&ep->com, DEAD);
+	cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, ep->hwtid, NULL);
+	put_ep(&ep->com);
+}
+
+static void process_work(struct work_struct *work)
+{
+	struct sk_buff *skb = NULL;
+	void *ep;
+	struct t3cdev *tdev;
+	int ret;
+
+	while ((skb = skb_dequeue(&rxq))) {
+		ep = *((void **) (skb->cb));
+		tdev = *((struct t3cdev **) (skb->cb + sizeof(void *)));
+		ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep);
+		if (ret & CPL_RET_BUF_DONE)
+			kfree_skb(skb);
+
+		/* 
+		 * ep was referenced in sched(), and is freed here.
+		 */
+		put_ep((struct iwch_ep_common *)ep);
+	}
+}
+
+static int status2errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_NONE:
+		return 0;
+	case CPL_ERR_CONN_RESET:
+		return -ECONNRESET;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Try and reuse skbs already allocated...
+ */
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp)
+{
+	if (skb) {
+		BUG_ON(skb_cloned(skb));
+		skb_trim(skb, 0);
+		skb_get(skb);
+	} else {
+		skb = alloc_skb(len, gfp);
+	}
+	return skb;
+}
+
+static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip, 
+				 __be32 peer_ip, __be16 local_port,
+				 __be16 peer_port, u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = 0,
+		.nl_u = {
+			 .ip4_u = {
+				   .daddr = peer_ip,
+				   .saddr = local_ip,
+				   .tos = tos}
+			 },
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			  .ports = {
+				    .sport = local_port,
+				    .dport = peer_port}
+			  }
+	};
+
+	if (ip_route_output_flow(&rt, &fl, NULL, 0))
+		return NULL;
+	return rt;
+}
+
+static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu)
+		++i;
+	return i;
+}
+
+static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for an active open.   
+ */
+static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	printk(KERN_ERR MOD "ARP failure during connect\n");
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for a CPL_ABORT_REQ.  Change it into a no RST variant
+ * and send it along.
+ */
+static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	struct cpl_abort_req *req = cplhdr(skb);
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb3_ofld_send(dev, skb);
+}
+
+static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
+{
+	struct cpl_close_con_req *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
+{
+	struct cpl_abort_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(skb, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, abort_arp_failure);
+	req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
+	req->cmd = CPL_ABORT_SEND_RST;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_connect(struct iwch_ep *ep)
+{
+	struct cpl_act_open_req *req;
+	struct sk_buff *skb;
+	u32 opt0h, opt0l, opt2;
+	unsigned int mtu_idx;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+	skb->priority = CPL_PRIORITY_SETUP;
+	set_arp_failure_handler(skb, act_open_req_arp_failure);
+
+	req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->peer_port = ep->com.remote_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_ip = ep->com.remote_addr.sin_addr.s_addr;
+	req->opt0h = htonl(opt0h);
+	req->opt0l = htonl(opt0l);
+	req->params = 0;
+	req->opt2 = htonl(opt2);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+
+	PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen);
+
+	BUG_ON(skb_cloned(skb));
+
+	mpalen = sizeof(*mpa) + ep->plen;
+	if (skb->data + mpalen + sizeof(*req) > skb->end) {
+		kfree_skb(skb);
+		skb = alloc_skb(mpalen + sizeof(*req), GFP_KERNEL);
+		if (!skb) {
+			connect_reply_upcall(ep, -ENOMEM);
+			return;
+		}
+	}
+	skb_trim(skb, 0);
+	skb_reserve(skb, sizeof(*req));
+	skb_put(skb, mpalen);
+	skb->priority = CPL_PRIORITY_DATA;
+	mpa = (struct mpa_message *) skb->data;
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key));
+	mpa->flags = (crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->private_data_size = htons(ep->plen);
+	mpa->revision = mpa_rev;
+
+	if (ep->plen)
+		memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen);
+
+	/* 
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	start_ep_timer(ep);
+	state_set(&ep->com, MPA_REQ_SENT);
+	return;
+}
+
+static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = MPA_REJECT;
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/* 
+	 * Reference the mpa skb again.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(mpalen);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/* 
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.  
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	ep->mpa_skb = skb;
+	state_set(&ep->com, MPA_REP_SENT);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_establish *req = cplhdr(skb);
+	unsigned int tid = GET_TID(req);
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid);
+
+	dst_confirm(ep->dst);
+
+	/* setup the hwtid for this connection */
+	ep->hwtid = tid;
+	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
+
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	/* dealloc the atid */
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+
+	/* start MPA negotiation */
+	send_mpa_req(ep, skb);
+
+	return 0;
+}
+
+static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	state_set(&ep->com, ABORTING);
+	send_abort(ep, skb, GFP_KERNEL);
+}
+
+static void close_complete_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	if (ep->com.cm_id) {
+		PDBG("close complete delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void peer_close_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_DISCONNECT;
+	if (ep->com.cm_id) {
+		PDBG("peer close delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static void peer_abort_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	event.status = -ECONNRESET;
+	if (ep->com.cm_id) {
+		PDBG("abort delivered ep %p cm_id %p tid %d\n", ep,
+		     ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_reply_upcall(struct iwch_ep *ep, int status)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REPLY;
+	event.status = status;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+
+	if ((status == 0) || (status == -ECONNREFUSED)) {
+		event.private_data_len = ep->plen;
+		event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	}
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, 
+		     ep->hwtid, status);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+	if (status < 0 && ep->com.cm_id) {
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_request_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REQUEST;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+	event.private_data_len = ep->plen;
+	event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	event.provider_data = ep;
+	if (state_read(&ep->parent_ep->com) != DEAD)
+		ep->parent_ep->com.cm_id->event_handler(
+						ep->parent_ep->com.cm_id,
+						&event);
+	put_ep(&ep->parent_ep->com);
+	ep->parent_ep = NULL;
+}
+
+static void established_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_ESTABLISHED;
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static int update_rx_credits(struct iwch_ep *ep, u32 credits)
+{
+	struct cpl_rx_data_ack *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n");
+		return 0;
+	}
+
+	req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
+	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
+	skb->priority = CPL_PRIORITY_ACK;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return credits;
+}
+
+static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	int err;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/* 
+ 	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted 
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+	state_set(&ep->com, FPDU_MODE);
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/* 
+	 * if we don't even have the mpa message, then bail. 
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* Validate MPA header. */
+	if (mpa->revision != mpa_rev) {
+		err = -EPROTO;
+		goto err;
+	}
+	if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	/*
+	 * Fail if plen does not account for the packet size.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	if (mpa->flags & MPA_REJECT) {
+		err = -ECONNREFUSED;
+		goto err;
+	}
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start reply message including private data. And
+	 * the MPA header is valid.
+	 */
+
+	ep->mpa_attr.crc_enabled = ((mpa->flags & MPA_CRC) | crc_enabled) ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+	    IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR |
+	    IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD;
+
+	/* bind QP and TID with INIT_WR */
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+	if (!err)
+		goto out;
+err:
+	abort_connection(ep, skb);
+out:
+	connect_reply_upcall(ep, err);
+	return;
+}
+
+static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/* 
+ 	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted 
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+
+	/*
+	 * Copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/* 
+	 * If we don't even have the mpa message, then bail. 
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* 
+	 * Validate MPA Header.
+	 */
+	if (mpa->revision != mpa_rev) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	/*
+	 * Fail if plen does not account for the packet size.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		abort_connection(ep, skb);
+		return;
+	}
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start request message including private data.
+	 */
+	ep->mpa_attr.crc_enabled = ((mpa->flags & MPA_CRC) | crc_enabled) ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	state_set(&ep->com, MPA_REQ_RCVD);
+
+	/* drive upcall */
+	connect_request_upcall(ep);
+	return;
+}
+
+static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_rx_data *hdr = cplhdr(skb);
+	unsigned int dlen = ntohs(hdr->len);
+
+	PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen);
+
+	skb_pull(skb, sizeof(*hdr));
+	skb_trim(skb, dlen);
+
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_SENT:
+		process_mpa_reply(ep, skb);
+		break;
+	case MPA_REQ_WAIT:
+		process_mpa_request(ep, skb);
+		break;
+	case MPA_REP_SENT:
+		break;
+	default:
+		printk(KERN_ERR MOD "%s Unexpected streaming data."
+		       " ep %p state %d tid %d\n",
+		       __FUNCTION__, ep, state_read(&ep->com), ep->hwtid);
+
+		/* 
+	 	 * The ep will timeout and inform the ULP of the failure.
+		 * See ep_timeout().
+	 	 */
+		break;
+	}
+
+	/* update RX credits */
+	update_rx_credits(ep, dlen);
+
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Upcall from the adapter indicating data has been transmitted.
+ * For us it's just the single MPA request or reply.  We can now free
+ * the skb holding the mpa message.
+ */
+static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_wr_ack *hdr = cplhdr(skb);
+	unsigned int credits = ntohs(hdr->credits);
+	enum iwch_qp_attr_mask  mask;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+
+	if (credits == 0)
+		return CPL_RET_BUF_DONE;
+	BUG_ON(credits != 1);
+	BUG_ON(ep->mpa_skb == NULL);
+	kfree_skb(ep->mpa_skb);
+	ep->mpa_skb = NULL;
+	dst_confirm(ep->dst);
+	if (state_read(&ep->com) == MPA_REP_SENT) {
+		struct iwch_qp_attributes attrs;
+
+		/* bind QP to EP and move to RTS */
+		attrs.mpa_attr = ep->mpa_attr;
+		attrs.max_ird = ep->ird;
+		attrs.max_ord = ep->ord;
+		attrs.llp_stream_handle = ep;
+		attrs.next_state = IWCH_QP_STATE_RTS;
+
+		/* bind QP and TID with INIT_WR */
+		mask = IWCH_QP_ATTR_NEXT_STATE |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE | 
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_MAX_ORD;
+
+		ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, mask, &attrs, 1);
+
+		if (!ep->com.rpl_err) {
+			state_set(&ep->com, FPDU_MODE);
+			established_upcall(ep);
+		}
+
+		ep->com.rpl_done = 1;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	close_complete_upcall(ep);
+	release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status,
+	     status2errno(rpl->status));
+	connect_reply_upcall(ep, status2errno(rpl->status));
+	state_set(&ep->com, DEAD);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, GET_TID(rpl), NULL);
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	put_ep(&ep->com);
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_start(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_pass_open_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+
+	req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_port = 0;
+	req->peer_ip = 0;
+	req->peer_netmask = 0;
+	req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS);
+	req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10));
+	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
+
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_pass_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep,
+	     rpl->status, status2errno(rpl->status));
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_stop(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_close_listserv_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
+			     void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_close_listserv_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+	return CPL_RET_BUF_DONE;
+}
+
+static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
+{
+	struct cpl_pass_accept_rpl *rpl;
+	unsigned int mtu_idx;
+	u32 opt0h, opt0l, opt2;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(*rpl));
+	skb_get(skb);
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+
+	rpl = cplhdr(skb);
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid));
+	rpl->peer_ip = peer_ip;
+	rpl->opt0h = htonl(opt0h);
+	rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT);
+	rpl->opt2 = htonl(opt2);
+	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
+	skb->priority = CPL_PRIORITY_SETUP;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+
+	return;
+}
+
+static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
+		      struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid,
+	     peer_ip);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(struct cpl_tid_release));
+	skb_get(skb);
+
+	if (tdev->type == T3B)
+		release_tid(tdev, hwtid, skb);
+	else {
+		struct cpl_pass_accept_rpl *rpl;
+
+		rpl = cplhdr(skb);
+		skb->priority = CPL_PRIORITY_SETUP;
+		rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+		OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL,
+						      hwtid));
+		rpl->peer_ip = peer_ip;
+		rpl->opt0h = htonl(F_TCAM_BYPASS);
+		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
+		rpl->opt2 = 0;
+		rpl->rsvd = rpl->opt2;
+		tdev->send(tdev, skb);
+	}
+}
+
+static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *child_ep, *parent_ep = ctx;
+	struct cpl_pass_accept_req *req = cplhdr(skb);
+	unsigned int hwtid = GET_TID(req);
+	struct dst_entry *dst;
+	struct l2t_entry *l2t;
+	struct rtable *rt;
+	struct iff_mac tim;
+
+	PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid);
+
+	if (state_read(&parent_ep->com) != LISTEN) {
+		printk(KERN_ERR MOD "%s - listening ep not in LISTEN\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+
+	/*
+	 * Find the netdev for this connection request.
+	 */
+	tim.mac_addr = req->dst_mac;
+	tim.vlan_tag = ntohs(req->vlan_tag);
+	if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) {
+		printk(KERN_ERR MOD
+			"%s bad dst mac %02x %02x %02x %02x %02x %02x\n",
+			__FUNCTION__,
+			req->dst_mac[0],
+			req->dst_mac[1],
+			req->dst_mac[2],
+			req->dst_mac[3],
+			req->dst_mac[4],
+			req->dst_mac[5]);
+		goto reject;
+	}
+
+	/* Find output route */
+	rt = find_route(tdev,
+			req->local_ip,
+			req->peer_ip,
+			req->local_port,
+			req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid)));
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - failed to find dst entry!\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+	dst = &rt->u.dst;
+	l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev->if_port);
+	if (!l2t) {
+		printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n",
+		       __FUNCTION__);
+		dst_release(dst);
+		goto reject;
+	}
+	child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL);
+	if (!child_ep) {
+		printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n",
+		       __FUNCTION__);
+		l2t_release(L2DATA(tdev), l2t);
+		dst_release(dst);
+		goto reject;
+	}
+	state_set(&child_ep->com, CONNECTING);
+	child_ep->com.tdev = tdev;
+	child_ep->com.cm_id = NULL;
+	child_ep->com.local_addr.sin_family = PF_INET;
+	child_ep->com.local_addr.sin_port = req->local_port;
+	child_ep->com.local_addr.sin_addr.s_addr = req->local_ip;
+	child_ep->com.remote_addr.sin_family = PF_INET;
+	child_ep->com.remote_addr.sin_port = req->peer_port;
+	child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip;
+	get_ep(&parent_ep->com);
+	child_ep->parent_ep = parent_ep;
+	child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid));
+	child_ep->l2t = l2t;
+	child_ep->dst = dst;
+	child_ep->hwtid = hwtid;
+	init_timer(&child_ep->timer);
+	cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid);
+	accept_cr(child_ep, req->peer_ip, skb);
+	goto out;
+reject:
+	reject_cr(tdev, hwtid, req->peer_ip, skb);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_pass_establish *req = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	dst_confirm(ep->dst);
+	state_set(&ep->com, MPA_REQ_WAIT);
+	start_ep_timer(ep);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int abort = 0;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	dst_confirm(ep->dst);
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_WAIT:
+		state_set(&ep->com, CLOSING);
+		break;
+	case MPA_REQ_SENT:
+		state_set(&ep->com, CLOSING);
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REQ_RCVD:
+
+		/*
+		 * We're gonna mark this puppy CLOSING, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		state_set(&ep->com, CLOSING);
+		get_ep(&ep->com);
+		break;
+	case MPA_REP_SENT:
+		state_set(&ep->com, CLOSING);
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case FPDU_MODE:
+		state_set(&ep->com, CLOSING);
+		peer_close_upcall(ep);
+		attrs.next_state = IWCH_QP_STATE_CLOSING;
+		ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		if (ret) {
+			printk(KERN_ERR MOD "%s - qp <- closing err!\n",
+			       __FUNCTION__);
+			abort = 1;
+		}
+		break;
+	case ABORTING:
+		goto out;
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		goto out;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+				       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				       &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		goto out;
+	case DEAD:
+		goto out;
+	default:
+		BUG();
+	}
+	iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+	       status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_abort_req_rss *req = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+	struct cpl_abort_rpl *rpl;
+	struct sk_buff *rpl_skb;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int state;
+
+	if (is_neg_adv_abort(req->status)) {
+		PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep,
+		     ep->hwtid);
+		t3_l2t_send_event(ep->com.tdev, ep->l2t);
+		return CPL_RET_BUF_DONE;
+	}
+
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state);
+	switch (state) {
+	case CONNECTING:
+		break;
+	case MPA_REQ_WAIT:
+		break;
+	case MPA_REQ_SENT:
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REP_SENT:
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case MPA_REQ_RCVD:
+
+		/*
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		get_ep(&ep->com);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);	/* fall through */
+	case FPDU_MODE:
+	case CLOSING:
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+			if (ret)
+				printk(KERN_ERR MOD
+				       "%s - qp <- error failed!\n",
+				       __FUNCTION__);
+		}
+		peer_abort_upcall(ep);
+		break;
+	case ABORTING:
+		break;
+	case DEAD:
+		PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__);
+		return CPL_RET_BUF_DONE;
+	default:
+		BUG();
+		break;
+	}
+	dst_confirm(ep->dst);
+
+	rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL);
+	if (!rpl_skb) {
+		printk(KERN_ERR MOD "%s - cannot allocate skb!\n",
+		       __FUNCTION__);
+		dst_release(ep->dst);
+		l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+		put_ep(&ep->com);
+		return CPL_RET_BUF_DONE;
+	}
+	rpl_skb->priority = CPL_PRIORITY_DATA;
+	rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl));
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL));
+	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
+	rpl->cmd = CPL_ABORT_NO_RST;
+	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	if (state != ABORTING)
+		release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(!ep);
+
+	/* The cm_id may be null if we failed to connect */
+	switch (state_read(&ep->com)) {
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if ((ep->com.cm_id) && (ep->com.qp)) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+					     ep->com.qp,
+					     IWCH_QP_ATTR_NEXT_STATE,
+					     &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		break;
+	case DEAD:
+	default:
+		BUG();
+		break;
+	}
+
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * T3A does 3 things when a TERM is received:
+ * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet
+ * 2) generate an async event on the QP with the TERMINATE opcode
+ * 3) post a TERMINATE opcode cqe into the associated CQ.
+ *
+ * For (1), we save the message in the qp for the consumer to retrieve later.
+ * For (2), we move the QP into TERMINATE, post a QP event and disconnect.
+ * For (3), we toss the CQE in cxio_poll_cq().
+ *
+ * terminate() handles case (1)...
+ */
+static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb_pull(skb, sizeof(struct cpl_rdma_terminate));
+	PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len);
+	memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len);
+	ep->com.qp->attr.terminate_msg_len = skb->len;
+	ep->com.qp->attr.is_terminate_local = 0;
+	return CPL_RET_BUF_DONE;
+}
+
+static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_rdma_ec_status *rep = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid,
+	     rep->status);
+	if (rep->status) {
+		struct iwch_qp_attributes attrs;
+
+		printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n",
+		       __FUNCTION__, ep->hwtid);
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(ep->com.qp->rhp,
+			       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+			       &attrs, 1);
+		abort_connection(ep, NULL);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static void ep_timeout(unsigned long arg)
+{
+	struct iwch_ep *ep = (struct iwch_ep *)arg;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) {
+		struct sk_buff *skb;
+
+		connect_reply_upcall(ep, -ETIMEDOUT);
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) {
+		struct sk_buff *skb;
+
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) {
+		struct sk_buff *skb;
+
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		}
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	put_ep(&ep->com);
+}
+
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+	struct iwch_ep *ep = to_ep(cm_id);
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	state_set(&ep->com, CLOSING);
+	if (mpa_rev == 0)
+		abort_connection(ep, NULL);
+	else {
+		err = send_mpa_reject(ep, pdata, pdata_len);
+		err = send_halfclose(ep, GFP_KERNEL);
+	}
+	return 0;
+}
+
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	struct iwch_ep *ep = to_ep(cm_id);
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(!qp);
+
+	if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) ||
+	    (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) {
+		abort_connection(ep, NULL);
+		return -EINVAL;
+	}
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = qp;
+
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord);
+	get_ep(&ep->com);
+	err = send_mpa_reply(ep, conn_param->private_data,
+			     conn_param->private_data_len);
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+		put_ep(&ep->com);
+		return err;
+	}
+
+	/* bind QP to EP and move to RTS */
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	/* bind QP and TID with INIT_WR */
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+			     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+			     IWCH_QP_ATTR_MPA_ATTR |
+			     IWCH_QP_ATTR_MAX_IRD |
+			     IWCH_QP_ATTR_MAX_ORD;
+
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+	} else {
+		state_set(&ep->com, FPDU_MODE);
+		established_upcall(ep);
+	}
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_ep *ep;
+	struct rtable *rt;
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto out;
+	}
+	init_timer(&ep->timer);
+	ep->plen = conn_param->private_data_len;
+	if (ep->plen)
+		memcpy(ep->mpa_pkt + sizeof(struct mpa_message),
+		       conn_param->private_data, ep->plen);
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	ep->com.tdev = h->rdev.t3cdev_p;
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = get_qhp(h, conn_param->qpn);
+	BUG_ON(!ep->com.qp);
+	PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn,
+	     ep->com.qp, cm_id);
+
+	/*
+	 * Allocate an active TID to initiate a TCP connection.
+	 */
+	ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->atid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	/* find a route */
+	rt = find_route(h->rdev.t3cdev_p,
+			cm_id->local_addr.sin_addr.s_addr,
+			cm_id->remote_addr.sin_addr.s_addr,
+			cm_id->local_addr.sin_port,
+			cm_id->remote_addr.sin_port, IPTOS_LOWDELAY);
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__);
+		err = -EHOSTUNREACH;
+		goto fail3;
+	}
+	ep->dst = &rt->u.dst;
+
+	/* get a l2t entry */
+	ep->l2t = t3_l2t_get(ep->com.tdev,
+			     ep->dst->neighbour,
+			     ep->dst->neighbour->dev->if_port);
+	if (!ep->l2t) {
+		printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail4;
+	}
+
+	state_set(&ep->com, CONNECTING);
+	ep->tos = IPTOS_LOWDELAY;
+	ep->com.local_addr = cm_id->local_addr;
+	ep->com.remote_addr = cm_id->remote_addr;
+
+	/* send connect request to rnic */
+	err = send_connect(ep);
+	if (!err)
+		goto out;
+
+	l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t);
+fail4:
+	dst_release(ep->dst);
+fail3:
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+fail2:
+	put_ep(&ep->com);
+out:
+	return err;
+}
+
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_listen_ep *ep;
+
+
+	might_sleep();
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail1;
+	}
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.tdev = h->rdev.t3cdev_p;
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->backlog = backlog;
+	ep->com.local_addr = cm_id->local_addr;
+
+	/*
+	 * Allocate a server TID.
+	 */
+	ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->stid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	state_set(&ep->com, LISTEN);
+	err = listen_start(ep);
+	if (err)
+		goto fail3;
+
+	/* wait for pass_open_rpl */
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	err = ep->com.rpl_err;
+	if (!err) {
+		cm_id->provider_data = ep;
+		goto out;
+	}
+fail3:
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+fail2:
+	put_ep(&ep->com);
+fail1:
+out:
+	return err;
+}
+
+int iwch_destroy_listen(struct iw_cm_id *cm_id)
+{
+	int err;
+	struct iwch_listen_ep *ep = to_listen_ep(cm_id);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	might_sleep();
+	state_set(&ep->com, DEAD);
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	err = listen_stop(ep);
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+	err = ep->com.rpl_err;
+	cm_id->rem_ref(cm_id);
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
+{
+	int ret = 0;
+	int state;
+
+
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep,
+	     states[state], abrupt);
+	if (state == DEAD) {
+		PDBG("%s already dead ep %p\n", __FUNCTION__, ep);
+		return 0;
+	}
+	if (abrupt) {
+		if (state != ABORTING) {
+			state_set(&ep->com, ABORTING);
+			ret = send_abort(ep, NULL, gfp);
+		}
+	} else {
+
+		if (state != CLOSING)
+			state_set(&ep->com, CLOSING);
+		else {
+			start_ep_timer(ep);
+			state_set(&ep->com, MORIBUND);
+		}
+
+		ret = send_halfclose(ep, gfp);
+	}
+	return ret;
+}
+
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new,
+		     struct l2t_entry *l2t)
+{
+	struct iwch_ep *ep = ctx;
+
+	if (ep->dst != old)
+		return 0;
+
+	PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new,
+	     l2t);
+	dst_hold(new);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	ep->l2t = l2t;
+	dst_release(old);
+	ep->dst = new;
+	return 1;
+}
+
+/*
+ * All the CM events are handled on a work queue to have a safe context.
+ */
+static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep_common *epc = ctx;
+
+	get_ep(epc);
+
+	/*
+	 * Save ctx and tdev in the skb->cb area.
+	 */
+	*((void **) skb->cb) = ctx;
+	*((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev;
+
+	/*
+	 * Queue the skb and schedule the worker thread.
+	 */
+	skb_queue_tail(&rxq, skb);
+	queue_work(workq, &skb_work);
+	return 0;
+}
+
+int __init iwch_cm_init(void)
+{
+	skb_queue_head_init(&rxq);
+
+	workq = create_singlethread_workqueue("iw_cxgb3");
+	if (!workq)
+		return -ENOMEM;
+
+	/*
+	 * All upcalls from the T3 Core go to sched() to
+	 * schedule the processing on a work queue.
+	 */
+	t3c_handlers[CPL_ACT_ESTABLISH] = sched;
+	t3c_handlers[CPL_ACT_OPEN_RPL] = sched;
+	t3c_handlers[CPL_RX_DATA] = sched;
+	t3c_handlers[CPL_TX_DMA_ACK] = sched;
+	t3c_handlers[CPL_ABORT_RPL_RSS] = sched;
+	t3c_handlers[CPL_ABORT_RPL] = sched;
+	t3c_handlers[CPL_PASS_OPEN_RPL] = sched;
+	t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched;
+	t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched;
+	t3c_handlers[CPL_PASS_ESTABLISH] = sched;
+	t3c_handlers[CPL_PEER_CLOSE] = sched;
+	t3c_handlers[CPL_CLOSE_CON_RPL] = sched;
+	t3c_handlers[CPL_ABORT_REQ_RSS] = sched;
+	t3c_handlers[CPL_RDMA_TERMINATE] = sched;
+	t3c_handlers[CPL_RDMA_EC_STATUS] = sched;
+
+	/*
+	 * These are the real handlers that are called from a
+	 * work queue.
+	 */
+	work_handlers[CPL_ACT_ESTABLISH] = act_establish;
+	work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl;
+	work_handlers[CPL_RX_DATA] = rx_data;
+	work_handlers[CPL_TX_DMA_ACK] = tx_ack;
+	work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl;
+	work_handlers[CPL_ABORT_RPL] = abort_rpl;
+	work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl;
+	work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl;
+	work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req;
+	work_handlers[CPL_PASS_ESTABLISH] = pass_establish;
+	work_handlers[CPL_PEER_CLOSE] = peer_close;
+	work_handlers[CPL_ABORT_REQ_RSS] = peer_abort;
+	work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl;
+	work_handlers[CPL_RDMA_TERMINATE] = terminate;
+	work_handlers[CPL_RDMA_EC_STATUS] = ec_status;
+	return 0;
+}
+
+void __exit iwch_cm_term(void)
+{
+	flush_workqueue(workq);
+	destroy_workqueue(workq);
+}
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
new file mode 100644
index 0000000..893f9d0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _IWCH_CM_H_
+#define _IWCH_CM_H_
+
+#include <linux/inet.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/iw_cm.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+
+#define MPA_KEY_REQ "MPA ID Req Frame"
+#define MPA_KEY_REP "MPA ID Rep Frame"
+
+#define MPA_MAX_PRIVATE_DATA 	256
+#define MPA_REV 		0	/* XXX - amso1100 uses rev 0 ! */
+#define MPA_REJECT 		0x20
+#define MPA_CRC			0x40
+#define MPA_MARKERS		0x80
+#define MPA_FLAGS_MASK		0xE0
+
+#define put_ep(ep) do { \
+	PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__,  \
+	     (ep), atomic_read(&((ep)->kref.refcount))); \
+	kref_put(&((ep)->kref), __free_ep); \
+} while (0)
+
+#define get_ep(ep) do { \
+	PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \
+	     (ep), atomic_read(&((ep)->kref.refcount))); \
+	kref_get(&((ep)->kref));  \
+} while (0)
+
+struct mpa_message {
+	u8 key[16];
+	u8 flags;
+	u8 revision;
+	__be16 private_data_size;
+	u8 private_data[0];
+};
+
+struct terminate_message {
+	u8 layer_etype;
+	u8 ecode;
+	__be16 hdrct_rsvd;
+	u8 len_hdrs[0];
+};
+
+#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28)
+
+enum iwch_layers_types {
+	LAYER_RDMAP 		= 0x00,
+	LAYER_DDP		= 0x10,
+	LAYER_MPA		= 0x20,
+	RDMAP_LOCAL_CATA	= 0x00,
+	RDMAP_REMOTE_PROT	= 0x01,
+	RDMAP_REMOTE_OP		= 0x02,
+	DDP_LOCAL_CATA		= 0x00,
+	DDP_TAGGED_ERR		= 0x01,
+	DDP_UNTAGGED_ERR	= 0x02,
+	DDP_LLP			= 0x03
+};
+
+enum iwch_rdma_ecodes {
+	RDMAP_INV_STAG		= 0x00,
+	RDMAP_BASE_BOUNDS	= 0x01,
+	RDMAP_ACC_VIOL		= 0x02,
+	RDMAP_STAG_NOT_ASSOC	= 0x03,
+	RDMAP_TO_WRAP		= 0x04,
+	RDMAP_INV_VERS		= 0x05,
+	RDMAP_INV_OPCODE	= 0x06,
+	RDMAP_STREAM_CATA	= 0x07,
+	RDMAP_GLOBAL_CATA	= 0x08,
+	RDMAP_CANT_INV_STAG	= 0x09,
+	RDMAP_UNSPECIFIED	= 0xff
+};
+
+enum iwch_ddp_ecodes {
+	DDPT_INV_STAG		= 0x00,
+	DDPT_BASE_BOUNDS	= 0x01,
+	DDPT_STAG_NOT_ASSOC	= 0x02,
+	DDPT_TO_WRAP		= 0x03,
+	DDPT_INV_VERS		= 0x04,
+	DDPU_INV_QN		= 0x01,
+	DDPU_INV_MSN_NOBUF	= 0x02,
+	DDPU_INV_MSN_RANGE	= 0x03,
+	DDPU_INV_MO		= 0x04,
+	DDPU_MSG_TOOBIG		= 0x05,
+	DDPU_INV_VERS		= 0x06
+};
+
+enum iwch_mpa_ecodes {
+	MPA_CRC_ERR		= 0x02,
+	MPA_MARKER_ERR		= 0x03
+};
+
+enum iwch_ep_state {
+	IDLE = 0,
+	LISTEN,
+	CONNECTING,
+	MPA_REQ_WAIT,
+	MPA_REQ_SENT,
+	MPA_REQ_RCVD,
+	MPA_REP_SENT,
+	FPDU_MODE,
+	ABORTING,
+	CLOSING,
+	MORIBUND,
+	DEAD,
+};
+
+struct iwch_ep_common {
+	struct iw_cm_id *cm_id;
+	struct iwch_qp *qp;
+	struct t3cdev *tdev;
+	enum iwch_ep_state state;
+	struct kref kref;
+	spinlock_t lock;
+	struct sockaddr_in local_addr;
+	struct sockaddr_in remote_addr;
+	wait_queue_head_t waitq;
+	int rpl_done;
+	int rpl_err;
+};
+
+struct iwch_listen_ep {
+	struct iwch_ep_common com;
+	unsigned int stid;
+	int backlog;
+};
+
+struct iwch_ep {
+	struct iwch_ep_common com;
+	struct iwch_ep *parent_ep;
+	struct timer_list timer;
+	unsigned int atid;
+	u32 hwtid;
+	u32 snd_seq;
+	struct l2t_entry *l2t;
+	struct dst_entry *dst;
+	struct sk_buff *mpa_skb;
+	struct iwch_mpa_attributes mpa_attr;
+	unsigned int mpa_pkt_len;
+	u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA];
+	u8 tos;
+	u16 emss;
+	u16 plen;
+	u32 ird;
+	u32 ord;
+};
+
+static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_ep *)cm_id->provider_data;
+}
+
+static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_listen_ep *)cm_id->provider_data;
+}
+
+static inline int compute_wscale(int win)
+{
+	int wscale = 0;
+
+	while (wscale < 14 && (65535<<wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+/* CM prototypes */
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog);
+int iwch_destroy_listen(struct iw_cm_id *cm_id);
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp);
+int iwch_quiesce_tid(struct iwch_ep *ep);
+int iwch_resume_tid(struct iwch_ep *ep);
+void __free_ep(struct kref *kref);
+void iwch_rearp(struct iwch_ep *ep);
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t);
+
+int __init iwch_cm_init(void);
+void __exit iwch_cm_term(void);
+
+#endif				/* _IWCH_CM_H_ */
diff --git a/drivers/infiniband/hw/cxgb3/tcb.h b/drivers/infiniband/hw/cxgb3/tcb.h
new file mode 100644
index 0000000..f287a7c
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/tcb.h
@@ -0,0 +1,603 @@
+/* This file is automatically generated --- do not edit */
+
+#ifndef _TCB_DEFS_H
+#define _TCB_DEFS_H
+
+#define W_TCB_T_STATE    0
+#define S_TCB_T_STATE    0
+#define M_TCB_T_STATE    0xfULL
+#define V_TCB_T_STATE(x) ((x) << S_TCB_T_STATE)
+
+#define W_TCB_TIMER    0
+#define S_TCB_TIMER    4
+#define M_TCB_TIMER    0x1ULL
+#define V_TCB_TIMER(x) ((x) << S_TCB_TIMER)
+
+#define W_TCB_DACK_TIMER    0
+#define S_TCB_DACK_TIMER    5
+#define M_TCB_DACK_TIMER    0x1ULL
+#define V_TCB_DACK_TIMER(x) ((x) << S_TCB_DACK_TIMER)
+
+#define W_TCB_DEL_FLAG    0
+#define S_TCB_DEL_FLAG    6
+#define M_TCB_DEL_FLAG    0x1ULL
+#define V_TCB_DEL_FLAG(x) ((x) << S_TCB_DEL_FLAG)
+
+#define W_TCB_L2T_IX    0
+#define S_TCB_L2T_IX    7
+#define M_TCB_L2T_IX    0x7ffULL
+#define V_TCB_L2T_IX(x) ((x) << S_TCB_L2T_IX)
+
+#define W_TCB_SMAC_SEL    0
+#define S_TCB_SMAC_SEL    18
+#define M_TCB_SMAC_SEL    0x3ULL
+#define V_TCB_SMAC_SEL(x) ((x) << S_TCB_SMAC_SEL)
+
+#define W_TCB_TOS    0
+#define S_TCB_TOS    20
+#define M_TCB_TOS    0x3fULL
+#define V_TCB_TOS(x) ((x) << S_TCB_TOS)
+
+#define W_TCB_MAX_RT    0
+#define S_TCB_MAX_RT    26
+#define M_TCB_MAX_RT    0xfULL
+#define V_TCB_MAX_RT(x) ((x) << S_TCB_MAX_RT)
+
+#define W_TCB_T_RXTSHIFT    0
+#define S_TCB_T_RXTSHIFT    30
+#define M_TCB_T_RXTSHIFT    0xfULL
+#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT)
+
+#define W_TCB_T_DUPACKS    1
+#define S_TCB_T_DUPACKS    2
+#define M_TCB_T_DUPACKS    0xfULL
+#define V_TCB_T_DUPACKS(x) ((x) << S_TCB_T_DUPACKS)
+
+#define W_TCB_T_MAXSEG    1
+#define S_TCB_T_MAXSEG    6
+#define M_TCB_T_MAXSEG    0xfULL
+#define V_TCB_T_MAXSEG(x) ((x) << S_TCB_T_MAXSEG)
+
+#define W_TCB_T_FLAGS1    1
+#define S_TCB_T_FLAGS1    10
+#define M_TCB_T_FLAGS1    0xffffffffULL
+#define V_TCB_T_FLAGS1(x) ((x) << S_TCB_T_FLAGS1)
+
+#define W_TCB_T_MIGRATION    1
+#define S_TCB_T_MIGRATION    20
+#define M_TCB_T_MIGRATION    0x1ULL
+#define V_TCB_T_MIGRATION(x) ((x) << S_TCB_T_MIGRATION)
+
+#define W_TCB_T_FLAGS2    2
+#define S_TCB_T_FLAGS2    10
+#define M_TCB_T_FLAGS2    0x7fULL
+#define V_TCB_T_FLAGS2(x) ((x) << S_TCB_T_FLAGS2)
+
+#define W_TCB_SND_SCALE    2
+#define S_TCB_SND_SCALE    17
+#define M_TCB_SND_SCALE    0xfULL
+#define V_TCB_SND_SCALE(x) ((x) << S_TCB_SND_SCALE)
+
+#define W_TCB_RCV_SCALE    2
+#define S_TCB_RCV_SCALE    21
+#define M_TCB_RCV_SCALE    0xfULL
+#define V_TCB_RCV_SCALE(x) ((x) << S_TCB_RCV_SCALE)
+
+#define W_TCB_SND_UNA_RAW    2
+#define S_TCB_SND_UNA_RAW    25
+#define M_TCB_SND_UNA_RAW    0x7ffffffULL
+#define V_TCB_SND_UNA_RAW(x) ((x) << S_TCB_SND_UNA_RAW)
+
+#define W_TCB_SND_NXT_RAW    3
+#define S_TCB_SND_NXT_RAW    20
+#define M_TCB_SND_NXT_RAW    0x7ffffffULL
+#define V_TCB_SND_NXT_RAW(x) ((x) << S_TCB_SND_NXT_RAW)
+
+#define W_TCB_RCV_NXT    4
+#define S_TCB_RCV_NXT    15
+#define M_TCB_RCV_NXT    0xffffffffULL
+#define V_TCB_RCV_NXT(x) ((x) << S_TCB_RCV_NXT)
+
+#define W_TCB_RCV_ADV    5
+#define S_TCB_RCV_ADV    15
+#define M_TCB_RCV_ADV    0xffffULL
+#define V_TCB_RCV_ADV(x) ((x) << S_TCB_RCV_ADV)
+
+#define W_TCB_SND_MAX_RAW    5
+#define S_TCB_SND_MAX_RAW    31
+#define M_TCB_SND_MAX_RAW    0x7ffffffULL
+#define V_TCB_SND_MAX_RAW(x) ((x) << S_TCB_SND_MAX_RAW)
+
+#define W_TCB_SND_CWND    6
+#define S_TCB_SND_CWND    26
+#define M_TCB_SND_CWND    0x7ffffffULL
+#define V_TCB_SND_CWND(x) ((x) << S_TCB_SND_CWND)
+
+#define W_TCB_SND_SSTHRESH    7
+#define S_TCB_SND_SSTHRESH    21
+#define M_TCB_SND_SSTHRESH    0x7ffffffULL
+#define V_TCB_SND_SSTHRESH(x) ((x) << S_TCB_SND_SSTHRESH)
+
+#define W_TCB_T_RTT_TS_RECENT_AGE    8
+#define S_TCB_T_RTT_TS_RECENT_AGE    16
+#define M_TCB_T_RTT_TS_RECENT_AGE    0xffffffffULL
+#define V_TCB_T_RTT_TS_RECENT_AGE(x) ((x) << S_TCB_T_RTT_TS_RECENT_AGE)
+
+#define W_TCB_T_RTSEQ_RECENT    9
+#define S_TCB_T_RTSEQ_RECENT    16
+#define M_TCB_T_RTSEQ_RECENT    0xffffffffULL
+#define V_TCB_T_RTSEQ_RECENT(x) ((x) << S_TCB_T_RTSEQ_RECENT)
+
+#define W_TCB_T_SRTT    10
+#define S_TCB_T_SRTT    16
+#define M_TCB_T_SRTT    0xffffULL
+#define V_TCB_T_SRTT(x) ((x) << S_TCB_T_SRTT)
+
+#define W_TCB_T_RTTVAR    11
+#define S_TCB_T_RTTVAR    0
+#define M_TCB_T_RTTVAR    0xffffULL
+#define V_TCB_T_RTTVAR(x) ((x) << S_TCB_T_RTTVAR)
+
+#define W_TCB_TS_LAST_ACK_SENT_RAW    11
+#define S_TCB_TS_LAST_ACK_SENT_RAW    16
+#define M_TCB_TS_LAST_ACK_SENT_RAW    0x7ffffffULL
+#define V_TCB_TS_LAST_ACK_SENT_RAW(x) ((x) << S_TCB_TS_LAST_ACK_SENT_RAW)
+
+#define W_TCB_DIP    12
+#define S_TCB_DIP    11
+#define M_TCB_DIP    0xffffffffULL
+#define V_TCB_DIP(x) ((x) << S_TCB_DIP)
+
+#define W_TCB_SIP    13
+#define S_TCB_SIP    11
+#define M_TCB_SIP    0xffffffffULL
+#define V_TCB_SIP(x) ((x) << S_TCB_SIP)
+
+#define W_TCB_DP    14
+#define S_TCB_DP    11
+#define M_TCB_DP    0xffffULL
+#define V_TCB_DP(x) ((x) << S_TCB_DP)
+
+#define W_TCB_SP    14
+#define S_TCB_SP    27
+#define M_TCB_SP    0xffffULL
+#define V_TCB_SP(x) ((x) << S_TCB_SP)
+
+#define W_TCB_TIMESTAMP    15
+#define S_TCB_TIMESTAMP    11
+#define M_TCB_TIMESTAMP    0xffffffffULL
+#define V_TCB_TIMESTAMP(x) ((x) << S_TCB_TIMESTAMP)
+
+#define W_TCB_TIMESTAMP_OFFSET    16
+#define S_TCB_TIMESTAMP_OFFSET    11
+#define M_TCB_TIMESTAMP_OFFSET    0xfULL
+#define V_TCB_TIMESTAMP_OFFSET(x) ((x) << S_TCB_TIMESTAMP_OFFSET)
+
+#define W_TCB_TX_MAX    16
+#define S_TCB_TX_MAX    15
+#define M_TCB_TX_MAX    0xffffffffULL
+#define V_TCB_TX_MAX(x) ((x) << S_TCB_TX_MAX)
+
+#define W_TCB_TX_HDR_PTR_RAW    17
+#define S_TCB_TX_HDR_PTR_RAW    15
+#define M_TCB_TX_HDR_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_HDR_PTR_RAW(x) ((x) << S_TCB_TX_HDR_PTR_RAW)
+
+#define W_TCB_TX_LAST_PTR_RAW    18
+#define S_TCB_TX_LAST_PTR_RAW    0
+#define M_TCB_TX_LAST_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_LAST_PTR_RAW(x) ((x) << S_TCB_TX_LAST_PTR_RAW)
+
+#define W_TCB_TX_COMPACT    18
+#define S_TCB_TX_COMPACT    17
+#define M_TCB_TX_COMPACT    0x1ULL
+#define V_TCB_TX_COMPACT(x) ((x) << S_TCB_TX_COMPACT)
+
+#define W_TCB_RX_COMPACT    18
+#define S_TCB_RX_COMPACT    18
+#define M_TCB_RX_COMPACT    0x1ULL
+#define V_TCB_RX_COMPACT(x) ((x) << S_TCB_RX_COMPACT)
+
+#define W_TCB_RCV_WND    18
+#define S_TCB_RCV_WND    19
+#define M_TCB_RCV_WND    0x7ffffffULL
+#define V_TCB_RCV_WND(x) ((x) << S_TCB_RCV_WND)
+
+#define W_TCB_RX_HDR_OFFSET    19
+#define S_TCB_RX_HDR_OFFSET    14
+#define M_TCB_RX_HDR_OFFSET    0x7ffffffULL
+#define V_TCB_RX_HDR_OFFSET(x) ((x) << S_TCB_RX_HDR_OFFSET)
+
+#define W_TCB_RX_FRAG0_START_IDX_RAW    20
+#define S_TCB_RX_FRAG0_START_IDX_RAW    9
+#define M_TCB_RX_FRAG0_START_IDX_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG0_START_IDX_RAW(x) ((x) << S_TCB_RX_FRAG0_START_IDX_RAW)
+
+#define W_TCB_RX_FRAG1_START_IDX_OFFSET    21
+#define S_TCB_RX_FRAG1_START_IDX_OFFSET    4
+#define M_TCB_RX_FRAG1_START_IDX_OFFSET    0x7ffffffULL
+#define V_TCB_RX_FRAG1_START_IDX_OFFSET(x) ((x) << S_TCB_RX_FRAG1_START_IDX_OFFSET)
+
+#define W_TCB_RX_FRAG0_LEN    21
+#define S_TCB_RX_FRAG0_LEN    31
+#define M_TCB_RX_FRAG0_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG0_LEN(x) ((x) << S_TCB_RX_FRAG0_LEN)
+
+#define W_TCB_RX_FRAG1_LEN    22
+#define S_TCB_RX_FRAG1_LEN    26
+#define M_TCB_RX_FRAG1_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG1_LEN(x) ((x) << S_TCB_RX_FRAG1_LEN)
+
+#define W_TCB_NEWRENO_RECOVER    23
+#define S_TCB_NEWRENO_RECOVER    21
+#define M_TCB_NEWRENO_RECOVER    0x7ffffffULL
+#define V_TCB_NEWRENO_RECOVER(x) ((x) << S_TCB_NEWRENO_RECOVER)
+
+#define W_TCB_PDU_HAVE_LEN    24
+#define S_TCB_PDU_HAVE_LEN    16
+#define M_TCB_PDU_HAVE_LEN    0x1ULL
+#define V_TCB_PDU_HAVE_LEN(x) ((x) << S_TCB_PDU_HAVE_LEN)
+
+#define W_TCB_PDU_LEN    24
+#define S_TCB_PDU_LEN    17
+#define M_TCB_PDU_LEN    0xffffULL
+#define V_TCB_PDU_LEN(x) ((x) << S_TCB_PDU_LEN)
+
+#define W_TCB_RX_QUIESCE    25
+#define S_TCB_RX_QUIESCE    1
+#define M_TCB_RX_QUIESCE    0x1ULL
+#define V_TCB_RX_QUIESCE(x) ((x) << S_TCB_RX_QUIESCE)
+
+#define W_TCB_RX_PTR_RAW    25
+#define S_TCB_RX_PTR_RAW    2
+#define M_TCB_RX_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_PTR_RAW(x) ((x) << S_TCB_RX_PTR_RAW)
+
+#define W_TCB_CPU_NO    25
+#define S_TCB_CPU_NO    19
+#define M_TCB_CPU_NO    0x7fULL
+#define V_TCB_CPU_NO(x) ((x) << S_TCB_CPU_NO)
+
+#define W_TCB_ULP_TYPE    25
+#define S_TCB_ULP_TYPE    26
+#define M_TCB_ULP_TYPE    0xfULL
+#define V_TCB_ULP_TYPE(x) ((x) << S_TCB_ULP_TYPE)
+
+#define W_TCB_RX_FRAG1_PTR_RAW    25
+#define S_TCB_RX_FRAG1_PTR_RAW    30
+#define M_TCB_RX_FRAG1_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    26
+#define S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    15
+#define M_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW)
+
+#define W_TCB_RX_FRAG2_PTR_RAW    27
+#define S_TCB_RX_FRAG2_PTR_RAW    10
+#define M_TCB_RX_FRAG2_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG2_PTR_RAW(x) ((x) << S_TCB_RX_FRAG2_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_LEN_RAW    27
+#define S_TCB_RX_FRAG2_LEN_RAW    27
+#define M_TCB_RX_FRAG2_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_LEN_RAW(x) ((x) << S_TCB_RX_FRAG2_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_PTR_RAW    28
+#define S_TCB_RX_FRAG3_PTR_RAW    22
+#define M_TCB_RX_FRAG3_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG3_PTR_RAW(x) ((x) << S_TCB_RX_FRAG3_PTR_RAW)
+
+#define W_TCB_RX_FRAG3_LEN_RAW    29
+#define S_TCB_RX_FRAG3_LEN_RAW    7
+#define M_TCB_RX_FRAG3_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_LEN_RAW(x) ((x) << S_TCB_RX_FRAG3_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    30
+#define S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    2
+#define M_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW)
+
+#define W_TCB_PDU_HDR_LEN    30
+#define S_TCB_PDU_HDR_LEN    29
+#define M_TCB_PDU_HDR_LEN    0xffULL
+#define V_TCB_PDU_HDR_LEN(x) ((x) << S_TCB_PDU_HDR_LEN)
+
+#define W_TCB_SLUSH1    31
+#define S_TCB_SLUSH1    5
+#define M_TCB_SLUSH1    0x7ffffULL
+#define V_TCB_SLUSH1(x) ((x) << S_TCB_SLUSH1)
+
+#define W_TCB_ULP_RAW    31
+#define S_TCB_ULP_RAW    24
+#define M_TCB_ULP_RAW    0xffULL
+#define V_TCB_ULP_RAW(x) ((x) << S_TCB_ULP_RAW)
+
+#define W_TCB_DDP_RDMAP_VERSION    25
+#define S_TCB_DDP_RDMAP_VERSION    30
+#define M_TCB_DDP_RDMAP_VERSION    0x1ULL
+#define V_TCB_DDP_RDMAP_VERSION(x) ((x) << S_TCB_DDP_RDMAP_VERSION)
+
+#define W_TCB_MARKER_ENABLE_RX    25
+#define S_TCB_MARKER_ENABLE_RX    31
+#define M_TCB_MARKER_ENABLE_RX    0x1ULL
+#define V_TCB_MARKER_ENABLE_RX(x) ((x) << S_TCB_MARKER_ENABLE_RX)
+
+#define W_TCB_MARKER_ENABLE_TX    26
+#define S_TCB_MARKER_ENABLE_TX    0
+#define M_TCB_MARKER_ENABLE_TX    0x1ULL
+#define V_TCB_MARKER_ENABLE_TX(x) ((x) << S_TCB_MARKER_ENABLE_TX)
+
+#define W_TCB_CRC_ENABLE    26
+#define S_TCB_CRC_ENABLE    1
+#define M_TCB_CRC_ENABLE    0x1ULL
+#define V_TCB_CRC_ENABLE(x) ((x) << S_TCB_CRC_ENABLE)
+
+#define W_TCB_IRS_ULP    26
+#define S_TCB_IRS_ULP    2
+#define M_TCB_IRS_ULP    0x1ffULL
+#define V_TCB_IRS_ULP(x) ((x) << S_TCB_IRS_ULP)
+
+#define W_TCB_ISS_ULP    26
+#define S_TCB_ISS_ULP    11
+#define M_TCB_ISS_ULP    0x1ffULL
+#define V_TCB_ISS_ULP(x) ((x) << S_TCB_ISS_ULP)
+
+#define W_TCB_TX_PDU_LEN    26
+#define S_TCB_TX_PDU_LEN    20
+#define M_TCB_TX_PDU_LEN    0x3fffULL
+#define V_TCB_TX_PDU_LEN(x) ((x) << S_TCB_TX_PDU_LEN)
+
+#define W_TCB_TX_PDU_OUT    27
+#define S_TCB_TX_PDU_OUT    2
+#define M_TCB_TX_PDU_OUT    0x1ULL
+#define V_TCB_TX_PDU_OUT(x) ((x) << S_TCB_TX_PDU_OUT)
+
+#define W_TCB_CQ_IDX_SQ    27
+#define S_TCB_CQ_IDX_SQ    3
+#define M_TCB_CQ_IDX_SQ    0xffffULL
+#define V_TCB_CQ_IDX_SQ(x) ((x) << S_TCB_CQ_IDX_SQ)
+
+#define W_TCB_CQ_IDX_RQ    27
+#define S_TCB_CQ_IDX_RQ    19
+#define M_TCB_CQ_IDX_RQ    0xffffULL
+#define V_TCB_CQ_IDX_RQ(x) ((x) << S_TCB_CQ_IDX_RQ)
+
+#define W_TCB_QP_ID    28
+#define S_TCB_QP_ID    3
+#define M_TCB_QP_ID    0xffffULL
+#define V_TCB_QP_ID(x) ((x) << S_TCB_QP_ID)
+
+#define W_TCB_PD_ID    28
+#define S_TCB_PD_ID    19
+#define M_TCB_PD_ID    0xffffULL
+#define V_TCB_PD_ID(x) ((x) << S_TCB_PD_ID)
+
+#define W_TCB_STAG    29
+#define S_TCB_STAG    3
+#define M_TCB_STAG    0xffffffffULL
+#define V_TCB_STAG(x) ((x) << S_TCB_STAG)
+
+#define W_TCB_RQ_START    30
+#define S_TCB_RQ_START    3
+#define M_TCB_RQ_START    0x3ffffffULL
+#define V_TCB_RQ_START(x) ((x) << S_TCB_RQ_START)
+
+#define W_TCB_RQ_MSN    30
+#define S_TCB_RQ_MSN    29
+#define M_TCB_RQ_MSN    0x3ffULL
+#define V_TCB_RQ_MSN(x) ((x) << S_TCB_RQ_MSN)
+
+#define W_TCB_RQ_MAX_OFFSET    31
+#define S_TCB_RQ_MAX_OFFSET    7
+#define M_TCB_RQ_MAX_OFFSET    0xfULL
+#define V_TCB_RQ_MAX_OFFSET(x) ((x) << S_TCB_RQ_MAX_OFFSET)
+
+#define W_TCB_RQ_WRITE_PTR    31
+#define S_TCB_RQ_WRITE_PTR    11
+#define M_TCB_RQ_WRITE_PTR    0x3ffULL
+#define V_TCB_RQ_WRITE_PTR(x) ((x) << S_TCB_RQ_WRITE_PTR)
+
+#define W_TCB_INB_WRITE_PERM    31
+#define S_TCB_INB_WRITE_PERM    21
+#define M_TCB_INB_WRITE_PERM    0x1ULL
+#define V_TCB_INB_WRITE_PERM(x) ((x) << S_TCB_INB_WRITE_PERM)
+
+#define W_TCB_INB_READ_PERM    31
+#define S_TCB_INB_READ_PERM    22
+#define M_TCB_INB_READ_PERM    0x1ULL
+#define V_TCB_INB_READ_PERM(x) ((x) << S_TCB_INB_READ_PERM)
+
+#define W_TCB_ORD_L_BIT_VLD    31
+#define S_TCB_ORD_L_BIT_VLD    23
+#define M_TCB_ORD_L_BIT_VLD    0x1ULL
+#define V_TCB_ORD_L_BIT_VLD(x) ((x) << S_TCB_ORD_L_BIT_VLD)
+
+#define W_TCB_RDMAP_OPCODE    31
+#define S_TCB_RDMAP_OPCODE    24
+#define M_TCB_RDMAP_OPCODE    0xfULL
+#define V_TCB_RDMAP_OPCODE(x) ((x) << S_TCB_RDMAP_OPCODE)
+
+#define W_TCB_TX_FLUSH    31
+#define S_TCB_TX_FLUSH    28
+#define M_TCB_TX_FLUSH    0x1ULL
+#define V_TCB_TX_FLUSH(x) ((x) << S_TCB_TX_FLUSH)
+
+#define W_TCB_TX_OOS_RXMT    31
+#define S_TCB_TX_OOS_RXMT    29
+#define M_TCB_TX_OOS_RXMT    0x1ULL
+#define V_TCB_TX_OOS_RXMT(x) ((x) << S_TCB_TX_OOS_RXMT)
+
+#define W_TCB_TX_OOS_TXMT    31
+#define S_TCB_TX_OOS_TXMT    30
+#define M_TCB_TX_OOS_TXMT    0x1ULL
+#define V_TCB_TX_OOS_TXMT(x) ((x) << S_TCB_TX_OOS_TXMT)
+
+#define W_TCB_SLUSH_AUX2    31
+#define S_TCB_SLUSH_AUX2    31
+#define M_TCB_SLUSH_AUX2    0x1ULL
+#define V_TCB_SLUSH_AUX2(x) ((x) << S_TCB_SLUSH_AUX2)
+
+#define W_TCB_RX_FRAG1_PTR_RAW2    25
+#define S_TCB_RX_FRAG1_PTR_RAW2    30
+#define M_TCB_RX_FRAG1_PTR_RAW2    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW2(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW2)
+
+#define W_TCB_RX_DDP_FLAGS    26
+#define S_TCB_RX_DDP_FLAGS    15
+#define M_TCB_RX_DDP_FLAGS    0x3ffULL
+#define V_TCB_RX_DDP_FLAGS(x) ((x) << S_TCB_RX_DDP_FLAGS)
+
+#define W_TCB_SLUSH_AUX3    26
+#define S_TCB_SLUSH_AUX3    31
+#define M_TCB_SLUSH_AUX3    0x1ffULL
+#define V_TCB_SLUSH_AUX3(x) ((x) << S_TCB_SLUSH_AUX3)
+
+#define W_TCB_RX_DDP_BUF0_OFFSET    27
+#define S_TCB_RX_DDP_BUF0_OFFSET    8
+#define M_TCB_RX_DDP_BUF0_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF0_OFFSET)
+
+#define W_TCB_RX_DDP_BUF0_LEN    27
+#define S_TCB_RX_DDP_BUF0_LEN    30
+#define M_TCB_RX_DDP_BUF0_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_LEN(x) ((x) << S_TCB_RX_DDP_BUF0_LEN)
+
+#define W_TCB_RX_DDP_BUF1_OFFSET    28
+#define S_TCB_RX_DDP_BUF1_OFFSET    20
+#define M_TCB_RX_DDP_BUF1_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF1_OFFSET)
+
+#define W_TCB_RX_DDP_BUF1_LEN    29
+#define S_TCB_RX_DDP_BUF1_LEN    10
+#define M_TCB_RX_DDP_BUF1_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_LEN(x) ((x) << S_TCB_RX_DDP_BUF1_LEN)
+
+#define W_TCB_RX_DDP_BUF0_TAG    30
+#define S_TCB_RX_DDP_BUF0_TAG    0
+#define M_TCB_RX_DDP_BUF0_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF0_TAG(x) ((x) << S_TCB_RX_DDP_BUF0_TAG)
+
+#define W_TCB_RX_DDP_BUF1_TAG    31
+#define S_TCB_RX_DDP_BUF1_TAG    0
+#define M_TCB_RX_DDP_BUF1_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF1_TAG(x) ((x) << S_TCB_RX_DDP_BUF1_TAG)
+
+#define S_TF_DACK    10
+#define V_TF_DACK(x) ((x) << S_TF_DACK)
+
+#define S_TF_NAGLE    11
+#define V_TF_NAGLE(x) ((x) << S_TF_NAGLE)
+
+#define S_TF_RECV_SCALE    12
+#define V_TF_RECV_SCALE(x) ((x) << S_TF_RECV_SCALE)
+
+#define S_TF_RECV_TSTMP    13
+#define V_TF_RECV_TSTMP(x) ((x) << S_TF_RECV_TSTMP)
+
+#define S_TF_RECV_SACK    14
+#define V_TF_RECV_SACK(x) ((x) << S_TF_RECV_SACK)
+
+#define S_TF_TURBO    15
+#define V_TF_TURBO(x) ((x) << S_TF_TURBO)
+
+#define S_TF_KEEPALIVE    16
+#define V_TF_KEEPALIVE(x) ((x) << S_TF_KEEPALIVE)
+
+#define S_TF_TCAM_BYPASS    17
+#define V_TF_TCAM_BYPASS(x) ((x) << S_TF_TCAM_BYPASS)
+
+#define S_TF_CORE_FIN    18
+#define V_TF_CORE_FIN(x) ((x) << S_TF_CORE_FIN)
+
+#define S_TF_CORE_MORE    19
+#define V_TF_CORE_MORE(x) ((x) << S_TF_CORE_MORE)
+
+#define S_TF_MIGRATING    20
+#define V_TF_MIGRATING(x) ((x) << S_TF_MIGRATING)
+
+#define S_TF_ACTIVE_OPEN    21
+#define V_TF_ACTIVE_OPEN(x) ((x) << S_TF_ACTIVE_OPEN)
+
+#define S_TF_ASK_MODE    22
+#define V_TF_ASK_MODE(x) ((x) << S_TF_ASK_MODE)
+
+#define S_TF_NON_OFFLOAD    23
+#define V_TF_NON_OFFLOAD(x) ((x) << S_TF_NON_OFFLOAD)
+
+#define S_TF_MOD_SCHD    24
+#define V_TF_MOD_SCHD(x) ((x) << S_TF_MOD_SCHD)
+
+#define S_TF_MOD_SCHD_REASON0    25
+#define V_TF_MOD_SCHD_REASON0(x) ((x) << S_TF_MOD_SCHD_REASON0)
+
+#define S_TF_MOD_SCHD_REASON1    26
+#define V_TF_MOD_SCHD_REASON1(x) ((x) << S_TF_MOD_SCHD_REASON1)
+
+#define S_TF_MOD_SCHD_RX    27
+#define V_TF_MOD_SCHD_RX(x) ((x) << S_TF_MOD_SCHD_RX)
+
+#define S_TF_CORE_PUSH    28
+#define V_TF_CORE_PUSH(x) ((x) << S_TF_CORE_PUSH)
+
+#define S_TF_RCV_COALESCE_ENABLE    29
+#define V_TF_RCV_COALESCE_ENABLE(x) ((x) << S_TF_RCV_COALESCE_ENABLE)
+
+#define S_TF_RCV_COALESCE_PUSH    30
+#define V_TF_RCV_COALESCE_PUSH(x) ((x) << S_TF_RCV_COALESCE_PUSH)
+
+#define S_TF_RCV_COALESCE_LAST_PSH    31
+#define V_TF_RCV_COALESCE_LAST_PSH(x) ((x) << S_TF_RCV_COALESCE_LAST_PSH)
+
+#define S_TF_RCV_COALESCE_HEARTBEAT    32
+#define V_TF_RCV_COALESCE_HEARTBEAT(x) ((x) << S_TF_RCV_COALESCE_HEARTBEAT)
+
+#define S_TF_HALF_CLOSE    33
+#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE)
+
+#define S_TF_DACK_MSS    34
+#define V_TF_DACK_MSS(x) ((x) << S_TF_DACK_MSS)
+
+#define S_TF_CCTRL_SEL0    35
+#define V_TF_CCTRL_SEL0(x) ((x) << S_TF_CCTRL_SEL0)
+
+#define S_TF_CCTRL_SEL1    36
+#define V_TF_CCTRL_SEL1(x) ((x) << S_TF_CCTRL_SEL1)
+
+#define S_TF_TCP_NEWRENO_FAST_RECOVERY    37
+#define V_TF_TCP_NEWRENO_FAST_RECOVERY(x) ((x) << S_TF_TCP_NEWRENO_FAST_RECOVERY)
+
+#define S_TF_TX_PACE_AUTO    38
+#define V_TF_TX_PACE_AUTO(x) ((x) << S_TF_TX_PACE_AUTO)
+
+#define S_TF_PEER_FIN_HELD    39
+#define V_TF_PEER_FIN_HELD(x) ((x) << S_TF_PEER_FIN_HELD)
+
+#define S_TF_CORE_URG    40
+#define V_TF_CORE_URG(x) ((x) << S_TF_CORE_URG)
+
+#define S_TF_RDMA_ERROR    41
+#define V_TF_RDMA_ERROR(x) ((x) << S_TF_RDMA_ERROR)
+
+#define S_TF_SSWS_DISABLED    42
+#define V_TF_SSWS_DISABLED(x) ((x) << S_TF_SSWS_DISABLED)
+
+#define S_TF_DUPACK_COUNT_ODD    43
+#define V_TF_DUPACK_COUNT_ODD(x) ((x) << S_TF_DUPACK_COUNT_ODD)
+
+#define S_TF_TX_CHANNEL    44
+#define V_TF_TX_CHANNEL(x) ((x) << S_TF_TX_CHANNEL)
+
+#define S_TF_RX_CHANNEL    45
+#define V_TF_RX_CHANNEL(x) ((x) << S_TF_RX_CHANNEL)
+
+#define S_TF_TX_PACE_FIXED    46
+#define V_TF_TX_PACE_FIXED(x) ((x) << S_TF_TX_PACE_FIXED)
+
+#define S_TF_RDMA_FLM_ERROR    47
+#define V_TF_RDMA_FLM_ERROR(x) ((x) << S_TF_RDMA_FLM_ERROR)
+
+#define S_TF_RX_FLOW_CONTROL_DISABLE    48
+#define V_TF_RX_FLOW_CONTROL_DISABLE(x) ((x) << S_TF_RX_FLOW_CONTROL_DISABLE)
+
+#endif /* _TCB_DEFS_H */
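
A note on reading the definitions above: each field seems to come as a
W_/S_/M_/V_ group, which reads naturally as word index, shift within that
word, unshifted mask, and a macro that shifts a value into place. Several
fields (e.g. TCB_T_RXTSHIFT: shift 30, mask 0xf) run past bit 31 of their
word, which would explain the ULL masks. The sketch below is only an
illustration under that reading. tcb_window(), tcb_get_field() and
tcb_set_field() are hypothetical helpers, not part of this patch, and they
assume the TCB is held as an array of host-order 32-bit words, word 0 first.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the header above. */
#define W_TCB_T_RXTSHIFT    0
#define S_TCB_T_RXTSHIFT    30
#define M_TCB_T_RXTSHIFT    0xfULL
#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT)

#define TCB_WORDS 32	/* assumption: the W_ indices above run from 0 to 31 */

/* Hypothetical helper: load the 64-bit window starting at 32-bit word 'word'. */
static uint64_t tcb_window(const uint32_t *tcb, unsigned int word)
{
	uint64_t v;

	assert(word < TCB_WORDS);
	v = tcb[word];
	if (word + 1 < TCB_WORDS)
		v |= (uint64_t)tcb[word + 1] << 32;
	return v;
}

/* Hypothetical helper: extract a field given its W_, S_ and M_ values. */
static uint64_t tcb_get_field(const uint32_t *tcb, unsigned int word,
			      unsigned int shift, uint64_t mask)
{
	return (tcb_window(tcb, word) >> shift) & mask;
}

/* Hypothetical helper: overwrite a field, leaving neighbouring bits alone. */
static void tcb_set_field(uint32_t *tcb, unsigned int word,
			  unsigned int shift, uint64_t mask, uint64_t val)
{
	uint64_t v = tcb_window(tcb, word);

	v = (v & ~(mask << shift)) | ((val & mask) << shift);
	tcb[word] = (uint32_t)v;
	if (word + 1 < TCB_WORDS)
		tcb[word + 1] = (uint32_t)(v >> 32);
}
```

For instance, setting TCB_T_RXTSHIFT to 0xa on a zeroed image leaves
0x80000000 in word 0 and 0x2 in word 1. The same width concern applies to
the S_TF_* shifts from 32 upward: where int is 32 bits, V_TF_HALF_CLOSE(1)
shifts an int by more than its width, which is undefined in C, so callers
would need to pass a 64-bit value such as 1ULL.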


From swise at opengridcomputing.com  Sun Dec 10 14:35:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:35:15 -0600
Subject: [openib-general] [PATCH  v3 05/13] Queue Pairs
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223515.27166.60256.stgit@dell3.ogc.int>


Code to manipulate the QP, including posting send and receive work
requests and binding memory windows.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++
 1 files changed, 1007 insertions(+), 0 deletions(-)
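
Note: iwch_build_rdma_send() and iwch_build_rdma_write() in the diff below
share the same bookkeeping. The payload length is summed in a u32 with a
wrap-around check, and the flit count grows by two per SGE, apparently
because a flit is 8 bytes (cf. the sizeof(...) >> 3 conversions) and each
SGL entry carries a 32-bit stag, a 32-bit length and a 64-bit offset. The
following is a standalone sketch of that logic, not driver code: struct sge,
MAX_SGE, sgl_payload_len() and send_flit_cnt() are made-up stand-ins, and
MAX_SGE = 4 is only inferred from the four-entry arrays in
iwch_build_rdma_recv().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SGE 4	/* assumption: stands in for T3_MAX_SGE */

/* Hypothetical stand-in for the fields of struct ib_sge used here. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Sum the SGE lengths, rejecting a total that wraps past 32 bits. */
static int sgl_payload_len(const struct sge *sgl, size_t num_sge,
			   uint32_t *plen_out)
{
	uint32_t plen = 0;
	size_t i;

	if (num_sge > MAX_SGE)
		return -EINVAL;
	for (i = 0; i < num_sge; i++) {
		/* Unsigned addition wraps, so a smaller sum means overflow. */
		if ((uint32_t)(plen + sgl[i].length) < plen)
			return -EMSGSIZE;
		plen += sgl[i].length;
	}
	*plen_out = plen;
	return 0;
}

/* Flits in a send WR: 4 fixed flits plus 2 per SGE, as in the patch. */
static uint8_t send_flit_cnt(size_t num_sge)
{
	assert(num_sge <= MAX_SGE);
	return (uint8_t)(4 + (num_sge << 1));
}
```

With four SGEs a send therefore takes 12 flits; by the same arithmetic a
write, which starts from 5 fixed flits, takes 13.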

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
new file mode 100644
index 0000000..9f6b251
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -0,0 +1,1007 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+
+#define NO_SUPPORT -1
+
+static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+
+	switch (wr->opcode) {
+	case IB_WR_SEND:
+	case IB_WR_SEND_WITH_IMM:
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			wqe->send.rdmaop = T3_SEND_WITH_SE;
+		else
+			wqe->send.rdmaop = T3_SEND;
+		wqe->send.rem_stag = 0;
+		break;
+#if 0				/* Not currently supported */
+	case TYPE_SEND_INVALIDATE:
+	case TYPE_SEND_INVALIDATE_IMMEDIATE:
+		wqe->send.rdmaop = T3_SEND_WITH_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+	case TYPE_SEND_SE_INVALIDATE:
+		wqe->send.rdmaop = T3_SEND_WITH_SE_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+#endif
+	default:
+		break;
+	}
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->send.reserved[0] = 0;
+	wqe->send.reserved[1] = 0;
+	wqe->send.reserved[2] = 0;
+	if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+		plen = 4;
+		wqe->send.sgl[0].stag = wr->imm_data;
+		wqe->send.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->send.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 5;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->send.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->send.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->send.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 4 + ((wr->num_sge) << 1);
+	}
+	wqe->send.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr,
+					u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->write.rdmaop = T3_RDMA_WRITE;
+	wqe->write.reserved[0] = 0;
+	wqe->write.reserved[1] = 0;
+	wqe->write.reserved[2] = 0;
+	wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+	if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+		plen = 4;
+		wqe->write.sgl[0].stag = wr->imm_data;
+		wqe->write.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->write.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 6;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->write.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->write.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->write.sgl[i].to =
+			    cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 5 + ((wr->num_sge) << 1);
+	}
+	wqe->write.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	if (wr->num_sge > 1)
+		return -EINVAL;
+	wqe->read.rdmaop = T3_READ_REQ;
+	wqe->read.reserved[0] = 0;
+	wqe->read.reserved[1] = 0;
+	wqe->read.reserved[2] = 0;
+	wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+	wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+	wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+	wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+	*flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+	return 0;
+}
+
+/*
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */
+static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp,
+				   struct ib_sge *sg_list, u32 num_sgle,
+				   u32 *pbl_addr, u8 *page_size)
+{
+	int i;
+	struct iwch_mr *mhp;
+	u32 offset;
+	for (i = 0; i < num_sgle; i++) {
+
+		mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8);
+		if (!mhp) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (!mhp->attr.state) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (mhp->attr.zbva) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+
+		if (sg_list[i].addr < mhp->attr.va_fbo) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) <
+		    sg_list[i].addr) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) >
+		    mhp->attr.va_fbo + ((u64) mhp->attr.len)) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		offset = sg_list[i].addr - mhp->attr.va_fbo;
+		offset += ((u32) mhp->attr.va_fbo) %
+		          (1UL << (12 + mhp->attr.page_size));
+		pbl_addr[i] = ((mhp->attr.pbl_addr - 
+			        rhp->rdev.rnic_info.pbl_base) >> 3) +
+			      (offset >> (12 + mhp->attr.page_size));
+		page_size[i] = mhp->attr.page_size;
+	}
+	return 0;
+}
+
+static inline int iwch_build_rdma_recv(struct iwch_dev *rhp,
+				       union t3_wr *wqe,
+				       struct ib_recv_wr *wr)
+{
+	int i, err = 0;
+	u32 pbl_addr[4];
+	u8 page_size[4] = { 0 };	/* all four entries are copied into the WQE */
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, 
+			       page_size);
+	if (err)
+		return err;
+	wqe->recv.pagesz[0] = page_size[0];
+	wqe->recv.pagesz[1] = page_size[1];
+	wqe->recv.pagesz[2] = page_size[2];
+	wqe->recv.pagesz[3] = page_size[3];
+	wqe->recv.num_sgle = cpu_to_be32(wr->num_sge);
+	for (i = 0; i < wr->num_sge; i++) {
+		wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey);
+		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
+
+		/* sgl[i].to in the WQE is the offset into the page */
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
+				(1UL << (12 + page_size[i])));
+
+		/* pbl_addr is the adapter's address in the PBL */
+		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
+	}
+	for (; i < T3_MAX_SGE; i++) {
+		wqe->recv.sgl[i].stag = 0;
+		wqe->recv.sgl[i].len = 0;
+		wqe->recv.sgl[i].to = 0;
+		wqe->recv.pbl_addr[i] = 0;
+	}
+	return 0;
+}
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr)
+{
+	int err = 0;
+	u8 t3_wr_flit_cnt;
+	enum t3_wr_opcode t3_wr_opcode = 0;
+	enum t3_wr_flags t3_wr_flags;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+		  qhp->wq.sq_size_log2);
+	if (num_wrs <= 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	while (wr) {
+		if (num_wrs == 0) {
+			err = -ENOMEM;
+			*bad_wr = wr;
+			break;
+		}
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		t3_wr_flags = 0;
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			t3_wr_flags |= T3_SOLICITED_EVENT_FLAG;
+		if (wr->send_flags & IB_SEND_FENCE)
+			t3_wr_flags |= T3_READ_FENCE_FLAG;
+		if (wr->send_flags & IB_SEND_SIGNALED)
+			t3_wr_flags |= T3_COMPLETION_FLAG;
+		sqp = qhp->wq.sq + 
+		      Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+		switch (wr->opcode) {
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_IMM:
+			t3_wr_opcode = T3_WR_SEND;
+			err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			t3_wr_opcode = T3_WR_WRITE;
+			err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_READ:
+			t3_wr_opcode = T3_WR_READ;
+			t3_wr_flags = 0; /* T3 reads are always signaled */
+			err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt);
+			if (err) 
+				break;
+			sqp->read_len = wqe->read.local_len;
+			if (!qhp->wq.oldest_read)
+				qhp->wq.oldest_read = sqp;
+			break;
+		default:
+			PDBG("%s post of type=%d TBD!\n", __FUNCTION__,
+			     wr->opcode);
+			err = -EINVAL;
+		}
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+		sqp->wr_id = wr->wr_id;
+		sqp->opcode = wr2opcode(t3_wr_opcode);
+		sqp->sq_wptr = qhp->wq.sq_wptr;
+		sqp->complete = 0;
+		sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+
+		build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, t3_wr_flit_cnt);
+		PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", 
+		     __FUNCTION__, wr->wr_id, idx, 
+		     Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2),
+		     sqp->opcode);
+		wr = wr->next;
+		num_wrs--;
+		++(qhp->wq.wptr);
+		++(qhp->wq.sq_wptr);
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr)
+{
+	int err = 0;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, 
+			    qhp->wq.rq_size_log2) - 1;
+	if (!wr) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	while (wr) {
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		if (num_wrs)
+			err = iwch_build_rdma_recv(qhp->rhp, wqe, wr);
+		else
+			err = -ENOMEM;
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = 
+			wr->wr_id;
+		build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, sizeof(struct t3_receive_wr) >> 3);
+		PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x "
+		     "wqe %p \n", __FUNCTION__, wr->wr_id, idx, 
+		     qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe);
+		++(qhp->wq.rq_wptr);
+		++(qhp->wq.wptr);
+		wr = wr->next;
+		num_wrs--;
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_bind_mw(struct ib_qp *qp,
+		 struct ib_mw *mw,
+		 struct ib_mw_bind *mw_bind)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	struct iwch_qp *qhp;
+	union t3_wr *wqe;
+	u32 pbl_addr;
+	u8 page_size;
+	u32 num_wrs;
+	unsigned long flag;
+	struct ib_sge sgl;
+	int err = 0;
+	enum t3_wr_flags t3_wr_flags;
+	u32 idx;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(qp);
+	mhp = to_iwch_mw(mw);
+	rhp = qhp->rhp;
+
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+			    qhp->wq.sq_size_log2);
+	if ((num_wrs) <= 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+	PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, 
+	     mw, mw_bind);
+	wqe = (union t3_wr *) (qhp->wq.queue + idx);
+
+	t3_wr_flags = 0;
+	if (mw_bind->send_flags & IB_SEND_SIGNALED)
+		t3_wr_flags = T3_COMPLETION_FLAG;
+
+	sgl.addr = mw_bind->addr;
+	sgl.lkey = mw_bind->mr->lkey;
+	sgl.length = mw_bind->length;
+	wqe->bind.reserved = 0;
+	wqe->bind.type = T3_VA_BASED_TO;
+
+	/* TBD: check perms */
+	wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags);
+	wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey);
+	wqe->bind.mw_stag = cpu_to_be32(mw->rkey);
+	wqe->bind.mw_len = cpu_to_be32(mw_bind->length);
+	wqe->bind.mw_va = cpu_to_be64(mw_bind->addr);
+	err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size);
+	if (err) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return err;
+	}
+	wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+	sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+	sqp->wr_id = mw_bind->wr_id;
+	sqp->opcode = T3_BIND_MW;
+	sqp->sq_wptr = qhp->wq.sq_wptr;
+	sqp->complete = 0;
+	sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED);
+	wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr);
+	wqe->bind.mr_pagesz = page_size;
+	wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id;
+	build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags,
+		       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, 
+			        sizeof(struct t3_bind_mw_wr) >> 3);
+	++(qhp->wq.wptr);
+	++(qhp->wq.sq_wptr);
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+
+	return err;
+}
+
+static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode,
+				    int tagged)
+{
+	switch (t3err) {
+	case TPT_ERR_STAG:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_STAG;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_INV_STAG;
+		}
+		break;
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_STAG_NOT_ASSOC;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_STAG_NOT_ASSOC;
+		}
+		break;
+	case TPT_ERR_WRAP:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+		*ecode = RDMAP_TO_WRAP;
+		break;
+	case TPT_ERR_BOUND:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_BASE_BOUNDS;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_BASE_BOUNDS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_MSG_TOOBIG;
+		}
+		break;
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_CANT_INV_STAG;
+		break;
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		*layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_OUT_OF_RQE:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_NOBUF;
+		break;
+	case TPT_ERR_PBL_ADDR_BOUND:
+		*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+		*ecode = DDPT_BASE_BOUNDS;
+		break;
+	case TPT_ERR_CRC:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_CRC_ERR;
+		break;
+	case TPT_ERR_MARKER:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_MARKER_ERR;
+		break;
+	case TPT_ERR_PDU_LEN_ERR:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_MSG_TOOBIG;
+		break;
+	case TPT_ERR_DDP_VERSION:
+		if (tagged) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_VERS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_INV_VERS;
+		}
+		break;
+	case TPT_ERR_RDMA_VERSION:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_VERS;
+		break;
+	case TPT_ERR_OPCODE:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_OPCODE;
+		break;
+	case TPT_ERR_DDP_QUEUE_NUM:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_QN;
+		break;
+	case TPT_ERR_MSN:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_IRD_OVERFLOW:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_RANGE;
+		break;
+	case TPT_ERR_TBIT:
+		*layer_type = LAYER_DDP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_MO:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MO;
+		break;
+	default: 
+		*layer_type = LAYER_RDMAP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	}
+}
+
+/*
+ * Post a TERMINATE; without rsp_msg it is layer=RDMA, type=catastrophic.
+ */
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
+{
+	union t3_wr *wqe;
+	struct terminate_message *term;
+	int status;
+	int tagged = 0;
+	struct sk_buff *skb;
+
+	PDBG("%s %d\n", __FUNCTION__, __LINE__);
+	skb = alloc_skb(40, GFP_ATOMIC);
+	if (!skb) {
+		printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (union t3_wr *)skb_put(skb, 40);
+	memset(wqe, 0, 40);
+	wqe->send.rdmaop = T3_TERMINATE;
+	
+	/* immediate data length */
+	wqe->send.plen = htonl(4);
+
+	/* immediate data starts here. */
+	term = (struct terminate_message *)wqe->send.sgl;
+	if (rsp_msg) {
+		status = CQE_STATUS(rsp_msg->cqe);
+		if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)
+			tagged = 1;
+		if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) ||
+		    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP))
+			tagged = 2;
+	} else {
+		status = TPT_ERR_INTERNAL_ERR;
+	}
+	build_term_codes(status, &term->layer_etype, &term->ecode, tagged);
+	build_fw_riwrh((void *)wqe, T3_WR_SEND, 
+		       T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, 
+		       qhp->ep->hwtid, 5);
+	skb->priority = CPL_PRIORITY_DATA;
+	return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb));
+}
+
+/*
+ * Assumes qhp lock is held.
+ */
+static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	struct iwch_cq *rchp, *schp;
+	int count;
+
+	rchp = get_chp(qhp->rhp, qhp->attr.rcq);
+	schp = get_chp(qhp->rhp, qhp->attr.scq);
+	
+	PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp);
+	/* take a ref on the qhp since we must release the lock */
+	atomic_inc(&qhp->refcnt);
+	spin_unlock_irqrestore(&qhp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&rchp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&rchp->cq);
+	cxio_count_rcqes(&rchp->cq, &qhp->wq, &count);
+	cxio_flush_rq(&qhp->wq, &rchp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&rchp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&schp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&schp->cq);
+	cxio_count_scqes(&schp->cq, &qhp->wq, &count);
+	cxio_flush_sq(&qhp->wq, &schp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&schp->lock, *flag);
+
+	/* deref */
+	if (atomic_dec_and_test(&qhp->refcnt))
+                wake_up(&qhp->wait);
+
+	spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	if (t3b_device(qhp->rhp))
+		cxio_set_wq_in_error(&qhp->wq);
+	else
+		__flush_qp(qhp, flag);
+}
+
+
+/* 
+ * Return nonzero if at least one RECV was pre-posted.
+ */
+static inline int rqes_posted(struct iwch_qp *qhp)
+{ 
+	return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs)
+{
+	struct t3_rdma_init_attr init_attr;
+	int ret;
+
+	init_attr.tid = qhp->ep->hwtid;
+	init_attr.qpid = qhp->wq.qpid;
+	init_attr.pdid = qhp->attr.pd;
+	init_attr.scqid = qhp->attr.scq;
+	init_attr.rcqid = qhp->attr.rcq;
+	init_attr.rq_addr = qhp->wq.rq_addr;
+	init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+	init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | 
+		qhp->attr.mpa_attr.recv_marker_enabled |
+		(qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+		(qhp->attr.mpa_attr.crc_enabled << 2);
+
+	/* 
+	 * XXX - The IWCM doesn't quite handle getting these
+ 	 * attrs set before going into RTS.  For now, just turn 
+	 * them on always...
+	 */
+#if 0
+	init_attr.qpcaps = qhp->attr.enableRdmaRead |
+		(qhp->attr.enableRdmaWrite << 1) |
+		(qhp->attr.enableBind << 2) |
+		(qhp->attr.enable_stag0_fastreg << 3) |
+		(qhp->attr.enable_stag0_fastreg << 4);
+#else
+	init_attr.qpcaps = 0x1f;
+#endif
+	init_attr.tcp_emss = qhp->ep->emss;
+	init_attr.ord = qhp->attr.max_ord;
+	init_attr.ird = qhp->attr.max_ird;
+	init_attr.qp_dma_addr = qhp->wq.dma_addr;
+	init_attr.qp_dma_size = (1UL << qhp->wq.size_log2);
+	init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0;
+	PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d "
+	     "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, 
+	     init_attr.rq_addr, init_attr.rq_size, 
+	     init_attr.flags, init_attr.qpcaps);
+	ret = cxio_rdma_init(&rhp->rdev, &init_attr);
+	PDBG("%s ret %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal)
+{
+	int ret = 0;
+	struct iwch_qp_attributes newattr = qhp->attr;
+	unsigned long flag;
+	int disconnect = 0;
+	int terminate = 0;
+	int abort = 0;
+	int free = 0;
+	struct iwch_ep *ep = NULL;
+
+	PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, 
+	     qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, 
+	     (mask & IWCH_QP_ATTR_NEXT_STATE) ? attrs->next_state : -1);
+
+	spin_lock_irqsave(&qhp->lock, flag);
+
+	/* Process attr changes if in IDLE */
+	if (mask & IWCH_QP_ATTR_VALID_MODIFY) {
+		if (qhp->attr.state != IWCH_QP_STATE_IDLE) {
+			ret = -EIO;
+			goto out;
+		}
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ)
+			newattr.enable_rdma_read = attrs->enable_rdma_read;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE)
+			newattr.enable_rdma_write = attrs->enable_rdma_write;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND)
+			newattr.enable_bind = attrs->enable_bind;
+		if (mask & IWCH_QP_ATTR_MAX_ORD) {
+			if (attrs->max_ord > 
+			    rhp->attr.max_rdma_read_qp_depth) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ord = attrs->max_ord;
+		}
+		if (mask & IWCH_QP_ATTR_MAX_IRD) {
+			if (attrs->max_ird > 
+		  	    rhp->attr.max_rdma_reads_per_qp) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ird = attrs->max_ird;
+		}
+		qhp->attr = newattr;
+	}
+	
+	if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) 
+		goto out;
+	if (qhp->attr.state == attrs->next_state)
+		goto out;
+
+	switch (qhp->attr.state) {
+	case IWCH_QP_STATE_IDLE:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_RTS: 
+			if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			qhp->attr.mpa_attr = attrs->mpa_attr;
+			qhp->attr.llp_stream_handle = attrs->llp_stream_handle;
+			qhp->ep = qhp->attr.llp_stream_handle;
+			qhp->attr.state = IWCH_QP_STATE_RTS;
+
+			/*
+			 * Ref the endpoint here and deref when we
+	 		 * disassociate the endpoint from the QP.  This
+			 * happens in CLOSING->IDLE transition or *->ERROR
+			 * transition.
+			 */
+			get_ep(&qhp->ep->com);
+			spin_unlock_irqrestore(&qhp->lock, flag);
+			ret = rdma_init(rhp, qhp, mask, attrs);
+			spin_lock_irqsave(&qhp->lock, flag);
+			if (ret)
+				goto err;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			flush_qp(qhp, &flag);
+			break;
+		default:
+			ret = -EINVAL;	
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_RTS:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_CLOSING:
+			BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+			qhp->attr.state = IWCH_QP_STATE_CLOSING;
+			if (!internal) {
+				abort=0;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			break;
+		case IWCH_QP_STATE_TERMINATE:
+			qhp->attr.state = IWCH_QP_STATE_TERMINATE;
+			if (!internal) 
+				terminate = 1;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			if (!internal) {
+				abort=1;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			goto err;
+			break;
+		default:
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_CLOSING:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		switch (attrs->next_state) {
+			case IWCH_QP_STATE_IDLE:
+				qhp->attr.state = IWCH_QP_STATE_IDLE;
+				qhp->attr.llp_stream_handle = NULL;
+				put_ep(&qhp->ep->com);
+				qhp->ep = NULL;
+				wake_up(&qhp->wait);
+				break;
+			case IWCH_QP_STATE_ERROR:
+				goto err;
+			default:
+				ret = -EINVAL;
+				goto err;
+		}
+		break;
+	case IWCH_QP_STATE_ERROR:
+		if (attrs->next_state != IWCH_QP_STATE_IDLE) {
+			ret = -EINVAL;
+			goto out;
+		}
+		
+		if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || 
+		    !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		qhp->attr.state = IWCH_QP_STATE_IDLE;
+		memset(&qhp->attr, 0, sizeof(qhp->attr));
+		break;
+	case IWCH_QP_STATE_TERMINATE:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		goto err;
+		break;
+	default:
+		printk(KERN_ERR "%s in a bad state %d\n", 
+		       __FUNCTION__, qhp->attr.state);
+		ret = -EINVAL;
+		goto err;
+		break;
+	}
+	goto out;
+err:
+	PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, 
+	     qhp->wq.qpid);
+
+	/* disassociate the LLP connection */
+	qhp->attr.llp_stream_handle = NULL;
+	ep = qhp->ep;
+	qhp->ep = NULL;
+	qhp->attr.state = IWCH_QP_STATE_ERROR;
+	free=1;
+	wake_up(&qhp->wait);
+	BUG_ON(!ep);
+	flush_qp(qhp, &flag);
+out:
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	if (terminate)
+		iwch_post_terminate(qhp, NULL);
+
+	/*
+	 * If disconnect is 1, then we need to initiate a disconnect
+	 * on the EP.  This can be a normal close (RTS->CLOSING) or
+	 * an abnormal close (RTS/CLOSING->ERROR).
+	 */
+	if (disconnect)
+		iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+
+	/* 
+	 * If free is 1, then we've disassociated the EP from the QP 
+	 * and we need to dereference the EP.
+	 */
+	if (free)
+		put_ep(&ep->com);
+
+	PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state);
+	return ret;
+}
+
+static int quiesce_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_quiesce_tid(qhp->ep);
+	qhp->flags |= QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+static int resume_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_resume_tid(qhp->ep);
+	qhp->flags &= ~QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+int iwch_quiesce_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) {
+			quiesce_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) 
+			quiesce_qp(qhp);
+	}
+	return 0;
+}
+
+int iwch_resume_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) {
+			resume_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp))
+			resume_qp(qhp);
+	}
+	return 0;
+}
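
Editorial note on the state handling in iwch_modify_qp() above: the
transitions it accepts can be summarised in a small standalone sketch.
This is not driver code.  The enum names and qp_next_state() are
hypothetical, the attribute-mask and queue-empty checks are left out, and
the enum order is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the IWCH_QP_STATE_* values. */
enum qp_state { QP_IDLE, QP_RTS, QP_CLOSING, QP_TERMINATE, QP_ERROR,
		QP_NSTATES };

/*
 * Return the state the QP ends up in, or -EINVAL if the request is
 * rejected.  "internal" mirrors the flag passed to iwch_modify_qp().
 * Simplification: the driver also forces the QP into ERROR on some
 * rejected requests (e.g. CLOSING -> RTS); that is not modelled here.
 */
static int qp_next_state(enum qp_state from, enum qp_state to, int internal)
{
	assert(from < QP_NSTATES && to < QP_NSTATES);

	if (from == to)
		return (int)from;	/* nothing to do */

	switch (from) {
	case QP_IDLE:
		return (to == QP_RTS || to == QP_ERROR) ? (int)to : -EINVAL;
	case QP_RTS:
		return (to == QP_CLOSING || to == QP_TERMINATE ||
			to == QP_ERROR) ? (int)to : -EINVAL;
	case QP_CLOSING:
		if (!internal)
			return -EINVAL;
		return (to == QP_IDLE || to == QP_ERROR) ? (int)to : -EINVAL;
	case QP_ERROR:
		return (to == QP_IDLE) ? (int)to : -EINVAL;
	case QP_TERMINATE:
		/* Internal requests always fall through to ERROR. */
		return internal ? (int)QP_ERROR : -EINVAL;
	default:
		return -EINVAL;
	}
}
```

For example, qp_next_state(QP_CLOSING, QP_IDLE, 0) is rejected because
only internal callers may move a QP out of CLOSING.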


From swise at opengridcomputing.com  Sun Dec 10 14:35:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:35:45 -0600
Subject: [openib-general] [PATCH  v3 06/13] Completion Queues
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223545.27166.81531.stgit@dell3.ogc.int>


Functions to poll completion queues (CQs); CQ modify is a stub for now.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cq.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c
new file mode 100644
index 0000000..9d82df4
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+
+/*
+ * Get one cq entry from cxio and map it to openib.
+ *
+ * Returns:
+ * 	0 			EMPTY;
+ *	1			cqe returned
+ *	-EAGAIN 		caller must try again
+ * 	any other -errno	fatal error
+ */
+int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp,
+		     struct ib_wc *wc)
+{
+	struct iwch_qp *qhp = NULL;
+	struct t3_cqe cqe, *rd_cqe;
+	struct t3_wq *wq;
+	u32 credit = 0;
+	u8 cqe_flushed;
+	u64 cookie;
+	int ret = 1;
+
+	rd_cqe = cxio_next_cqe(&chp->cq);
+
+	if (!rd_cqe)
+		return 0;
+
+	qhp = get_qhp(rhp, CQE_QPID(*rd_cqe));
+	if (!qhp)
+		wq = NULL;
+	else {
+		spin_lock(&qhp->lock);
+		wq = &(qhp->wq);
+	}
+	ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie,
+				   &credit);
+	if (t3a_device(chp->rhp) && credit) {
+		PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, 
+		     credit, chp->cq.cqid);
+		cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit);
+	}
+
+	if (ret) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	ret = 1;
+
+	wc->wr_id = cookie;
+	wc->qp_num = qhp->wq.qpid;
+	wc->vendor_err = CQE_STATUS(cqe);
+
+	PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x "
+	     "lo 0x%x cookie 0x%llx\n", __FUNCTION__, 
+	     CQE_QPID(cqe), CQE_TYPE(cqe),
+	     CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe),
+	     CQE_WRID_LOW(cqe), cookie);
+
+	if (CQE_TYPE(cqe) == 0) {
+		if (!CQE_STATUS(cqe))
+			wc->byte_len = CQE_LEN(cqe);
+		else
+			wc->byte_len = 0;
+		wc->opcode = IB_WC_RECV;
+	} else {
+		switch (CQE_OPCODE(cqe)) {
+		case T3_RDMA_WRITE:
+			wc->opcode = IB_WC_RDMA_WRITE;
+			break;
+		case T3_READ_REQ:
+			wc->opcode = IB_WC_RDMA_READ;
+			wc->byte_len = CQE_LEN(cqe);
+			break;
+		case T3_SEND:
+		case T3_SEND_WITH_SE:
+			wc->opcode = IB_WC_SEND;
+			break;
+		case T3_BIND_MW:
+			wc->opcode = IB_WC_BIND_MW;
+			break;
+
+		/* these aren't supported yet */
+		case T3_SEND_WITH_INV:
+		case T3_SEND_WITH_SE_INV:
+		case T3_LOCAL_INV:
+		case T3_FAST_REGISTER:
+		default:
+			printk(KERN_ERR MOD "Unexpected opcode %d "
+			       "in the CQE received for QPID=0x%0x\n", 
+			       CQE_OPCODE(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (cqe_flushed)
+		wc->status = IB_WC_WR_FLUSH_ERR;
+	else {
+		
+		switch (CQE_STATUS(cqe)) {
+		case TPT_ERR_SUCCESS:
+			wc->status = IB_WC_SUCCESS;
+			break;
+		case TPT_ERR_STAG:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_PDID:
+			wc->status = IB_WC_LOC_PROT_ERR;
+			break;
+		case TPT_ERR_QPID:
+		case TPT_ERR_ACCESS:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_WRAP:
+			wc->status = IB_WC_GENERAL_ERR;
+			break;
+		case TPT_ERR_BOUND:
+			wc->status = IB_WC_LOC_LEN_ERR;
+			break;
+		case TPT_ERR_INVALIDATE_SHARED_MR:
+		case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+			wc->status = IB_WC_MW_BIND_ERR;
+			break;
+		case TPT_ERR_CRC:
+		case TPT_ERR_MARKER:
+		case TPT_ERR_PDU_LEN_ERR:
+		case TPT_ERR_OUT_OF_RQE:
+		case TPT_ERR_DDP_VERSION:
+		case TPT_ERR_RDMA_VERSION:
+		case TPT_ERR_DDP_QUEUE_NUM:
+		case TPT_ERR_MSN:
+		case TPT_ERR_TBIT:
+		case TPT_ERR_MO:
+		case TPT_ERR_MSN_RANGE:
+		case TPT_ERR_IRD_OVERFLOW:
+		case TPT_ERR_OPCODE:
+			wc->status = IB_WC_FATAL_ERR;
+			break;
+		case TPT_ERR_SWFLUSH:
+			wc->status = IB_WC_WR_FLUSH_ERR;
+			break;
+		default:
+			printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for "
+			       "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+		}
+	}
+out:
+	if (wq)
+		spin_unlock(&qhp->lock);
+	return ret;
+}
+
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	unsigned long flags;
+	int npolled;
+	int err = 0;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+
+	spin_lock_irqsave(&chp->lock, flags);
+	for (npolled = 0; npolled < num_entries; ++npolled) {
+#ifdef DEBUG
+		int i=0;
+#endif
+
+		/*
+	 	 * Because T3 can post CQEs that are _not_ associated
+	 	 * with a WR, we might have to poll again after removing
+	 	 * one of these.  
+		 */
+		do {
+			err = iwch_poll_cq_one(rhp, chp, wc + npolled);
+#ifdef DEBUG
+			BUG_ON(++i > 1000);
+#endif
+		} while (err == -EAGAIN);
+		if (err <= 0)
+			break;
+	}
+	spin_unlock_irqrestore(&chp->lock, flags);
+
+	if (err < 0)
+		return err;
+	else {
+		return npolled;
+	}
+}
+
+int iwch_modify_cq(struct ib_cq *cq, int cqe)
+{
+	PDBG("iwch_modify_cq: TBD\n");
+	return 0;
+}
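
Editorial note on the retry loop in iwch_poll_cq() above: the contract
between the two functions (1 = entry returned, 0 = empty, -EAGAIN = retry,
other negative = fatal) can be exercised in user space with a scripted
stand-in for the hardware CQ.  The names fake_cq, fake_poll_one and
fake_poll are hypothetical; only the loop shape is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical CQ: replays a fixed script of per-call results. */
struct fake_cq {
	const int *results;	/* 1, 0, -EAGAIN or another -errno */
	size_t len;
	size_t pos;
};

static int fake_poll_one(struct fake_cq *cq)
{
	assert(cq != NULL);
	if (cq->pos >= cq->len)
		return 0;		/* script exhausted: CQ is empty */
	return cq->results[cq->pos++];
}

/* Same control flow as iwch_poll_cq(), minus locking and the wc array. */
static int fake_poll(struct fake_cq *cq, int num_entries)
{
	int npolled;
	int err = 0;

	for (npolled = 0; npolled < num_entries; ++npolled) {
		do {
			/* -EAGAIN means a CQE was consumed without a WR. */
			err = fake_poll_one(cq);
		} while (err == -EAGAIN);
		if (err <= 0)
			break;		/* empty or fatal */
	}
	return err < 0 ? err : npolled;
}
```

Note that, as in the patch, a fatal error is returned in place of the
count, so entries already polled in that call are not reported.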


From swise at opengridcomputing.com  Sun Dec 10 14:36:15 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:36:15 -0600
Subject: [openib-general] [PATCH  v3 07/13] Async Event Handler
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223615.27166.4800.stgit@dell3.ogc.int>


Code to handle async events coming from the T3 RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_ev.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c
new file mode 100644
index 0000000..b0bd014
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/slab.h>
+#include <linux/mman.h>
+#include <net/sock.h>
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp,
+			  struct respQ_msg_t *rsp_msg,
+			  enum ib_event_type ib_event, 
+			  int send_term)
+{
+	struct ib_event event;
+	struct iwch_qp_attributes attrs;
+	struct iwch_qp *qhp;
+
+	printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x "
+	       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+	       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+	       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+	       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+
+	spin_lock(&rnicp->lock);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+
+	if (!qhp) {
+		printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", 
+		       __FUNCTION__, CQE_STATUS(rsp_msg->cqe), 
+		       CQE_QPID(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	if ((qhp->attr.state == IWCH_QP_STATE_ERROR) ||
+	    (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) {
+		PDBG("%s AE received after RTS - "
+		     "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, 
+		     qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	atomic_inc(&qhp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	event.event = ib_event;
+	event.device = chp->ibcq.device;
+	if (ib_event == IB_EVENT_CQ_ERR)
+		event.element.cq = &chp->ibcq;
+	else 
+		event.element.qp = &qhp->ibqp;
+
+	if (qhp->ibqp.event_handler)
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_TERMINATE;
+		iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, 
+			       &attrs, 1);
+		if (send_term)
+			iwch_post_terminate(qhp, rsp_msg);
+	} 
+
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+}
+
+void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
+{
+	struct iwch_dev *rnicp;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	struct iwch_cq *chp;
+	struct iwch_qp *qhp;
+	u32 cqid = RSPQ_CQID(rsp_msg);
+
+	rnicp = (struct iwch_dev *) rdev_p->ulp;
+	spin_lock(&rnicp->lock);
+	chp = get_chp(rnicp, cqid);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+	if (!chp || !qhp) {
+		printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d "
+		       "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", 
+		       cqid, CQE_QPID(rsp_msg->cqe), 
+		       CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+		       CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), 
+		       CQE_WRID_LOW(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		goto out;
+	}
+	iwch_qp_add_ref(&qhp->ibqp);
+	atomic_inc(&chp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	/* 
+	 * 1) completion of our sending a TERMINATE.
+	 * 2) incoming TERMINATE message.  
+	 */
+	if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && 
+	    (CQE_STATUS(rsp_msg->cqe) == 0)) {
+		if (SQ_TYPE(rsp_msg->cqe)) {
+			PDBG("%s QPID 0x%x ep %p disconnecting\n", 
+			     __FUNCTION__, qhp->wq.qpid, qhp->ep);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		} else {
+			PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, 
+			     qhp->wq.qpid);
+			post_qp_event(rnicp, chp, rsp_msg, 
+				      IB_EVENT_QP_REQ_ERR, 0);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		}
+		goto done;
+	}
+
+	/* Bad incoming Read request */
+	if (SQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	/* Bad incoming write */
+	if (RQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	switch (CQE_STATUS(rsp_msg->cqe)) {
+
+	/* Completion Events */
+	case TPT_ERR_SUCCESS:
+
+		/* 
+		 * Confirm the destination entry if this is a RECV completion.
+		 */
+		if (qhp->ep && SQ_TYPE(rsp_msg->cqe))
+			dst_confirm(qhp->ep->dst);
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		break;
+
+	case TPT_ERR_STAG:
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+	case TPT_ERR_WRAP:
+	case TPT_ERR_BOUND:
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x "
+		       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+		       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+		       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+		       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1);
+		break;
+
+	/* Device Fatal Errors */
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1);
+		break;
+	
+	/* QP Fatal Errors */
+	case TPT_ERR_OUT_OF_RQE:
+	case TPT_ERR_PBL_ADDR_BOUND:
+	case TPT_ERR_CRC:
+	case TPT_ERR_MARKER:
+	case TPT_ERR_PDU_LEN_ERR:
+	case TPT_ERR_DDP_VERSION:
+	case TPT_ERR_RDMA_VERSION:
+	case TPT_ERR_OPCODE:
+	case TPT_ERR_DDP_QUEUE_NUM:
+	case TPT_ERR_MSN:
+	case TPT_ERR_TBIT:
+	case TPT_ERR_MO:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_RQE_ADDR_BOUND:
+	case TPT_ERR_IRD_OVERFLOW:
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+
+	default:
+		printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", 
+		       CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+	}
+done:
+	if (atomic_dec_and_test(&chp->refcnt))
+                wake_up(&chp->wait);
+	iwch_qp_rem_ref(&qhp->ibqp);
+out:
+	dev_kfree_skb_irq(skb);
+}
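
Editorial note on the status switch in iwch_ev_dispatch() above: the
grouping of T3 status codes into event classes can be written as a small
standalone function.  The ST_* and EV_* names are hypothetical, the
numeric values are arbitrary (the real TPT_ERR_* values are not shown in
this patch), and only a subset of the codes is listed.  The three checks
that precede the switch (TERMINATE, bad read request, bad write) are not
covered.

```c
#include <assert.h>

/* Hypothetical status codes named after the TPT_ERR_* cases. */
enum t_status {
	ST_SUCCESS, ST_STAG, ST_PDID, ST_QPID, ST_ACCESS, ST_WRAP, ST_BOUND,
	ST_INV_SHARED_MR, ST_INV_MR_WITH_MW_BOUND,
	ST_ECC, ST_ECC_PSTAG, ST_INTERNAL_ERR,
	ST_OUT_OF_RQE, ST_CRC, ST_MARKER, ST_MSN,
	ST_UNKNOWN = 0xff
};

/* Hypothetical event classes matching the four groups in the switch. */
enum ev_class { EV_COMPLETION, EV_QP_ACCESS_ERR, EV_DEVICE_FATAL,
		EV_QP_FATAL };

static enum ev_class classify_status(int status)
{
	assert(status >= 0);

	switch (status) {
	case ST_SUCCESS:
		return EV_COMPLETION;		/* call the comp handler */
	case ST_STAG:
	case ST_PDID:
	case ST_QPID:
	case ST_ACCESS:
	case ST_WRAP:
	case ST_BOUND:
	case ST_INV_SHARED_MR:
	case ST_INV_MR_WITH_MW_BOUND:
		return EV_QP_ACCESS_ERR;
	case ST_ECC:
	case ST_ECC_PSTAG:
	case ST_INTERNAL_ERR:
		return EV_DEVICE_FATAL;
	default:
		/* Listed QP-fatal codes and unknown codes end up here. */
		return EV_QP_FATAL;
	}
}
```

Unknown status values are treated as QP fatal, which matches the default
branch of the switch in the patch.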


From swise at opengridcomputing.com  Sun Dec 10 14:36:45 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:36:45 -0600
Subject: [openib-general] [PATCH  v3 08/13] Memory Registration
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223645.27166.44081.stgit@dell3.ogc.int>


Functions to register memory regions.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_mem.c |  170 ++++++++++++++++++++++++++++++++
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c
new file mode 100644
index 0000000..774d11e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	if (cxio_register_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	/* We could support this... */
+	if (npages > mhp->attr.pbl_size)
+		return -ENOMEM;
+
+	stag = mhp->attr.stag;
+	if (cxio_reregister_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list)
+{
+	u64 mask;
+	int i, j, n;
+
+	mask = 0;
+	*total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
+			return -EINVAL;
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return -EINVAL;
+		*total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	if (*total_size > 0xFFFFFFFFULL)
+		return -ENOMEM;
+
+	/* Find largest page shift we can use to cover buffers */
+	for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift))
+		if (num_phys_buf > 1) {
+			if ((1ULL << *shift) & mask)
+				break;
+		} else 
+			if (1ULL << *shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << *shift) - 1)))
+				break;
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1);
+	buffer_list[0].addr &= ~0ull << *shift;
+
+	*npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		*npages += (buffer_list[i].size + 
+			(1ULL << *shift) - 1) >> *shift;
+
+	if (!*npages)
+		return -EINVAL;
+
+	*page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL);
+	if (!*page_list)
+		return -ENOMEM;
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift;
+		     ++j) 
+			(*page_list)[n++] = cpu_to_be64(buffer_list[i].addr +
+			    ((u64) j << *shift));
+
+	PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n",
+	     __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages);
+
+	return 0;
+
+}


From swise at opengridcomputing.com  Sun Dec 10 14:37:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:37:16 -0600
Subject: [openib-general] [PATCH  v3 09/13] Core WQE/CQE Types
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223715.27166.81773.stgit@dell3.ogc.int>


T3 work queue entry (WQE) and completion queue entry (CQE) structures,
macros, and inline helpers.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
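
Note for reviewers: the Q_* macros in this header treat rptr and wptr as
free-running 32-bit counters. The low size_log2 bits select the slot and
the next bit up drives the generation bit. The following user-space
sketch restates that arithmetic as functions so it can be exercised
outside the kernel. The sketch_* names are hypothetical and not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Slot index: the low size_log2 bits of a free-running pointer. */
static uint32_t sketch_ptr2idx(uint32_t ptr, uint32_t size_log2)
{
	assert(size_log2 < 32);
	return ptr & ((1U << size_log2) - 1);
}

/* Generation bit: flips each time the pointer wraps the queue. */
static int sketch_genbit(uint32_t ptr, uint32_t size_log2)
{
	assert(size_log2 < 32);
	return !((ptr >> size_log2) & 0x1);
}

/* Entries in use; unsigned subtraction keeps this correct across wrap. */
static uint32_t sketch_count(uint32_t rptr, uint32_t wptr)
{
	return wptr - rptr;
}
```

With an 8-entry queue (size_log2 == 3), pointer 9 maps to slot 1, and
the generation bit reads 1 on the first pass, 0 on the second, and 1
again on the third.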

 drivers/infiniband/hw/cxgb3/core/cxio_wr.h |  685 ++++++++++++++++++++++++++++
 1 files changed, 685 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
new file mode 100644
index 0000000..45870be
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
@@ -0,0 +1,685 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_WR_H__
+#define __CXIO_WR_H__
+
+#include <asm/io.h>
+#include <linux/pci.h>
+#include <linux/timer.h>
+#include "firmware_exports.h"
+
+#define T3_MAX_SGE      4
+
+#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr))
+#define Q_FULL(rptr,wptr,size_log2)  ( (((wptr)-(rptr))>>(size_log2)) && \
+				       ((rptr)!=(wptr)) )
+#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>(size_log2))&0x1))
+#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<<(size_log2))-((wptr)-(rptr)))
+#define Q_COUNT(rptr,wptr) ((wptr)-(rptr))
+#define Q_PTR2IDX(ptr,size_log2) ((ptr) & ((1UL<<(size_log2))-1))
+
+static inline void ring_doorbell(void __iomem *doorbell, u32 qpid) 
+{
+	writel(((1<<31) | qpid), doorbell);
+}
+
+#define SEQ32_GE(x,y) (!( (((u32) (x)) - ((u32) (y))) & 0x80000000 ))
+
+enum t3_wr_flags {
+	T3_COMPLETION_FLAG = 0x01,
+	T3_NOTIFY_FLAG = 0x02,
+	T3_SOLICITED_EVENT_FLAG = 0x04,
+	T3_READ_FENCE_FLAG = 0x08,
+	T3_LOCAL_FENCE_FLAG = 0x10
+} __attribute__ ((packed));
+
+enum t3_wr_opcode {
+	T3_WR_BP = FW_WROPCODE_RI_BYPASS,
+	T3_WR_SEND = FW_WROPCODE_RI_SEND,
+	T3_WR_WRITE = FW_WROPCODE_RI_RDMA_WRITE,
+	T3_WR_READ = FW_WROPCODE_RI_RDMA_READ,
+	T3_WR_INV_STAG = FW_WROPCODE_RI_LOCAL_INV,
+	T3_WR_BIND = FW_WROPCODE_RI_BIND_MW,
+	T3_WR_RCV = FW_WROPCODE_RI_RECEIVE,
+	T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT,
+	T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP
+} __attribute__ ((packed));
+
+enum t3_rdma_opcode {
+	T3_RDMA_WRITE,		/* IETF RDMAP v1.0 ... */
+	T3_READ_REQ,
+	T3_READ_RESP,
+	T3_SEND,
+	T3_SEND_WITH_INV,
+	T3_SEND_WITH_SE,
+	T3_SEND_WITH_SE_INV,
+	T3_TERMINATE,
+	T3_RDMA_INIT,		/* CHELSIO RI specific ... */
+	T3_BIND_MW,
+	T3_FAST_REGISTER,
+	T3_LOCAL_INV,
+	T3_QP_MOD,
+	T3_BYPASS
+} __attribute__ ((packed));
+
+static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop)
+{
+	switch (wrop) {
+		case T3_WR_BP: return T3_BYPASS;
+		case T3_WR_SEND: return T3_SEND;
+		case T3_WR_WRITE: return T3_RDMA_WRITE;
+		case T3_WR_READ: return T3_READ_REQ;
+		case T3_WR_INV_STAG: return T3_LOCAL_INV;
+		case T3_WR_BIND: return T3_BIND_MW;
+		case T3_WR_INIT: return T3_RDMA_INIT;
+		case T3_WR_QP_MOD: return T3_QP_MOD;
+		default: break;
+	}
+	return -1;
+}
+
+
+/* Work request id */
+union t3_wrid {
+	struct {
+		u32 hi;
+		u32 low;
+	} id0;
+	u64 id1;
+};
+
+#define WRID(wrid)      	(wrid.id1)
+#define WRID_GEN(wrid)		(wrid.id0.wr_gen)
+#define WRID_IDX(wrid)		(wrid.id0.wr_idx)
+#define WRID_LO(wrid)		(wrid.id0.wr_lo)
+
+struct fw_riwrh {
+	__be32 op_seop_flags;
+	__be32 gen_tid_len;
+};
+
+#define S_FW_RIWR_OP		24
+#define M_FW_RIWR_OP		0xff
+#define V_FW_RIWR_OP(x)		((x) << S_FW_RIWR_OP)
+#define G_FW_RIWR_OP(x)   	((((x) >> S_FW_RIWR_OP)) & M_FW_RIWR_OP)
+
+#define S_FW_RIWR_SOPEOP	22
+#define M_FW_RIWR_SOPEOP	0x3
+#define V_FW_RIWR_SOPEOP(x)	((x) << S_FW_RIWR_SOPEOP)
+
+#define S_FW_RIWR_FLAGS		8
+#define M_FW_RIWR_FLAGS		0x3fffff
+#define V_FW_RIWR_FLAGS(x)	((x) << S_FW_RIWR_FLAGS)
+#define G_FW_RIWR_FLAGS(x)   	((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS)
+
+#define S_FW_RIWR_TID		8
+#define V_FW_RIWR_TID(x)	((x) << S_FW_RIWR_TID)
+
+#define S_FW_RIWR_LEN		0
+#define V_FW_RIWR_LEN(x)	((x) << S_FW_RIWR_LEN)
+
+#define S_FW_RIWR_GEN           31
+#define V_FW_RIWR_GEN(x)        ((x)  << S_FW_RIWR_GEN)
+
+struct t3_sge {
+	__be32 stag;
+	__be32 len;
+	__be64 to;
+};
+
+/* If num_sgle is zero, flit 5+ contains immediate data. */
+struct t3_send_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;	
+	__be32 plen;		/* 3 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 4+ */
+};
+
+struct t3_local_inv_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 stag;		/* 2 */
+	__be32 reserved3;
+};
+
+struct t3_rdma_write_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 stag_sink;
+	__be64 to_sink;		/* 3 */
+	__be32 plen;		/* 4 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 5+ */
+};
+
+struct t3_rdma_read_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;
+	__be64 rem_to;		/* 3 */
+	__be32 local_stag;	/* 4 */
+	__be32 local_len;
+	__be64 local_to;	/* 5 */
+};
+
+enum t3_addr_type {
+	T3_VA_BASED_TO = 0x0,
+	T3_ZERO_BASED_TO = 0x1
+} __attribute__ ((packed));
+
+enum t3_mem_perms {
+	T3_MEM_ACCESS_LOCAL_READ = 0x1,
+	T3_MEM_ACCESS_LOCAL_WRITE = 0x2,
+	T3_MEM_ACCESS_REM_READ = 0x4,
+	T3_MEM_ACCESS_REM_WRITE = 0x8
+} __attribute__ ((packed));
+
+struct t3_bind_mw_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u16 reserved;		/* 2 */
+	u8 type;
+	u8 perms;
+	__be32 mr_stag;
+	__be32 mw_stag;		/* 3 */
+	__be32 mw_len;
+	__be64 mw_va;		/* 4 */
+	__be32 mr_pbl_addr;	/* 5 */
+	u8 reserved2[3];
+	u8 mr_pagesz;
+};
+
+struct t3_receive_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 pagesz[T3_MAX_SGE];
+	__be32 num_sgle;		/* 2 */
+	struct t3_sge sgl[T3_MAX_SGE];	/* 3+ */
+	__be32 pbl_addr[T3_MAX_SGE];
+};
+
+struct t3_bypass_wr {
+	struct fw_riwrh wrh;
+	union t3_wrid wrid;	/* 1 */
+};
+
+struct t3_modify_qp_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 flags;		/* 2 */
+	__be32 quiesce;		/* 2 */
+	__be32 max_ird;		/* 3 */
+	__be32 max_ord;		/* 3 */
+	__be64 sge_cmd;		/* 4 */
+	__be64 ctx1;		/* 5 */
+	__be64 ctx0;		/* 6 */
+};
+
+enum t3_modify_qp_flags {
+	MODQP_QUIESCE  = 0x01,
+	MODQP_MAX_IRD  = 0x02,
+	MODQP_MAX_ORD  = 0x04,
+	MODQP_WRITE_EC = 0x08,
+	MODQP_READ_EC  = 0x10,
+};
+	
+
+enum t3_mpa_attrs {
+	uP_RI_MPA_RX_MARKER_ENABLE = 0x1,
+	uP_RI_MPA_TX_MARKER_ENABLE = 0x2,
+	uP_RI_MPA_CRC_ENABLE = 0x4,
+	uP_RI_MPA_IETF_ENABLE = 0x8
+} __attribute__ ((packed));
+
+enum t3_qp_caps {
+	uP_RI_QP_RDMA_READ_ENABLE = 0x01,
+	uP_RI_QP_RDMA_WRITE_ENABLE = 0x02,
+	uP_RI_QP_BIND_ENABLE = 0x04,
+	uP_RI_QP_FAST_REGISTER_ENABLE = 0x08,
+	uP_RI_QP_STAG0_ENABLE = 0x10
+} __attribute__ ((packed));
+
+struct t3_rdma_init_attr {
+	u32 tid;
+	u32 qpid;
+	u32 pdid;
+	u32 scqid;
+	u32 rcqid;
+	u32 rq_addr;
+	u32 rq_size;
+	enum t3_mpa_attrs mpaattrs;
+	enum t3_qp_caps qpcaps;
+	u16 tcp_emss;
+	u32 ord;
+	u32 ird;
+	u64 qp_dma_addr;
+	u32 qp_dma_size;
+	u32 flags;
+};
+
+struct t3_rdma_init_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 qpid;		/* 2 */
+	__be32 pdid;
+	__be32 scqid;		/* 3 */
+	__be32 rcqid;
+	__be32 rq_addr;		/* 4 */
+	__be32 rq_size;
+	u8 mpaattrs;		/* 5 */
+	u8 qpcaps;
+	__be16 ulpdu_size;
+	__be32 flags;		/* bits 31-1 - reserved */
+				/* bit     0 - set if RECV posted */
+	__be32 ord;		/* 6 */
+	__be32 ird;
+	__be64 qp_dma_addr;	/* 7 */
+	__be32 qp_dma_size;	/* 8 */
+	u32 rsvd;
+};
+
+struct t3_genbit {
+	u64 flit[15];
+	__be64 genbit;
+};
+
+enum rdma_init_wr_flags {
+	RECVS_POSTED = 1,
+};
+
+union t3_wr {
+	struct t3_send_wr send;
+	struct t3_rdma_write_wr write;
+	struct t3_rdma_read_wr read;
+	struct t3_receive_wr recv;
+	struct t3_local_inv_wr local_inv;
+	struct t3_bind_mw_wr bind;
+	struct t3_bypass_wr bypass;
+	struct t3_rdma_init_wr init;
+	struct t3_modify_qp_wr qp_mod;
+	struct t3_genbit genbit;
+	u64 flit[16];
+};
+
+#define T3_SQ_CQE_FLIT 	  13
+#define T3_SQ_COOKIE_FLIT 14
+
+#define T3_RQ_COOKIE_FLIT 13
+#define T3_RQ_CQE_FLIT 	  14
+
+static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe)
+{
+	return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags));
+}
+
+static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op,
+				  enum t3_wr_flags flags, u8 genbit, u32 tid,
+				  u8 len)
+{
+	wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) |
+					 V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) |
+					 V_FW_RIWR_FLAGS(flags));
+	wmb();
+	wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) |
+				       V_FW_RIWR_TID(tid) |
+				       V_FW_RIWR_LEN(len));
+	/* 2nd gen bit... */
+	((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit);
+}
+
+/*
+ * T3 ULP2_TX commands
+ */
+enum t3_utx_mem_op {
+	T3_UTX_MEM_READ = 2,
+	T3_UTX_MEM_WRITE = 3
+};
+
+/* T3 MC7 RDMA TPT entry format */
+
+enum tpt_mem_type {
+	TPT_NON_SHARED_MR = 0x0,
+	TPT_SHARED_MR = 0x1,
+	TPT_MW = 0x2,
+	TPT_MW_RELAXED_PROTECTION = 0x3
+};
+
+enum tpt_addr_type {
+	TPT_ZBTO = 0,
+	TPT_VATO = 1
+};
+
+enum tpt_mem_perm {
+	TPT_LOCAL_READ = 0x8,
+	TPT_LOCAL_WRITE = 0x4,
+	TPT_REMOTE_READ = 0x2,
+	TPT_REMOTE_WRITE = 0x1
+};
+
+struct tpt_entry {
+	__be32 valid_stag_pdid;
+	__be32 flags_pagesize_qpid;
+
+	__be32 rsvd_pbl_addr;
+	__be32 len;
+	__be32 va_hi;
+	__be32 va_low_or_fbo;
+
+	__be32 rsvd_bind_cnt_or_pstag;
+	__be32 rsvd_pbl_size;
+};
+
+#define S_TPT_VALID		31
+#define V_TPT_VALID(x)		((x) << S_TPT_VALID)
+#define F_TPT_VALID		V_TPT_VALID(1U)
+
+#define S_TPT_STAG_KEY		23
+#define M_TPT_STAG_KEY		0xFF
+#define V_TPT_STAG_KEY(x)	((x) << S_TPT_STAG_KEY)
+#define G_TPT_STAG_KEY(x)	(((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY)
+
+#define S_TPT_STAG_STATE	22
+#define V_TPT_STAG_STATE(x)	((x) << S_TPT_STAG_STATE)
+#define F_TPT_STAG_STATE	V_TPT_STAG_STATE(1U)
+
+#define S_TPT_STAG_TYPE		20
+#define M_TPT_STAG_TYPE		0x3
+#define V_TPT_STAG_TYPE(x)	((x) << S_TPT_STAG_TYPE)
+#define G_TPT_STAG_TYPE(x)	(((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE)
+
+#define S_TPT_PDID		0
+#define M_TPT_PDID		0xFFFFF
+#define V_TPT_PDID(x)		((x) << S_TPT_PDID)
+#define G_TPT_PDID(x)		(((x) >> S_TPT_PDID) & M_TPT_PDID)
+
+#define S_TPT_PERM		28
+#define M_TPT_PERM		0xF
+#define V_TPT_PERM(x)		((x) << S_TPT_PERM)
+#define G_TPT_PERM(x)		(((x) >> S_TPT_PERM) & M_TPT_PERM)
+
+#define S_TPT_REM_INV_DIS	27
+#define V_TPT_REM_INV_DIS(x)	((x) << S_TPT_REM_INV_DIS)
+#define F_TPT_REM_INV_DIS	V_TPT_REM_INV_DIS(1U)
+
+#define S_TPT_ADDR_TYPE		26
+#define V_TPT_ADDR_TYPE(x)	((x) << S_TPT_ADDR_TYPE)
+#define F_TPT_ADDR_TYPE		V_TPT_ADDR_TYPE(1U)
+
+#define S_TPT_MW_BIND_ENABLE	25
+#define V_TPT_MW_BIND_ENABLE(x)	((x) << S_TPT_MW_BIND_ENABLE)
+#define F_TPT_MW_BIND_ENABLE    V_TPT_MW_BIND_ENABLE(1U)
+
+#define S_TPT_PAGE_SIZE		20
+#define M_TPT_PAGE_SIZE		0x1F
+#define V_TPT_PAGE_SIZE(x)	((x) << S_TPT_PAGE_SIZE)
+#define G_TPT_PAGE_SIZE(x)	(((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE)
+
+#define S_TPT_PBL_ADDR		0
+#define M_TPT_PBL_ADDR		0x1FFFFFFF
+#define V_TPT_PBL_ADDR(x)	((x) << S_TPT_PBL_ADDR)
+#define G_TPT_PBL_ADDR(x)       (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR)
+
+#define S_TPT_QPID		0
+#define M_TPT_QPID		0xFFFFF
+#define V_TPT_QPID(x)		((x) << S_TPT_QPID)
+#define G_TPT_QPID(x)		(((x) >> S_TPT_QPID) & M_TPT_QPID)
+
+#define S_TPT_PSTAG		0
+#define M_TPT_PSTAG		0xFFFFFF
+#define V_TPT_PSTAG(x)		((x) << S_TPT_PSTAG)
+#define G_TPT_PSTAG(x)		(((x) >> S_TPT_PSTAG) & M_TPT_PSTAG)
+
+#define S_TPT_PBL_SIZE		0
+#define M_TPT_PBL_SIZE		0xFFFFF
+#define V_TPT_PBL_SIZE(x)	((x) << S_TPT_PBL_SIZE)
+#define G_TPT_PBL_SIZE(x)	(((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE)
+
+/*
+ * CQE defs
+ */
+struct t3_cqe {
+	__be32 header;
+	__be32 len;
+	union {
+		struct {
+			__be32 stag;
+			__be32 msn;
+		} rcqe;
+		struct {
+			u32 wrid_hi;	
+			u32 wrid_low;
+		} scqe;
+	} u;
+};
+
+#define S_CQE_OOO	  31
+#define M_CQE_OOO	  0x1
+#define G_CQE_OOO(x)	  ((((x) >> S_CQE_OOO)) & M_CQE_OOO)
+#define V_CQE_OOO(x)	  ((x)<<S_CQE_OOO)
+
+#define S_CQE_QPID        12
+#define M_CQE_QPID        0x7FFFF
+#define G_CQE_QPID(x)     ((((x) >> S_CQE_QPID)) & M_CQE_QPID)
+#define V_CQE_QPID(x) 	  ((x)<<S_CQE_QPID)
+
+#define S_CQE_SWCQE       11
+#define M_CQE_SWCQE       0x1
+#define G_CQE_SWCQE(x)    ((((x) >> S_CQE_SWCQE)) & M_CQE_SWCQE)
+#define V_CQE_SWCQE(x) 	  ((x)<<S_CQE_SWCQE)
+
+#define S_CQE_GENBIT      10
+#define M_CQE_GENBIT      0x1
+#define G_CQE_GENBIT(x)   (((x) >> S_CQE_GENBIT) & M_CQE_GENBIT)
+#define V_CQE_GENBIT(x)	  ((x)<<S_CQE_GENBIT)
+
+#define S_CQE_STATUS      5
+#define M_CQE_STATUS      0x1F
+#define G_CQE_STATUS(x)   ((((x) >> S_CQE_STATUS)) & M_CQE_STATUS)
+#define V_CQE_STATUS(x)   ((x)<<S_CQE_STATUS)
+
+#define S_CQE_TYPE        4
+#define M_CQE_TYPE        0x1
+#define G_CQE_TYPE(x)     ((((x) >> S_CQE_TYPE)) & M_CQE_TYPE)
+#define V_CQE_TYPE(x)     ((x)<<S_CQE_TYPE)
+
+#define S_CQE_OPCODE      0
+#define M_CQE_OPCODE      0xF
+#define G_CQE_OPCODE(x)   ((((x) >> S_CQE_OPCODE)) & M_CQE_OPCODE)
+#define V_CQE_OPCODE(x)   ((x)<<S_CQE_OPCODE)
+
+#define SW_CQE(x)         (G_CQE_SWCQE(be32_to_cpu((x).header)))
+#define CQE_OOO(x)        (G_CQE_OOO(be32_to_cpu((x).header)))
+#define CQE_QPID(x)       (G_CQE_QPID(be32_to_cpu((x).header)))
+#define CQE_GENBIT(x)     (G_CQE_GENBIT(be32_to_cpu((x).header)))
+#define CQE_TYPE(x)       (G_CQE_TYPE(be32_to_cpu((x).header)))
+#define SQ_TYPE(x)	  (CQE_TYPE((x)))
+#define RQ_TYPE(x)	  (!CQE_TYPE((x)))
+#define CQE_STATUS(x)     (G_CQE_STATUS(be32_to_cpu((x).header)))
+#define CQE_OPCODE(x)     (G_CQE_OPCODE(be32_to_cpu((x).header)))
+
+#define CQE_LEN(x)        (be32_to_cpu((x).len))
+
+/* used for RQ completion processing */
+#define CQE_WRID_STAG(x)  (be32_to_cpu((x).u.rcqe.stag))
+#define CQE_WRID_MSN(x)   (be32_to_cpu((x).u.rcqe.msn))
+
+/* used for SQ completion processing */
+#define CQE_WRID_SQ_WPTR(x)	((x).u.scqe.wrid_hi)
+#define CQE_WRID_WPTR(x)   	((x).u.scqe.wrid_low)
+
+/* generic accessor macros */
+#define CQE_WRID_HI(x)		((x).u.scqe.wrid_hi)
+#define CQE_WRID_LOW(x) 	((x).u.scqe.wrid_low)
+
+#define TPT_ERR_SUCCESS                     0x0
+#define TPT_ERR_STAG                        0x1	 /* STAG invalid: either the */
+						 /* STAG is off-limits, being 0, */
+						 /* or STAG_key mismatch */
+#define TPT_ERR_PDID                        0x2	 /* PDID mismatch */
+#define TPT_ERR_QPID                        0x3	 /* QPID mismatch */
+#define TPT_ERR_ACCESS                      0x4	 /* Invalid access right */
+#define TPT_ERR_WRAP                        0x5	 /* Wrap error */
+#define TPT_ERR_BOUND                       0x6	 /* base and bounds violation */
+#define TPT_ERR_INVALIDATE_SHARED_MR        0x7	 /* attempt to invalidate a  */
+						 /* shared memory region */
+#define TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND 0x8	 /* attempt to invalidate an */
+						 /* MR with MWs bound to it */
+#define TPT_ERR_ECC                         0x9	 /* ECC error detected */
+#define TPT_ERR_ECC_PSTAG                   0xA	 /* ECC error detected when  */
+						 /* reading PSTAG for a MW  */
+						 /* Invalidate */
+#define TPT_ERR_PBL_ADDR_BOUND              0xB	 /* pbl addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_SWFLUSH			    0xC	 /* SW FLUSHED */
+#define TPT_ERR_CRC                         0x10 /* CRC error */
+#define TPT_ERR_MARKER                      0x11 /* Marker error */
+#define TPT_ERR_PDU_LEN_ERR                 0x12 /* invalid PDU length */
+#define TPT_ERR_OUT_OF_RQE                  0x13 /* out of RQE */
+#define TPT_ERR_DDP_VERSION                 0x14 /* wrong DDP version */
+#define TPT_ERR_RDMA_VERSION                0x15 /* wrong RDMA version */
+#define TPT_ERR_OPCODE                      0x16 /* invalid rdma opcode */
+#define TPT_ERR_DDP_QUEUE_NUM               0x17 /* invalid ddp queue number */
+#define TPT_ERR_MSN                         0x18 /* MSN error */
+#define TPT_ERR_TBIT                        0x19 /* tag bit not set correctly */
+#define TPT_ERR_MO                          0x1A /* MO not 0 for TERMINATE  */
+						 /* or READ_REQ */
+#define TPT_ERR_MSN_GAP                     0x1B
+#define TPT_ERR_MSN_RANGE                   0x1C
+#define TPT_ERR_IRD_OVERFLOW                0x1D
+#define TPT_ERR_RQE_ADDR_BOUND              0x1E /* RQE addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_INTERNAL_ERR                0x1F /* internal error (opcode  */
+						 /* mismatch) */
+
+struct t3_swsq {
+	__u64 			wr_id;
+	struct t3_cqe 		cqe;
+	__u32			sq_wptr;
+	__be32			read_len;
+	int 			opcode;
+	int			complete;
+	int			signaled;	
+};
+
+/*
+ * A T3 WQ implements both the SQ and RQ.
+ */
+struct t3_wq {
+	union t3_wr *queue;		/* DMA accessible memory */
+	dma_addr_t dma_addr;		/* DMA address for HW */
+	DECLARE_PCI_UNMAP_ADDR(mapping)	/* unmap cruft */
+	u32 error;			/* 1 once we go to ERROR */
+	u32 qpid;
+	u32 wptr;			/* idx to next available WR slot */
+	u32 size_log2;			/* total wq size */
+	struct t3_swsq *sq;		/* SW SQ */
+	struct t3_swsq *oldest_read;	/* tracks oldest pending read */
+	u32 sq_wptr;			/* sq_wptr - sq_rptr == count of */
+	u32 sq_rptr;			/* pending wrs */
+	u32 sq_size_log2;		/* sq size */
+	u64 *rq;			/* SW RQ (holds consumer wr_ids) */
+	u32 rq_wptr;			/* rq_wptr - rq_rptr == count of */
+	u32 rq_rptr;			/* pending wrs */
+	u64 *rq_oldest_wr;		/* oldest wr on the SW RQ */
+	u32 rq_size_log2;		/* rq size */
+	u32 rq_addr;			/* rq adapter address */
+	void __iomem *doorbell;		/* kernel db */
+	u64 udb;			/* user db if any */
+};
+
+struct t3_cq {
+	u32 cqid;
+	u32 rptr;
+	u32 wptr;
+	u32 size_log2;
+	dma_addr_t dma_addr;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct t3_cqe *queue;
+	struct t3_cqe *sw_queue;
+	u32 sw_rptr;
+	u32 sw_wptr;
+};
+
+#define CQ_VLD_ENTRY(ptr,size_log2,cqe) (Q_GENBIT(ptr,size_log2) == \
+					 CQE_GENBIT(*cqe))
+
+static inline void cxio_set_wq_in_error(struct t3_wq *wq)
+{
+	wq->queue->flit[13] = 1;
+}
+
+static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+#endif
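
Note for reviewers: the S_CQE_*/M_CQE_* pairs above define the layout of
the 32-bit CQE header once it has been converted to host byte order. The
user-space sketch below packs and unpacks the same fields using those
shift and mask values. The struct and function names are hypothetical
and only illustrate the layout; the SWCQE and OOO bits are left out for
brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded view of a CQE header (host byte order). */
struct sketch_cqe_hdr {
	uint32_t qpid;		/* bits 30..12 */
	uint32_t genbit;	/* bit  10     */
	uint32_t status;	/* bits  9..5  */
	uint32_t type;		/* bit   4     */
	uint32_t opcode;	/* bits  3..0  */
};

/* Pack fields into a header word; each value must fit its mask. */
static uint32_t sketch_encode_cqe(uint32_t qpid, uint32_t genbit,
				  uint32_t status, uint32_t type,
				  uint32_t opcode)
{
	assert(qpid <= 0x7FFFF && genbit <= 0x1 && status <= 0x1F);
	assert(type <= 0x1 && opcode <= 0xF);
	return (qpid << 12) | (genbit << 10) | (status << 5) |
	       (type << 4) | opcode;
}

/* Unpack a header word, mirroring the G_CQE_* accessors. */
static struct sketch_cqe_hdr sketch_decode_cqe(uint32_t hdr)
{
	struct sketch_cqe_hdr d;

	d.qpid = (hdr >> 12) & 0x7FFFF;
	d.genbit = (hdr >> 10) & 0x1;
	d.status = (hdr >> 5) & 0x1F;
	d.type = (hdr >> 4) & 0x1;
	d.opcode = hdr & 0xF;
	return d;
}
```

In the driver the header is stored big-endian, so the CQE_* macros apply
be32_to_cpu() before extracting a field; the sketch starts from the
already converted value.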


From swise at opengridcomputing.com  Sun Dec 10 14:37:46 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:37:46 -0600
Subject: [openib-general] [PATCH  v3 10/13] Core HAL
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223746.27166.57624.stgit@dell3.ogc.int>


The RDMA core interfaces with the T3 HW and the ULLD, providing a
low-level RDMA interface.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
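
Note for reviewers: cxio_hal_cq_op() advances a local copy of rptr until
the slot after it matches the index returned by the rearm, then spins on
that entry. The user-space sketch below isolates the pointer walk so the
wrap behaviour can be checked by hand. The function name is hypothetical,
and the sketch assumes the caller has already established that the
current slot differs from the returned index, as the patch does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the free-running pointer of the last CQE that was in flight
 * when the hardware rearmed the CQ, i.e. the pointer whose successor
 * maps to hw_idx.
 */
static uint32_t sketch_last_inflight(uint32_t rptr, uint32_t size_log2,
				     uint32_t hw_idx)
{
	uint32_t mask;

	assert(size_log2 < 32);
	mask = (1U << size_log2) - 1;
	assert(hw_idx <= mask);			/* otherwise the walk never ends */
	assert((rptr & mask) != hw_idx);	/* guard applied by the caller */

	/* Bump the full pointer, not the index, so the generation stays right. */
	while (((rptr + 1) & mask) != hw_idx)
		rptr++;
	return rptr;
}
```

With an 8-entry CQ, rptr == 6 and a returned index of 1, the walk stops
at pointer 8: slot 0 of the next pass, whose expected generation bit is
the opposite of the one at pointer 6.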

 drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_hal.h |  201 ++++
 2 files changed, 1503 insertions(+), 0 deletions(-)
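
Note for reviewers: get_qpid() takes one qpid from the resource pool and
caches the ids that follow it for the same user context, stopping when
the id wraps past qpmask. The user-space sketch below reproduces only
that walk. The function name is hypothetical, and it assumes qpmask has
the form 2^k - 1, which this patch does not show directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the qpids that follow base up to the next multiple of
 * (qpmask + 1). Returns the number of entries written to out,
 * which is at most max.
 */
static size_t sketch_sibling_qpids(uint32_t base, uint32_t qpmask,
				   uint32_t *out, size_t max)
{
	size_t n = 0;
	uint32_t i;

	/* assumption: qpmask is of the form 2^k - 1 */
	assert((qpmask & (qpmask + 1)) == 0);

	for (i = base + 1; (i & qpmask) && n < max; i++)
		out[n++] = i;
	return n;
}
```

With base == 8 and qpmask == 0x3 the cached ids are 9, 10 and 11. This
matches cxio_release_ucontext(), which hands only ids with
(qpid & qpmask) == 0 back to the resource pool.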

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
new file mode 100644
index 0000000..ffc4ec0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -0,0 +1,1302 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/semaphore.h>
+#include <asm/delay.h>
+
+#include <linux/netdevice.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+#include "sge_defs.h"
+
+static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC];
+static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL;
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (!strcmp(rdev_tbl[i]->dev_name, dev_name))
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev
+							     *tdev)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (rdev_tbl[i]->t3cdev_p == tdev)
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (!rdev_tbl[i]) {
+			rdev_tbl[i] = rdev_p;
+			break;
+		}
+	return (i == T3_MAX_NUM_RNIC);
+}
+
+static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i] == rdev_p) {
+			rdev_tbl[i] = NULL;
+			break;
+		}
+}
+
+int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, 
+		   enum t3_cq_opcode op, u32 credit)
+{
+	int ret;
+	struct t3_cqe *cqe;
+	u32 rptr;
+
+	struct rdma_cq_op setup;
+	setup.id = cq->cqid;
+	setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0;
+	setup.op = op;
+	ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup);
+
+	if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) 
+		return ret;
+
+	/*
+	 * If the rearm returned an index other than our current index,
+	 * then there might be CQEs in flight (being DMA'd).  We must wait
+	 * here for them to complete or the consumer can miss a notification.
+	 */
+	if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) {
+		int i=0;
+
+		rptr = cq->rptr;
+
+		/*
+		 * Keep the generation correct by bumping rptr until it
+		 * matches the index returned by the rearm - 1.
+		 */
+		while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret)
+			rptr++;
+
+		/*
+		 * Now rptr is the index for the (last) cqe that was
+		 * in-flight at the time the HW rearmed the CQ.  We
+		 * spin until that CQE is valid.
+		 */
+		cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2);
+		while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) {
+			udelay(1);
+			if (i++ > 1000000) {
+				printk(KERN_ERR "%s: stalled rnic\n",
+				       rdev_p->dev_name);
+				BUG_ON(1);
+				return -EIO;
+			}
+		}
+	}
+	return 0;
+}
+
+static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cqid;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 0;		/* disable the CQ */
+	setup.credits = 0;
+	setup.credit_thres = 0;
+	setup.ovfl_mode = 0;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
+{
+	u64 sge_cmd;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = qpid << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe);
+
+	cq->cqid = cxio_hal_get_cqid(rdev_p->rscp);
+	if (!cq->cqid)
+		return -ENOMEM;
+	cq->sw_queue = kzalloc(size, GFP_KERNEL);
+	if (!cq->sw_queue)
+		return -ENOMEM;
+	cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     (1UL << (cq->size_log2)) *
+					     sizeof(struct t3_cqe),
+					     &(cq->dma_addr), GFP_KERNEL);
+	if (!cq->queue) {
+		kfree(cq->sw_queue);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(cq, mapping, cq->dma_addr);
+	memset(cq->queue, 0, size);
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = 65535;
+	setup.credit_thres = 1;
+	if (rdev_p->t3cdev_p->type == T3B)
+		setup.ovfl_mode = 0;
+	else
+		setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = setup.size;
+	setup.credit_thres = setup.size;	/* TBD: overflow recovery */
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	u32 qpid;
+	int i;
+
+	mutex_lock(&uctx->lock);
+	if (!list_empty(&uctx->qpids)) {
+		entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, 
+				   entry);
+		list_del(&entry->entry);
+		qpid = entry->qpid;
+		kfree(entry);
+	} else {
+		qpid = cxio_hal_get_qpid(rdev_p->rscp);
+		if (!qpid) 
+			goto out;
+		for (i = qpid+1; i & rdev_p->qpmask; i++) {
+			entry = kmalloc(sizeof *entry, GFP_KERNEL);
+			if (!entry)
+				break;
+			entry->qpid = i;
+			list_add_tail(&entry->entry, &uctx->qpids);
+		}
+	}
+out:
+	mutex_unlock(&uctx->lock);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, 
+		     struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	
+	entry = kmalloc(sizeof *entry, GFP_KERNEL);
+	if (!entry) 
+		return;
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	entry->qpid = qpid;
+	mutex_lock(&uctx->lock);
+	list_add_tail(&entry->entry, &uctx->qpids);
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct list_head *pos, *nxt;
+	struct cxio_qpid_list *entry;
+
+	mutex_lock(&uctx->lock);
+	list_for_each_safe(pos, nxt, &uctx->qpids) {
+		entry = list_entry(pos, struct cxio_qpid_list, entry);
+		list_del_init(&entry->entry);
+		if (!(entry->qpid & rdev_p->qpmask))
+			cxio_hal_put_qpid(rdev_p->rscp, entry->qpid);
+		kfree(entry);
+	}
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	INIT_LIST_HEAD(&uctx->qpids);
+	mutex_init(&uctx->lock);
+}
+
+int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain,
+		   struct t3_wq *wq, struct cxio_ucontext *uctx)
+{
+	int depth = 1UL << wq->size_log2;
+	int rqsize = 1UL << wq->rq_size_log2;
+
+	wq->qpid = get_qpid(rdev_p, uctx);
+	if (!wq->qpid)
+		return -ENOMEM;
+
+	wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL);
+	if (!wq->rq)
+		goto err1;
+
+	wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize);
+	if (!wq->rq_addr)
+		goto err2;
+
+	wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL);
+	if (!wq->sq)
+		goto err3;
+	
+	wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     depth * sizeof(union t3_wr),
+					     &(wq->dma_addr), GFP_KERNEL);
+	if (!wq->queue)
+		goto err4;
+
+	memset(wq->queue, 0, depth * sizeof(union t3_wr));
+	pci_unmap_addr_set(wq, mapping, wq->dma_addr);
+	wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	if (!kernel_domain)
+		wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + 
+					(wq->qpid << rdev_p->qpshift);
+	PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, 
+	     wq->qpid, wq->doorbell, wq->udb);
+	return 0;
+err4:
+	kfree(wq->sq);
+err3:
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize);
+err2:
+	kfree(wq->rq);
+err1:
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return -ENOMEM;
+}
+
+int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	int err;
+	err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid);
+	kfree(cq->sw_queue);
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (cq->size_log2))
+			  * sizeof(struct t3_cqe), cq->queue, 
+			  pci_unmap_addr(cq, mapping));
+	cxio_hal_put_cqid(rdev_p->rscp, cq->cqid);
+	return err;
+}
+
+int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (wq->size_log2))
+			  * sizeof(union t3_wr), wq->queue, 
+			  pci_unmap_addr(wq, mapping));
+	kfree(wq->sq);
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2));
+	kfree(wq->rq);
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return 0;
+}
+
+static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(T3_SEND) | 
+		         	 V_CQE_TYPE(0) |
+		         	 V_CQE_SWCQE(1) |
+		         	 V_CQE_QPID(wq->qpid) | 
+		         	 V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	u32 ptr;
+
+	PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq);
+
+	/* flush RQ */
+	PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, 
+	    wq->rq_rptr, wq->rq_wptr, count);
+	ptr = wq->rq_rptr + count;
+	while (ptr++ != wq->rq_wptr)
+		insert_recv_cqe(wq, cq);
+}
+
+static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, 
+		          struct t3_swsq *sqp)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(sqp->opcode) |
+			         V_CQE_TYPE(1) |
+			         V_CQE_SWCQE(1) |
+			         V_CQE_QPID(wq->qpid) | 
+			         V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	cqe.u.scqe.wrid_hi = sqp->sq_wptr;
+
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	__u32 ptr;
+	struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2);
+
+	ptr = wq->sq_rptr + count;
+	sqp += count;
+	while (ptr != wq->sq_wptr) {
+		insert_sq_cqe(wq, cq, sqp);
+		sqp++;
+		ptr++;
+	}
+}
+
+/* 
+ * Move all CQEs from the HWCQ into the SWCQ.
+ */
+void cxio_flush_hw_cq(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe, *swcqe;
+
+	PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid);
+	cqe = cxio_next_hw_cqe(cq);
+	while (cqe) {
+		PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", 
+		     __FUNCTION__, cq->rptr, cq->sw_wptr);
+		swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2);
+		*swcqe = *cqe;
+		swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1));
+		cq->sw_wptr++;
+		cq->rptr++;
+		cqe = cxio_next_hw_cqe(cq);
+	}
+}
+
+static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq)
+{
+	if (CQE_OPCODE(*cqe) == T3_TERMINATE) 
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) &&
+	    Q_EMPTY(wq->rq_rptr, wq->rq_wptr))
+		return 0;
+
+	return 1;
+}
+
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && 
+		    (CQE_QPID(*cqe) == wq->qpid))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	PDBG("%s count zero %d\n", __FUNCTION__, *count);
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && 
+		    (CQE_QPID(*cqe) == wq->qpid) && cqe_completes_wr(cqe, wq))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p)
+{
+	struct rdma_cq_setup setup;
+	setup.id = 0;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 1;		/* enable the CQ */
+	setup.credits = 0;
+
+	/* force SGE to redirect to RspQ and interrupt */
+	setup.credit_thres = 0;	
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	int err;
+	u64 sge_cmd, ctx0, ctx1, base_addr;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	err = cxio_hal_init_ctrl_cq(rdev_p);
+	if (err) {
+		PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err);
+		kfree_skb(skb);	/* do not leak the skb on failure */
+		return err;
+	}
+	rdev_p->ctrl_qp.workq = dma_alloc_coherent(
+					&(rdev_p->rnic_info.pdev->dev),
+					(1 << T3_CTRL_QP_SIZE_LOG2) *
+					sizeof(union t3_wr),
+					&(rdev_p->ctrl_qp.dma_addr),
+					GFP_KERNEL);
+	if (!rdev_p->ctrl_qp.workq) {
+		PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__);
+		kfree_skb(skb);	/* do not leak the skb on failure */
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, 
+			   rdev_p->ctrl_qp.dma_addr);
+	rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	memset(rdev_p->ctrl_qp.workq, 0,
+	       (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr));
+
+	init_MUTEX(&rdev_p->ctrl_qp.sem);
+	init_waitqueue_head(&rdev_p->ctrl_qp.waitq);
+
+	/* update HW Ctrl QP context */
+	base_addr = rdev_p->ctrl_qp.dma_addr;
+	base_addr >>= 12;
+	ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) |
+		V_EC_BASE_LO((u32) base_addr & 0xffff));
+	ctx0 <<= 32;
+	ctx0 |= V_EC_CREDITS(FW_WR_NUM);
+	base_addr >>= 16;
+	ctx1 = (u32) base_addr;
+	base_addr >>= 32;
+	ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) |
+			V_EC_TYPE(0) | V_EC_GEN(1) |
+			V_EC_UP_TOKEN(T3_CTL_QP_TID) | F_EC_VALID)) << 32;
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1,
+		       T3_CTL_QP_TID, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	wqe->ctx1 = cpu_to_be64(ctx1);
+	wqe->ctx0 = cpu_to_be64(ctx0);
+	PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n",
+	     (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq,
+	     1 << T3_CTRL_QP_SIZE_LOG2);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << T3_CTRL_QP_SIZE_LOG2)
+			  * sizeof(union t3_wr), rdev_p->ctrl_qp.workq,
+			  pci_unmap_addr(&rdev_p->ctrl_qp, mapping));
+	return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID);
+}
+
+/* Write len bytes of data into addr (32B-aligned address).
+ * If data is NULL, clear len bytes of memory to zero.
+ * The caller acquires the sem before the call.
+ */
+static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr,
+				      u32 len, void *data, int completion)
+{
+	u32 i, nr_wqe, copy_len;
+	u8 *copy_data;
+	u8 wr_len, utx_len;	/* length in 8-byte flits */
+	enum t3_wr_flags flag;
+	__be64 *wqe;
+	u64 utx_cmd;
+	addr &= 0x7FFFFFF;
+	nr_wqe = len % 96 ? len / 96 + 1 : len / 96;	/* 96B max per WQE */
+	PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n",
+	     __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, 
+	     nr_wqe, data, addr);
+	utx_len = 3;		/* in 32B unit */
+	for (i = 0; i < nr_wqe; i++) {
+		if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr,
+		           T3_CTRL_QP_SIZE_LOG2)) {
+			PDBG("%s ctrl_qp full wptr 0x%0x rptr 0x%0x, "
+			     "wait for more space i %d\n", __FUNCTION__, 
+			     rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i);
+			if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     !Q_FULL(rdev_p->ctrl_qp.rptr,
+						     rdev_p->ctrl_qp.wptr,
+						     T3_CTRL_QP_SIZE_LOG2))) {
+				PDBG("%s ctrl_qp workq interrupted\n",
+				     __FUNCTION__);
+				return -ERESTARTSYS;
+			}
+			PDBG("%s ctrl_qp wakeup, continue posting work request "
+			     "i %d\n", __FUNCTION__, i);
+		}
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+						(1 << T3_CTRL_QP_SIZE_LOG2)));
+		flag = 0;
+		if (i == (nr_wqe - 1)) {
+			/* last WQE */
+			flag = completion ? T3_COMPLETION_FLAG : 0;
+			if (len % 32)
+				utx_len = len / 32 + 1;
+			else
+				utx_len = len / 32;
+		}
+
+		/* 
+		 * Force a CQE to return the credit to the workq in case 
+		 * we posted more than half the max QP size of WRs 
+		 */
+		if ((i != 0) && 
+		    (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) {
+			flag = T3_COMPLETION_FLAG;
+			PDBG("%s force completion at i %d\n", __FUNCTION__, i);
+		}
+
+		/* build the utx mem command */
+		wqe += (sizeof(struct t3_bypass_wr) >> 3);
+		utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3);
+		utx_cmd <<= 32;
+		utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1);
+		*wqe = cpu_to_be64(utx_cmd);
+		wqe++;
+		copy_data = (u8 *) data + i * 96;
+		copy_len = len > 96 ? 96 : len;
+
+		/* clear memory content if data is NULL */
+		if (data)
+			memcpy(wqe, copy_data, copy_len);
+		else
+			memset(wqe, 0, copy_len);
+		if (copy_len % 32)
+			memset(((u8 *) wqe) + copy_len, 0,
+			       32 - (copy_len % 32));
+		wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + 
+			 (utx_len << 2);
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+			      (1 << T3_CTRL_QP_SIZE_LOG2)));
+
+		/* wptr in the WRID[31:0] */
+		((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr;
+
+		/* 
+		 * This must be the last write with a memory barrier 
+		 * for the genbit 
+		 */
+		build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag,
+			       Q_GENBIT(rdev_p->ctrl_qp.wptr,
+					T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID,
+			       wr_len);
+		if (flag == T3_COMPLETION_FLAG)
+			ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID);
+		len -= 96;
+		rdev_p->ctrl_qp.wptr++;
+	}
+	return 0;
+}
+
+/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size
+ * OUT: stag index, actual pbl_size, pbl_addr allocated.
+ * TBD: shared memory region support
+ */
+static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
+			 u32 *stag, u8 stag_state, u32 pdid,
+			 enum tpt_mem_type type, enum tpt_mem_perm perm,
+			 u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl,
+			 u32 *pbl_size, u32 *pbl_addr)
+{
+	int err;
+	struct tpt_entry tpt;
+	u32 stag_idx;
+	u32 wptr;
+	int rereg = (*stag != T3_STAG_UNSET);
+
+	stag_state = stag_state > 0;
+	stag_idx = (*stag) >> 8;
+
+	if (!reset_tpt_entry && !rereg) {
+		stag_idx = cxio_hal_get_stag(rdev_p->rscp);
+		if (!stag_idx)
+			return -ENOMEM;
+		*stag = (stag_idx << 8) | ((*stag) & 0xFF);
+	}
+	PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", 
+	     __FUNCTION__, stag_state, type, pdid, stag_idx);
+	
+	if (reset_tpt_entry) 
+		cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3);
+	else if (!rereg) {
+		*pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3);
+		if (!*pbl_addr) {
+			return -ENOMEM;
+		}
+	}
+
+	down(&rdev_p->ctrl_qp.sem);
+
+	/* write PBL first if any - update pbl only if pbl list exists */
+	if (pbl) {
+
+		PDBG("%s *pbl_addr 0x%x, pbl_base 0x%x, pbl_size %d\n",
+		     __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, 
+		     *pbl_size);
+		err = cxio_hal_ctrl_qp_write_mem(rdev_p, 
+				(*pbl_addr >> 5),
+				(*pbl_size << 3), pbl, 0);
+		if (err)
+			goto ret;
+	}
+
+	/* write TPT entry */
+	if (reset_tpt_entry)
+		memset(&tpt, 0, sizeof(tpt));
+	else {
+		tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID |
+				V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) |
+				V_TPT_STAG_STATE(stag_state) |
+				V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid));
+		BUG_ON(page_size >= 28);
+		tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | 
+			    	F_TPT_MW_BIND_ENABLE |
+				V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) |
+				V_TPT_PAGE_SIZE(page_size));
+		tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : 
+				    cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3));
+		tpt.len = cpu_to_be32(len);
+		tpt.va_hi = cpu_to_be32((u32) (to >> 32));
+		tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL));
+		tpt.rsvd_bind_cnt_or_pstag = 0;
+		tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : 
+				  cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2));
+	}
+	err = cxio_hal_ctrl_qp_write_mem(rdev_p,
+				       stag_idx +
+				       (rdev_p->rnic_info.tpt_base >> 5),
+				       sizeof(tpt), &tpt, 1);
+
+	/* release the stag index to free pool */
+	if (reset_tpt_entry)
+		cxio_hal_put_stag(rdev_p->rscp, stag_idx);
+ret:	
+	wptr = rdev_p->ctrl_qp.wptr;
+	up(&rdev_p->ctrl_qp.sem);
+	if (!err)
+		if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     SEQ32_GE(rdev_p->ctrl_qp.rptr,
+						      wptr)))
+			return -ERESTARTSYS;
+	return err;
+}
+
+/* IN: stag key, pdid, pbl_size
+ * OUT: stag index, actual pbl_size, and pbl_addr allocated.
+ */
+int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, 
+			      perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr));
+}
+
+int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     &pbl_size, &pbl_addr);
+}
+
+int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid)
+{
+	u32 pbl_size = 0;
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0,
+			     NULL, &pbl_size, NULL);
+}
+
+int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     NULL, NULL);
+}
+
+int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
+{
+	struct t3_rdma_init_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+	PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p);
+	wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe));
+	wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT));
+	wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) |
+					   V_FW_RIWR_LEN(sizeof(*wqe) >> 3));
+	wqe->wrid.id1 = 0;
+	wqe->qpid = cpu_to_be32(attr->qpid);
+	wqe->pdid = cpu_to_be32(attr->pdid);
+	wqe->scqid = cpu_to_be32(attr->scqid);
+	wqe->rcqid = cpu_to_be32(attr->rcqid);
+	wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base);
+	wqe->rq_size = cpu_to_be32(attr->rq_size);
+	wqe->mpaattrs = attr->mpaattrs;
+	wqe->qpcaps = attr->qpcaps;
+	wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss);
+	wqe->flags = cpu_to_be32(attr->flags);
+	wqe->ord = cpu_to_be32(attr->ord);
+	wqe->ird = cpu_to_be32(attr->ird);
+	wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr);
+	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
+	wqe->rsvd = 0;
+	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = ev_cb;
+}
+
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = NULL;
+}
+
+static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb)
+{
+	static int cnt;
+	struct cxio_rdev *rdev_p = NULL;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x"
+	     " se %0x notify %0x cqbranch %0x creditth %0x\n",
+	     cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg),
+	     RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg),
+	     RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg),
+	     RSPQ_CREDIT_THRESH(rsp_msg));
+	PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d "
+	     "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), 
+	     CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+	     CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), 
+	     CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+	rdev_p = (struct cxio_rdev *)t3cdev_p->ulp;
+	if (!rdev_p) {
+		PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__,
+		     t3cdev_p);
+		return 0;
+	}
+	if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) {
+		rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1;
+		wake_up_interruptible(&rdev_p->ctrl_qp.waitq);
+		dev_kfree_skb_irq(skb);
+	} else if (CQE_QPID(rsp_msg->cqe) == 0xfff8)
+		dev_kfree_skb_irq(skb);
+	else if (cxio_ev_cb)
+		(*cxio_ev_cb) (rdev_p, skb);
+	else
+		dev_kfree_skb_irq(skb);
+	cnt++;
+	return 0;
+}
+
+/* Caller takes care of locking if needed */
+int cxio_rdev_open(struct cxio_rdev *rdev_p)
+{
+	struct net_device *netdev_p = NULL;
+	int err = 0;
+	if (strlen(rdev_p->dev_name)) {
+		if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) {
+			return -EBUSY;
+		}
+		netdev_p = dev_get_by_name(rdev_p->dev_name);
+		if (!netdev_p) {
+			return -EINVAL;
+		}
+		dev_put(netdev_p);
+	} else if (rdev_p->t3cdev_p) {
+		if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) {
+			return -EBUSY;
+		}
+		netdev_p = rdev_p->t3cdev_p->lldev;
+		strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name,
+			T3_MAX_DEV_NAME_LEN);
+	} else {
+		PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	if (cxio_hal_add_rdev(rdev_p))
+		return -ENOMEM;
+
+	PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name);
+	memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp));
+	if (!rdev_p->t3cdev_p)
+		rdev_p->t3cdev_p = T3CDEV(netdev_p);
+	rdev_p->t3cdev_p->ulp = (void *) rdev_p;
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS,
+					 &(rdev_p->rnic_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS,
+				    &(rdev_p->port_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+
+	/* 
+	 * qpshift is the number of bits to shift the qpid left in order
+	 * to get the correct address of the doorbell for that qp.
+	 */
+	cxio_init_ucontext(rdev_p, &rdev_p->uctx);
+	rdev_p->qpshift = PAGE_SHIFT - 
+			  ilog2(65536 >> 
+			            ilog2(rdev_p->rnic_info.udbell_len >> 
+					      PAGE_SHIFT));
+	rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT;
+	rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1;
+	PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d "
+	     "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", 
+	     __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, 
+  	     rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), 
+  	     rdev_p->rnic_info.pbl_base, 
+  	     rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base,
+  	     rdev_p->rnic_info.rqt_top);
+	PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu "
+	     "qpnr %d qpmask 0x%x\n", 
+	     rdev_p->rnic_info.udbell_len, 
+	     rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr,
+	     rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask);
+
+	err = cxio_hal_init_ctrl_qp(rdev_p);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", 
+		       __FUNCTION__, err);
+		goto err1;
+	}
+ 	err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0,
+				     0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ,
+				     T3_MAX_NUM_PD);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing hal resources.\n", 
+		       __FUNCTION__, err);
+		goto err2;
+	}
+ 	err = cxio_hal_pblpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing pbl mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err3;
+ 	}
+ 	err = cxio_hal_rqtpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing rqt mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err4;
+ 	}
+  	return 0;
+err4:
+ 	cxio_hal_pblpool_destroy(rdev_p);
+err3:
+ 	cxio_hal_destroy_resource(rdev_p->rscp);
+err2:
+	cxio_hal_destroy_ctrl_qp(rdev_p);
+err1:
+	cxio_hal_delete_rdev(rdev_p);
+	return err;
+}
+
+void cxio_rdev_close(struct cxio_rdev *rdev_p)
+{
+	if (rdev_p) {
+		cxio_hal_pblpool_destroy(rdev_p);
+		cxio_hal_rqtpool_destroy(rdev_p);
+		cxio_hal_delete_rdev(rdev_p);
+		rdev_p->t3cdev_p->ulp = NULL;
+		cxio_hal_destroy_ctrl_qp(rdev_p);
+		cxio_hal_destroy_resource(rdev_p->rscp);
+	}
+}
+
+int __init cxio_hal_init(void)
+{
+	if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI))
+		return -ENOMEM;
+	memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *));
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler);
+	return 0;
+}
+
+void __exit cxio_hal_exit(void)
+{
+	int i;
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL);
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		cxio_rdev_close(rdev_tbl[i]);
+	cxio_hal_destroy_rhdl_resource();
+}
+
+static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_swsq *sqp;
+	__u32 ptr = wq->sq_rptr;
+	int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr);
+	
+	sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
+	while (count--)
+		if (!sqp->signaled) {
+			ptr++;
+			sqp = wq->sq + Q_PTR2IDX(ptr,  wq->sq_size_log2);
+		} else if (sqp->complete) {
+
+			/* 
+			 * Insert this completed cqe into the swcq.
+			 */
+			PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n",
+			     __FUNCTION__, Q_PTR2IDX(ptr,  wq->sq_size_log2),
+			     Q_PTR2IDX(cq->sw_wptr, cq->size_log2));
+			sqp->cqe.header |= htonl(V_CQE_SWCQE(1));
+			*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) 
+				= sqp->cqe;
+			cq->sw_wptr++;
+			sqp->signaled = 0;
+			break;
+		} else
+			break;
+}
+
+static inline void create_read_req_cqe(struct t3_wq *wq,
+				       struct t3_cqe *hw_cqe,
+				       struct t3_cqe *read_cqe)
+{
+	read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr;
+	read_cqe->len = wq->oldest_read->read_len;
+	read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) |
+				 V_CQE_SWCQE(SW_CQE(*hw_cqe)) |
+				 V_CQE_OPCODE(T3_READ_REQ) |
+				 V_CQE_TYPE(1));
+}
+
+/*
+ * Advance wq->oldest_read to the next read wr in the SWSQ, or NULL if none.
+ */
+static inline void advance_oldest_read(struct t3_wq *wq)
+{
+
+	u32 rptr = wq->oldest_read - wq->sq + 1;
+	u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2);
+
+	while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) {
+		wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2);
+
+		if (wq->oldest_read->opcode == T3_READ_REQ)
+			return;
+		rptr++;
+	}
+	wq->oldest_read = NULL;
+}
+
+/*
+ * cxio_poll_cq
+ *
+ * Caller must:
+ *     check the validity of the first CQE,
+ *     supply the wq associated with the qpid.
+ *
+ * credit: cq credit to return to sge.
+ * cqe_flushed: 1 iff the CQE is flushed.
+ * cqe: copy of the polled CQE.
+ *
+ * return value:
+ *     0       CQE returned,
+ *    -1       CQE skipped, try again.
+ */
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit)
+{
+	int ret = 0;
+	struct t3_cqe *hw_cqe, read_cqe;
+
+	*cqe_flushed = 0;
+	*credit = 0;
+	hw_cqe = cxio_next_cqe(cq);
+
+	PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x"
+	     " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), 
+	     CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), 
+	     CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), 
+	     CQE_WRID_LOW(*hw_cqe));
+
+	/* 
+	 * skip cqe's not affiliated with a QP.
+	 */
+	if (wq == NULL) {
+		ret = -1;
+		goto skip_cqe;
+	}
+
+	/*
+	 * Gotta tweak READ completions:
+	 * 	1) the cqe doesn't contain the sq_wptr from the wr.
+	 *	2) opcode not reflected from the wr.
+	 *	3) read_len not reflected from the wr.
+	 *	4) cq_type is RQ_TYPE not SQ_TYPE.
+	 */
+	if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) {
+		
+		/* 
+	 	 * Don't write to the HWCQ, so create a new read req CQE 
+		 * in local memory.
+		 */
+		create_read_req_cqe(wq, hw_cqe, &read_cqe);
+		hw_cqe = &read_cqe;
+		advance_oldest_read(wq);
+	}
+
+	/*
+ 	 * T3A: Discard TERMINATE CQEs.
+	 */
+	if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) {
+		ret = -1;
+		wq->error = 1;
+		goto skip_cqe;
+	}
+
+	if (CQE_STATUS(*hw_cqe) || wq->error) {
+		*cqe_flushed = wq->error;
+		wq->error = 1;
+	
+		/* 
+		 * T3A inserts errors into the CQE.  We cannot return 
+	 	 * these as work completions.
+	 	 */
+		/* incoming write failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) 
+		     && RQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		/* incoming read request failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+
+		/* incoming SEND with no receive posted failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) &&
+		    Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/*
+	 * RECV completion.
+	 */
+	if (RQ_TYPE(*hw_cqe)) {
+
+		/* 
+		 * HW only validates 4 bits of MSN.  So we must validate that
+		 * the MSN in the SEND is the next expected MSN.  If it's not,
+		 * then we complete this with TPT_ERR_MSN and mark the wq in 
+		 * error.
+		 */
+		if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) {
+			wq->error = 1;
+			hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN));
+			goto proc_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/* 
+	 * If we get here it's a send completion.
+	 *
+	 * Handle out of order completion. These get stuffed
+	 * in the SW SQ. Then the SW SQ is walked to move any
+	 * now in-order completions into the SW CQ.  This handles
+	 * 2 cases:
+	 * 	1) reaping unsignaled WRs when the first subsequent
+	 *	   signaled WR is completed.
+	 *	2) out of order read completions.
+	 */
+	if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) {
+		struct t3_swsq *sqp;
+
+		PDBG("%s out of order completion going in swsq at idx %ld\n",
+		     __FUNCTION__, 
+		     Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2));
+		sqp = wq->sq + 
+		      Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2);
+		sqp->cqe = *hw_cqe;
+		sqp->complete = 1;
+		ret = -1;
+		goto flush_wq;
+	}
+	
+proc_cqe:
+	*cqe = *hw_cqe;
+
+	/*
+	 * Reap the associated WR(s) that are freed up with this
+	 * completion.
+	 */
+	if (SQ_TYPE(*hw_cqe)) {
+		wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe);
+		PDBG("%s completing sq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2));
+		*cookie = (wq->sq + 
+			   Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id;
+		wq->sq_rptr++;
+	} else {
+		PDBG("%s completing rq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		*cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		wq->rq_rptr++;
+	}
+
+flush_wq:
+	/*
+	 * Flush any completed cqes that are now in-order.
+	 */
+	flush_completed_wrs(wq, cq);
+
+skip_cqe:
+	if (SW_CQE(*hw_cqe)) {
+		PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->sw_rptr);
+		++cq->sw_rptr;
+	} else {
+		PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->rptr);
+		++cq->rptr;
+
+		/*
+		 * T3A: compute credits.
+		 */
+		if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1)))
+		    || ((cq->rptr - cq->wptr) >= 128)) {
+			*credit = cq->rptr - cq->wptr;
+			cq->wptr = cq->rptr;
+		}
+	}
+	return ret;
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
new file mode 100644
index 0000000..bde5cfb
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef  __CXIO_HAL_H__
+#define  __CXIO_HAL_H__
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#include "t3_cpl.h"
+#include "t3cdev.h"
+#include "cxgb3_ctl_defs.h"
+#include "cxio_wr.h"
+
+#define T3_CTRL_QP_ID    FW_RI_SGEEC_START
+#define T3_CTL_QP_TID	 FW_RI_TID_START
+#define T3_CTRL_QP_SIZE_LOG2  8
+#define T3_CTRL_CQ_ID    0
+
+/* TBD */
+#define T3_MAX_NUM_RNIC  8
+#define T3_MAX_NUM_RI (1<<15)
+#define T3_MAX_NUM_QP (1<<15)
+#define T3_MAX_NUM_CQ (1<<15)
+#define T3_MAX_NUM_PD (1<<15)
+#define T3_MAX_PBL_SIZE 256
+#define T3_MAX_RQ_SIZE 1024
+#define T3_MAX_NUM_STAG (1<<15)
+
+#define T3_STAG_UNSET 0xffffffff
+
+#define T3_MAX_DEV_NAME_LEN 32
+
+struct cxio_hal_ctrl_qp {
+	u32 wptr;
+	u32 rptr;
+	struct semaphore sem;	/* for the wptr, can sleep */
+	wait_queue_head_t waitq;	/* wait for RspQ/CQE msg */
+	union t3_wr *workq;	/* the work request queue */
+	dma_addr_t dma_addr;	/* pci bus address of the workq */
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	void __iomem *doorbell;
+};
+
+struct cxio_hal_resource {
+	struct kfifo *tpt_fifo;
+	spinlock_t tpt_fifo_lock;
+	struct kfifo *qpid_fifo;
+	spinlock_t qpid_fifo_lock;
+	struct kfifo *cqid_fifo;
+	spinlock_t cqid_fifo_lock;
+	struct kfifo *pdid_fifo;
+	spinlock_t pdid_fifo_lock;
+};
+
+struct cxio_qpid_list {
+	struct list_head entry;
+	u32 qpid;
+};
+
+struct cxio_ucontext {
+	struct list_head qpids;
+	struct mutex lock;
+};
+
+struct cxio_rdev {
+	char dev_name[T3_MAX_DEV_NAME_LEN];
+	struct t3cdev *t3cdev_p;
+	struct rdma_info rnic_info;
+	struct adap_ports port_info;
+	struct cxio_hal_resource *rscp;
+	struct cxio_hal_ctrl_qp ctrl_qp;
+	void *ulp;
+	unsigned long qpshift;
+	u32 qpnr;
+	u32 qpmask;
+	struct cxio_ucontext uctx;
+	struct gen_pool *pbl_pool;
+	struct gen_pool *rqt_pool;
+};
+
+static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
+{
+	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
+}
+
+typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p,
+					     struct sk_buff * skb);
+
+#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff)
+#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff)
+#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1)
+#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1)
+#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1)
+#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1)
+#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1)
+#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1)
+#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1)
+
+struct respQ_msg_t {
+	__be32 flags;		/* flit 0 */
+	__be32 cq_ptrid;
+	__be64 rsvd;		/* flit 1 */
+	struct t3_cqe cqe;	/* flits 2-3 */
+};
+
+enum t3_cq_opcode {
+	CQ_ARM_AN = 0x2,
+	CQ_ARM_SE = 0x6,
+	CQ_FORCE_AN = 0x3,
+	CQ_CREDIT_UPDATE = 0x7
+};
+
+int cxio_rdev_open(struct cxio_rdev *rdev);
+void cxio_rdev_close(struct cxio_rdev *rdev);
+int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, 
+	 	   enum t3_cq_opcode op, u32 credit);
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid);
+int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq,
+		   struct cxio_ucontext *uctx);
+int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx);
+int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode);
+int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr);
+int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr);
+int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
+int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag);
+int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr);
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+u32 cxio_hal_get_rhdl(void);
+void cxio_hal_put_rhdl(u32 rhdl);
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp);
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid);
+int __init cxio_hal_init(void);
+void __exit cxio_hal_exit(void);
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_flush_hw_cq(struct t3_cq *cq);
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+
+#define MOD "iw_cxgb3: "
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
+
+#ifdef DEBUG
+void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag);
+void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift);
+void cxio_dump_wqe(union t3_wr *wqe);
+void cxio_dump_wce(struct t3_cqe *wce);
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents);
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid);
+#endif
+
+#endif
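
A note for readers following the T3A credit logic at the end of
cxio_poll_cq() in this patch: credits are not returned per CQE but
batched until more than half the CQ, or at least 128 entries, have been
consumed. Below is a minimal userspace sketch of that threshold test.
It assumes that a return value of 0 means "no credit update yet"; the
names cq_ptrs and cq_consume_one are made up for illustration and do
not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down view of the CQ pointers used by the credit test. */
struct cq_ptrs {
	uint32_t rptr;       /* entries consumed by software */
	uint32_t wptr;       /* entries already credited back */
	uint32_t size_log2;  /* CQ holds 1 << size_log2 entries */
};

/*
 * Consume one hardware CQE. Returns the number of credits to hand back,
 * or 0 if the batch threshold has not been reached (assumption of this
 * sketch; the patch leaves *credit untouched in that case).
 */
static uint32_t cq_consume_one(struct cq_ptrs *cq)
{
	uint32_t pending;

	assert(cq->size_log2 >= 1);
	cq->rptr++;
	pending = cq->rptr - cq->wptr;   /* unsigned math survives wraparound */
	if (pending > (1u << (cq->size_log2 - 1)) || pending >= 128) {
		cq->wptr = cq->rptr;
		return pending;
	}
	return 0;
}
```

With a 16-entry CQ (size_log2 = 4) the first eight calls return 0 and
the ninth returns 9; with large CQs the 128-entry cap triggers first.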


From swise at opengridcomputing.com  Sun Dec 10 14:38:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:38:16 -0600
Subject: [openib-general] [PATCH  v3 11/13] Core Resource Allocation
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223816.27166.81499.stgit@dell3.ogc.int>


Core functions to carve up adapter memory, stag, qp, and cq IDs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_resource.c |  331 ++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_resource.h |   70 +++++
 2 files changed, 401 insertions(+), 0 deletions(-)
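
For orientation, the ID allocators in this patch can be pictured as a
FIFO that is preloaded with every usable ID: "get" dequeues one, "put"
queues it back, and 0 doubles as "nothing available" (which is
presumably why the callers skip ID 0). A rough userspace sketch follows,
assuming a plain ring buffer in place of the kernel kfifo and no
locking; id_fifo, id_fifo_init, id_get and id_put are hypothetical
names, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a kfifo of u32 IDs. */
struct id_fifo {
	uint32_t *slot;  /* ring storage */
	uint32_t size;   /* capacity in IDs */
	uint32_t in;     /* total IDs ever queued */
	uint32_t out;    /* total IDs ever dequeued */
};

/* Preload IDs skip_low .. nr - skip_high - 1, in ascending order. */
static int id_fifo_init(struct id_fifo *f, uint32_t nr,
			uint32_t skip_low, uint32_t skip_high)
{
	uint32_t i;

	assert(nr > skip_low + skip_high);
	f->slot = malloc(nr * sizeof(*f->slot));
	if (!f->slot)
		return -1;
	f->size = nr;
	f->in = f->out = 0;
	for (i = skip_low; i < nr - skip_high; i++)
		f->slot[f->in++ % f->size] = i;
	return 0;
}

/* Returns 0 if no ID is available, mirroring cxio_hal_get_resource(). */
static uint32_t id_get(struct id_fifo *f)
{
	if (f->in == f->out)
		return 0;
	return f->slot[f->out++ % f->size];
}

/* Hand an ID back; overflowing the FIFO is a caller bug. */
static void id_put(struct id_fifo *f, uint32_t id)
{
	assert(f->in - f->out < f->size);
	f->slot[f->in++ % f->size] = id;
}
```

A freed ID goes to the tail of the queue, so it is not reused until the
IDs queued ahead of it have been handed out.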

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
new file mode 100644
index 0000000..444df15
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
@@ -0,0 +1,331 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+/* Crude resource management */
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+
+static struct kfifo *rhdl_fifo;
+static spinlock_t rhdl_fifo_lock;
+
+#define RANDOM_SIZE 16
+
+static int __cxio_init_resource_fifo(struct kfifo **fifo,
+				   spinlock_t *fifo_lock,
+				   u32 nr, u32 skip_low,
+				   u32 skip_high,
+				   int random)
+{
+	u32 i, j, entry = 0, idx;
+	u32 random_bytes;
+	u32 rarray[16];
+	spin_lock_init(fifo_lock);
+
+	*fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock);
+	if (IS_ERR(*fifo))
+		return -ENOMEM;
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		__kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32));
+	if (random) {
+		j = 0;
+		random_bytes = random32();
+		for (i = 0; i < RANDOM_SIZE; i++)
+			rarray[i] = i + skip_low;
+		for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
+			if (j >= RANDOM_SIZE) {
+				j = 0;
+				random_bytes = random32();
+			}
+			idx = (random_bytes >> (j * 2)) & 0xF;
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[idx],
+				sizeof(u32));
+			rarray[idx] = i;
+			j++;	
+		}
+		for (i = 0; i < RANDOM_SIZE; i++)
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[i],
+				sizeof(u32));
+	} else
+		for (i = skip_low; i < nr - skip_high; i++)
+			__kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32));
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32));
+	return 0;
+}
+
+static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 0));
+}
+
+static int cxio_init_resource_fifo_random(struct kfifo **fifo,
+				   spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 1));
+}
+
+static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p)
+{
+	u32 i;
+
+	spin_lock_init(&rdev_p->rscp->qpid_fifo_lock);
+
+	rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), 
+					      GFP_KERNEL, 
+					      &rdev_p->rscp->qpid_fifo_lock);
+	if (IS_ERR(rdev_p->rscp->qpid_fifo))
+		return -ENOMEM;
+
+	for (i = 16; i < T3_MAX_NUM_QP; i++)
+		if (!(i & rdev_p->qpmask))
+			__kfifo_put(rdev_p->rscp->qpid_fifo, 
+				    (unsigned char *) &i, sizeof(u32));
+	return 0;
+}
+
+int cxio_hal_init_rhdl_resource(u32 nr_rhdl)
+{
+	return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1,
+				       0);
+}
+
+void cxio_hal_destroy_rhdl_resource(void)
+{
+	kfifo_free(rhdl_fifo);
+}
+
+/* nr_* must be power of 2 */
+int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+			   u32 nr_tpt, u32 nr_pbl,
+			   u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid)
+{
+	int err = 0;
+	struct cxio_hal_resource *rscp;
+
+	rscp = kmalloc(sizeof(*rscp), GFP_KERNEL);
+	if (!rscp)
+		return -ENOMEM;
+	rdev_p->rscp = rscp;
+	err = cxio_init_resource_fifo_random(&rscp->tpt_fifo,
+				      &rscp->tpt_fifo_lock, 
+				      nr_tpt, 1, 0);
+	if (err)
+		goto tpt_err;
+	err = cxio_init_qpid_fifo(rdev_p);
+	if (err)
+		goto qpid_err;
+	err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, 
+				      nr_cqid, 1, 0);
+	if (err)
+		goto cqid_err;
+	err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, 
+				      nr_pdid, 1, 0);
+	if (err)
+		goto pdid_err;
+	return 0;
+pdid_err:
+	kfifo_free(rscp->cqid_fifo);
+cqid_err:
+	kfifo_free(rscp->qpid_fifo);
+qpid_err:
+	kfifo_free(rscp->tpt_fifo);
+tpt_err:
+	return -ENOMEM;
+}
+
+/*
+ * returns 0 if no resource available
+ */
+static inline u32 cxio_hal_get_resource(struct kfifo *fifo)
+{
+	u32 entry;
+	if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32)))
+		return entry;
+	else
+		return 0;	/* fifo empty */
+}
+
+static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry)
+{
+	BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0);
+}
+
+u32 cxio_hal_get_rhdl(void)
+{
+	return cxio_hal_get_resource(rhdl_fifo);
+}
+
+void cxio_hal_put_rhdl(u32 rhdl)
+{
+	cxio_hal_put_resource(rhdl_fifo, rhdl);
+}
+
+u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->tpt_fifo);
+}
+
+void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag)
+{
+	cxio_hal_put_resource(rscp->tpt_fifo, stag);
+}
+
+u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp)
+{
+	u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid)
+{
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	cxio_hal_put_resource(rscp->qpid_fifo, qpid);
+}
+
+u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->cqid_fifo);
+}
+
+void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid)
+{
+	cxio_hal_put_resource(rscp->cqid_fifo, cqid);
+}
+
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->pdid_fifo);
+}
+
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid)
+{
+	cxio_hal_put_resource(rscp->pdid_fifo, pdid);
+}
+
+void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp)
+{
+	kfifo_free(rscp->tpt_fifo);
+	kfifo_free(rscp->cqid_fifo);
+	kfifo_free(rscp->qpid_fifo);
+	kfifo_free(rscp->pdid_fifo);
+	kfree(rscp);
+}
+
+/*
+ * PBL Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_PBL_SHIFT 8			/* 256B == min PBL size (32 entries) */
+#define PBL_CHUNK 2*1024*1024 		
+
+u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size);
+	return (u32)addr;
+}
+
+void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size);
+	gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size);
+}
+
+int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1);
+	if (rdev_p->pbl_pool)
+		for (i = rdev_p->rnic_info.pbl_base; 
+		     i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; 
+		     i += PBL_CHUNK)
+			gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1);
+	return rdev_p->pbl_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->pbl_pool);
+}
+
+/*
+ * RQT Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_RQT_SHIFT 10	/* 1KB == min RQT size (16 entries) */
+#define RQT_CHUNK 2*1024*1024 		
+
+u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6);
+	return (u32)addr;
+}
+
+void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6);
+	gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6);
+}
+
+int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1);
+	if (rdev_p->rqt_pool)
+		for (i = rdev_p->rnic_info.rqt_base; 
+		     i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; 
+		     i += RQT_CHUNK)
+			gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1);
+	return rdev_p->rqt_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->rqt_pool);
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
new file mode 100644
index 0000000..a6bbe83
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_RESOURCE_H__
+#define __CXIO_RESOURCE_H__
+
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/genalloc.h>
+#include "cxio_hal.h"
+
+extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl);
+extern void cxio_hal_destroy_rhdl_resource(void);
+extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+				  u32 nr_tpt, u32 nr_pbl,
+				  u32 nr_rqt, u32 nr_qpid, u32 nr_cqid,
+				  u32 nr_pdid);
+extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag);
+extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid);
+extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid);
+extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp);
+
+#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base )
+extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+
+#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base )
+extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+#endif
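
The randomized variant in __cxio_init_resource_fifo() above may be
easier to follow in isolation. It keeps a 16-entry window of candidate
IDs, emits a pseudo-randomly chosen window entry for each new ID, puts
the new ID in the vacated window slot, and finally drains the window.
The sketch below is a userspace approximation under two assumptions: a
small LCG stands in for random32(), and there are at least 16 usable
IDs. scramble_ids and next_random are illustrative names only.

```c
#include <assert.h>
#include <stdint.h>

#define WINDOW 16	/* RANDOM_SIZE in the patch */

/* Hypothetical, deterministic stand-in for the kernel's random32(). */
static uint32_t lcg_state = 1;

static uint32_t next_random(void)
{
	lcg_state = lcg_state * 1664525u + 1013904223u;
	return lcg_state;
}

/*
 * Write IDs skip_low .. nr - skip_high - 1 into out[] in scrambled
 * order. Returns how many IDs were written.
 */
static uint32_t scramble_ids(uint32_t *out, uint32_t nr,
			     uint32_t skip_low, uint32_t skip_high)
{
	uint32_t window[WINDOW];
	uint32_t i, idx, bits, j = 0, n = 0;

	assert(nr >= skip_low + skip_high + WINDOW);
	for (i = 0; i < WINDOW; i++)
		window[i] = skip_low + i;
	bits = next_random();
	for (i = skip_low + WINDOW; i < nr - skip_high; i++) {
		if (j >= WINDOW) {	/* refresh the random word */
			j = 0;
			bits = next_random();
		}
		idx = (bits >> (j * 2)) & 0xF;
		out[n++] = window[idx];	/* emit a random window entry */
		window[idx] = i;	/* refill the slot with the next ID */
		j++;
	}
	for (i = 0; i < WINDOW; i++)	/* drain what is left */
		out[n++] = window[i];
	return n;
}
```

Because an ID leaves the window only when it is emitted, every ID
comes out exactly once; the order is scrambled only locally, since an
ID cannot be emitted before it has entered the window.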


From swise at opengridcomputing.com  Sun Dec 10 14:38:46 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:38:46 -0600
Subject: [openib-general] [PATCH  v3 12/13] Core Debug functions
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223846.27166.55367.stgit@dell3.ogc.int>


Debug code to dump various data structs, some of which are in 
adapter memory.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_dbg.c |  205 +++++++++++++++++++++++++++
 1 files changed, 205 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
new file mode 100644
index 0000000..22f4f75
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifdef DEBUG
+#include <linux/types.h>
+#include "common.h"
+#include "cxgb3_ioctl.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) 
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size = 32;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base;
+	m->len = size;
+	PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift)
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size, npages;
+
+	shift += 12;
+	npages = (len + (1ULL << shift) - 1) >> shift;
+	size = npages * sizeof(u64);
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = pbl_addr;
+	m->len = size;
+	PDBG("%s PBL addr 0x%x len %d depth %d\n", 
+		__FUNCTION__, m->addr, m->len, npages);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_wqe(union t3_wr *wqe)
+{
+	__be64 *data = (__be64 *)wqe;
+	uint size = (uint)(be64_to_cpu(*data) & 0xff);
+
+	if (size == 0) 
+		size = 8;
+	while (size > 0) {
+		PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data));
+		size--;
+		data++;
+	}
+}
+
+void cxio_dump_wce(struct t3_cqe *wce)
+{
+	__be64 *data = (__be64 *)wce;
+	int size = sizeof(*wce);
+
+	while (size > 0) {
+		PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data));
+		size -= 8;
+		data++;
+	}
+}
+
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents)
+{
+	struct ch_mem_range *m;
+	int size = nents * 64;
+	u64 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base;
+	m->len = size;
+	PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid)
+{
+	struct ch_mem_range *m;
+	int size = TCB_SIZE;
+	u32 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_CM;
+	m->addr = hwtid * size; 
+	m->len = size;
+	PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u32 *)m->buf;
+	while (size > 0) {
+		printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", 
+			m->addr, 
+			*(data+2), *(data+3), *(data),*(data+1),
+			*(data+6), *(data+7), *(data+4), *(data+5));
+		size -= 32;
+		data += 8;
+		m->addr += 32;
+	}
+	kfree(m);
+}
+#endif
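
One detail of cxio_dump_pbl() above that is easy to misread: the shift
argument is relative to 4KB pages (12 is added to it), and the number
of PBL entries is the region length rounded up to whole pages, at 8
bytes per entry. A small standalone sketch of that arithmetic follows;
pbl_npages and pbl_bytes are hypothetical helper names, not part of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Pages needed to cover len bytes; page size is 4KB << page_shift. */
static uint64_t pbl_npages(uint32_t len, unsigned int page_shift)
{
	unsigned int shift = page_shift + 12;

	assert(shift < 64);
	return ((uint64_t)len + (1ULL << shift) - 1) >> shift;
}

/* Bytes of PBL to read back: one 64-bit entry per page. */
static uint64_t pbl_bytes(uint32_t len, unsigned int page_shift)
{
	return pbl_npages(len, page_shift) * sizeof(uint64_t);
}
```

For example, an 8193-byte region with 4KB pages spans three pages, so
24 bytes of PBL would be dumped.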


From swise at opengridcomputing.com  Sun Dec 10 14:39:16 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Sun, 10 Dec 2006 16:39:16 -0600
Subject: [openib-general] [PATCH  v3 13/13] Kconfig/Makefile
In-Reply-To: <20061210223244.27166.36192.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <20061210223916.27166.82130.stgit@dell3.ogc.int>


Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/Kconfig              |    1 +
 drivers/infiniband/Makefile             |    1 +
 drivers/infiniband/hw/cxgb3/Kconfig     |   27 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/Makefile    |   12 ++++++++++++
 drivers/infiniband/hw/cxgb3/locking.txt |   25 +++++++++++++++++++++++++
 5 files changed, 66 insertions(+), 0 deletions(-)
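
The locking.txt file added by this patch describes a fixed nesting
order on the poll path: the cq lock is taken first, and the qp lock is
taken inside it around each polled CQE. The userspace sketch below only
illustrates that ordering. It assumes C11 atomic_flag spin locks in
place of spinlock_t, cannot model the interrupt disabling that
"lock+disable" implies, and uses made-up structures (spin, qp, cq) and
a made-up poll_all() in which a counter bump stands in for the call to
cxio_poll_cq().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace spin lock; the driver would use spinlock_t. */
struct spin {
	atomic_flag held;
};

static void spin_init(struct spin *l)
{
	atomic_flag_clear(&l->held);
}

static void spin_acquire(struct spin *l)
{
	while (atomic_flag_test_and_set_explicit(&l->held,
						 memory_order_acquire))
		;	/* busy-wait until the holder releases */
}

static void spin_release(struct spin *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct qp {
	struct spin lock;
	int completions;
};

struct cq {
	struct spin lock;
	struct qp *owner[8];	/* QP that each pending CQE belongs to */
	int pending;
};

/* Poll path: cq lock outermost, qp lock nested around each CQE. */
static int poll_all(struct cq *cq)
{
	int polled = 0;

	spin_acquire(&cq->lock);
	while (cq->pending > 0) {
		struct qp *qp = cq->owner[--cq->pending];

		spin_acquire(&qp->lock);
		qp->completions++;	/* stands in for cxio_poll_cq() */
		spin_release(&qp->lock);
		polled++;
	}
	spin_release(&cq->lock);
	return polled;
}
```

The post path described in locking.txt takes only the qp lock, so it
never holds the two locks in the opposite order.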

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)		+= hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH)		+= hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA)		+= hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100)	+= hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3)		+= hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 0000000..84f0f6e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+	tristate "Chelsio RDMA Driver"
+	depends on CHELSIO_T3 && INFINIBAND
+	select GENERIC_ALLOCATOR
+	---help---
+	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+	  10GbE adapters.
+
+          For general information about Chelsio and our products, visit
+          our website at <http://www.chelsio.com>.
+
+          For customer support, please visit our customer support page at
+          <http://www.chelsio.com/support.htm>.
+
+          Please send feedback to <linux-bugs at chelsio.com>.
+
+          To compile this driver as a module, choose M here: the module
+          will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_CXGB3
+	default n
+	---help---
+	  This option causes the Chelsio RDMA driver to produce copious
+	  amounts of debug messages.  Select this if you are developing
+	  the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..0df2b3d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+		-I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core 
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y :=  iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+	       iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -O1 -g 
+iw_cxgb3-y += core/cxio_dbg.o
+endif
diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
new file mode 100644
index 0000000..e5e9991
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/locking.txt
@@ -0,0 +1,25 @@
+cq lock:
+	- spin lock
+	- used to synchronize the t3_cq
+
+qp lock:
+	- spin lock
+	- used to synchronize updates to the qp state, attrs, and the t3_wq.
+	- touched on interrupt and process context
+	
+rnicp lock:
+	- spin lock
+	- touched on interrupt and process context
+	- used around lookup tables mapping CQID and QPID to a structure.
+	- used also to bump the refcnt atomically with the lookup.
+
+poll:
+	lock+disable on cq lock
+		lock qp lock for each cqe that is polled around the call
+		to cxio_poll_cq().
+	
+post: 
+	lock+disable qp lock
+
+global mutex iwch_mutex:
+	used to maintain global device list.


From sashak at voltaire.com  Sun Dec 10 14:56:13 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 00:56:13 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210215956.GI9205@mellanox.co.il>
References: <20061210215033.GC21155@sashak.voltaire.com>
	<20061210215956.GI9205@mellanox.co.il>
Message-ID: <20061210225613.GF21155@sashak.voltaire.com>

On 23:59 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > Recently I found this OFA 'Userspace Git Trees' downloading howto:
> > 
> > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> > 
> > and thought that we could make it simpler for end-user to choose the
> > "right" git tree just by adding one more series of symbolic links under
> > /pub/scm. This links will point to the maintainer's "official" trees, and
> > we will have only one such link per project.
> > 
> > So typical downloading howto for end-users will looks like:
> > 
> >   git clone git://staging.openfabrics.org/dapl
> >   git clone git://staging.openfabrics.org/ibutils
> >   git clone git://staging.openfabrics.org/imgen
> >   ...
> > 
> > instead of
> > 
> >   git clone git://staging.openfabrics.org/~ardavis/dapl
> >   git clone git://staging.openfabrics.org/~eitan/ibutils
> >   git clone git://staging.openfabrics.org/~mst/imgen
> >   ...
> > 
> > as it is now.
> 
> NACK, please remove this. These soft links are messy, and
> the fact that one needs root just to add a tree shows just how the approach
> is broken.

No, these links are not a replacement for the ~user/ links but an addition
to them, so root is _not_ required to add a tree.

> If you have some temporary tree, just mention this in description,

And when it is not a temporary tree?

> and gitweb will show this. And won't the problem basically go away
> if you move ~sashak temporary trees out of ~/scm?

It is not yet clear to me how long we may need this - 1.1 is still in
SVN, and the 1.1 git branch is updated there.

> It seems we don't
> have a lot of duplicates besides that.

But we will have them - we have been running git hosting for only a week
or so and are already talking about pre-trunk trees for some projects. :)


Other opinions?

Sasha


From randy.dunlap at oracle.com  Sun Dec 10 14:56:02 2006
From: randy.dunlap at oracle.com (Randy Dunlap)
Date: Sun, 10 Dec 2006 14:56:02 -0800
Subject: [openib-general] [PATCH  v3 13/13] Kconfig/Makefile
In-Reply-To: <20061210223916.27166.82130.stgit@dell3.ogc.int>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<20061210223916.27166.82130.stgit@dell3.ogc.int>
Message-ID: <20061210145602.d2a8bb98.randy.dunlap@oracle.com>

On Sun, 10 Dec 2006 16:39:16 -0600 Steve Wise wrote:

>  drivers/infiniband/Kconfig              |    1 +
>  drivers/infiniband/Makefile             |    1 +
>  drivers/infiniband/hw/cxgb3/Kconfig     |   27 +++++++++++++++++++++++++++
>  drivers/infiniband/hw/cxgb3/Makefile    |   12 ++++++++++++
>  drivers/infiniband/hw/cxgb3/locking.txt |   25 +++++++++++++++++++++++++
>  5 files changed, 66 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
> new file mode 100644
> index 0000000..84f0f6e
> --- /dev/null
> +++ b/drivers/infiniband/hw/cxgb3/Kconfig
> @@ -0,0 +1,27 @@
> +config INFINIBAND_CXGB3
> +	tristate "Chelsio RDMA Driver"
> +	depends on CHELSIO_T3 && INFINIBAND
> +	select GENERIC_ALLOCATOR
> +	---help---
> +	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
> +	  10GbE adapters.
> +
> +          For general information about Chelsio and our products, visit
> +          our website at <http://www.chelsio.com>.
> +
> +          For customer support, please visit our customer support page at
> +          <http://www.chelsio.com/support.htm>.
> +
> +          Please send feedback to <linux-bugs at chelsio.com>.
> +
> +          To compile this driver as a module, choose M here: the module
> +          will be called iw_cxgb3.

Please indent all of that the same amount.
Kconfig help text should be indented 1 tab + 2 spaces,
like the first 2 lines are.
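
A coarse way to catch this kind of mix-up before posting is to look for
lines that begin with a space instead of a tab. The sketch below assumes
a hypothetical helper name (check_kconfig_indent) and only flags
space-led lines; it does not verify the full "1 tab + 2 spaces" rule.

```shell
#!/bin/sh
# Hypothetical helper, not part of any kernel tooling.
# Prints "lineno:text" for every line of the given file that starts
# with a space, and returns 1 if any such line was found.
check_kconfig_indent() {
    if grep -n '^ ' "$1"; then
        return 1    # at least one space-led line was printed
    fi
    return 0        # no space-led lines
}
```

Running it on the Kconfig file from the patch would list the help lines
that start with ten spaces.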


> diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
> new file mode 100644
> index 0000000..e5e9991
> --- /dev/null
> +++ b/drivers/infiniband/hw/cxgb3/locking.txt
> @@ -0,0 +1,25 @@
> +cq lock:
> +	- spin lock
> +	- used to synchronize the t3_cq
> +
> +qp lock:
> +	- spin lock
> +	- used to synchronize updates to the qp state, attrs, and the t3_wq.
> +	- touched on interrupt and process context
> +	
> +rnicp lock:
> +	- spin lock
> +	- touched on interrupt and process context
> +	- used around lookup tables mapping CQID and QPID to a structure.
> +	- used also to bump the refcnt atomically with the lookup.
> +
> +poll:
> +	lock+disable on cq lock
> +		lock qp lock for each cqe that is polled around the call
> +		to cxio_poll_cq().
> +	
> +post: 
> +	lock+disable qp lock
> +
> +global mutex iwch_mutex:
> +	used to maintain global device list.

Should be in Documentation/infiniband/.
Docs go in the Documentation/ dir, not in drivers/ dir.

---
~Randy


From swise at opengridcomputing.com  Sun Dec 10 15:04:14 2006
From: swise at opengridcomputing.com (Steve WIse)
Date: Sun, 10 Dec 2006 17:04:14 -0600
Subject: [openib-general] [PATCH  v3 13/13] Kconfig/Makefile
In-Reply-To: <20061210145602.d2a8bb98.randy.dunlap@oracle.com>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<20061210223916.27166.82130.stgit@dell3.ogc.int>
	<20061210145602.d2a8bb98.randy.dunlap@oracle.com>
Message-ID: <1165791854.25243.11.camel@linux-q667.site>

> > +++ b/drivers/infiniband/hw/cxgb3/Kconfig
> > @@ -0,0 +1,27 @@
> > +config INFINIBAND_CXGB3
> > +	tristate "Chelsio RDMA Driver"
> > +	depends on CHELSIO_T3 && INFINIBAND
> > +	select GENERIC_ALLOCATOR
> > +	---help---
> > +	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
> > +	  10GbE adapters.
> > +
> > +          For general information about Chelsio and our products, visit
> > +          our website at <http://www.chelsio.com>.
> > +
> > +          For customer support, please visit our customer support page at
> > +          <http://www.chelsio.com/support.htm>.
> > +
> > +          Please send feedback to <linux-bugs at chelsio.com>.
> > +
> > +          To compile this driver as a module, choose M here: the module
> > +          will be called iw_cxgb3.
> 
> Please indent all of that the same amount.
> Kconfig help text should be indented 1 tab + 2 spaces,
> like the first 2 lines are.
> 

Will do.

> 
> > diff --git a/drivers/infiniband/hw/cxgb3/locking.txt b/drivers/infiniband/hw/cxgb3/locking.txt
> > new file mode 100644
> > index 0000000..e5e9991
> > --- /dev/null
> > +++ b/drivers/infiniband/hw/cxgb3/locking.txt
> > @@ -0,0 +1,25 @@
> > +cq lock:
> > +	- spin lock
> > +	- used to synchronize the t3_cq
> > +
> > +qp lock:
> > +	- spin lock
> > +	- used to synchronize updates to the qp state, attrs, and the t3_wq.
> > +	- touched on interrupt and process context
> > +	
> > +rnicp lock:
> > +	- spin lock
> > +	- touched on interrupt and process context
> > +	- used around lookup tables mapping CQID and QPID to a structure.
> > +	- used also to bump the refcnt atomically with the lookup.
> > +
> > +poll:
> > +	lock+disable on cq lock
> > +		lock qp lock for each cqe that is polled around the call
> > +		to cxio_poll_cq().
> > +	
> > +post: 
> > +	lock+disable qp lock
> > +
> > +global mutex iwch_mutex:
> > +	used to maintain global device list.
> 
> Should be in Documentation/infiniband/.
> Docs go in the Documentation/ dir, not in drivers/ dir.
> 

I think I'll just remove this file.  I don't think it's that useful...


Steve.


From mst at mellanox.co.il  Sun Dec 10 15:05:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 01:05:15 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210225613.GF21155@sashak.voltaire.com>
References: <20061210215033.GC21155@sashak.voltaire.com>
	<20061210215956.GI9205@mellanox.co.il>
	<20061210225613.GF21155@sashak.voltaire.com>
Message-ID: <20061210230515.GJ9205@mellanox.co.il>

> > > Recently I found this OFA 'Userspace Git Trees' downloading howto:
> > > 
> > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> > > 
> > > and thought that we could make it simpler for end-user to choose the
> > > "right" git tree just by adding one more series of symbolic links under
> > > /pub/scm. This links will point to the maintainer's "official" trees, and
> > > we will have only one such link per project.
> > > 
> > > So typical downloading howto for end-users will looks like:
> > > 
> > >   git clone git://staging.openfabrics.org/dapl
> > >   git clone git://staging.openfabrics.org/ibutils
> > >   git clone git://staging.openfabrics.org/imgen
> > >   ...
> > > 
> > > instead of
> > > 
> > >   git clone git://staging.openfabrics.org/~ardavis/dapl
> > >   git clone git://staging.openfabrics.org/~eitan/ibutils
> > >   git clone git://staging.openfabrics.org/~mst/imgen
> > >   ...
> > > 
> > > as it is now.
> > 
> > NACK, please remove this. These soft links are messy, and
> > the fact that one needs root just to add a tree shows just how the approach
> > is broken.
> 
> No, it is not instead, but in addition to ~user/ links, so root is _not_
> required to add tree.

Right, but suddenly root is needed to make it "official".
Let's avoid the whole policy-setting-by-softlinks.
"I have root" should not equal, or be required for "I say what's official".

> > If you have some temporary tree, just mention this in description,
> 
> And when it is not temporary tree?

Say what it is in the description.
Put a link in wiki.

> > and gitweb will show this. And won't the problem basically go away
> > if you move ~sashak temporary trees out of ~/scm?
> 
> For me it is unclear yet how long we may need this - 1.1 still be in
> SVN yet, and 1.1 git branch is updated there.

So ~sashak/scm things track the 1.1 branch in git?
Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly?

> > It seems we don't
> > have a lot of duplicates besides that.
> 
> But we will have - we are running git hosting only week or so and already
> talking about pre-trunk trees for some projects. :)

These should be branches, not separate trees.
So no issue there.

-- 
MST


From mst at mellanox.co.il  Sun Dec 10 15:10:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 01:10:04 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061210221805.GD21155@sashak.voltaire.com>
References: <20061210221805.GD21155@sashak.voltaire.com>
Message-ID: <20061210231004.GK9205@mellanox.co.il>

> On 23:39 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > > Sean, you can do
> > > 
> > >   chmod 755 hooks/post-update
> > > 
> > > This hook runs git-server-update-info after each push.
> > 
> > It seems we really want this as default.
> > Sasha, could you please
> > chmod 755 /usr/share/git-core/templates/hooks/pre-commit
> > so that this will be the default for all new users?
> 
> Would prefer to not do this. All hooks are "off" is reasonable default
> IMO and this should be tree maintainer's decision to enable specific
> hook or not.
> 
> If somebody needs help with setup, we can help, or we could write sort
> of 'howto' if there are common problems. But I think we cannot take
> "ownership" there.

Defaults should be sane and help people.
Everyone can still override the template or disable
the hook.
So how is it taking ownership?

-- 
MST


From eeb at bartonsoftware.com  Sun Dec 10 15:08:45 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Sun, 10 Dec 2006 23:08:45 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <adapsau6t1p.fsf@cisco.com>
Message-ID: <076a01c71cb0$244a7630$0281a8c0@ebpc>

Roland,

> No other kernel subsystem has one, so I don't think it's realistic to
> expect one for IB.

Don't you think it would be useful?  Even if only to make API changes
explicit?

                Cheers,
                        Eric


From swise at opengridcomputing.com  Sun Dec 10 15:15:15 2006
From: swise at opengridcomputing.com (Steve WIse)
Date: Sun, 10 Dec 2006 17:15:15 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165530250.14449.85.camel@stevo-desktop>
References: <BAE9DCEF64577A439B3A37F36F9B691C014BE756@orsmsx418.amr.corp.intel.com>
	<1165530250.14449.85.camel@stevo-desktop>
Message-ID: <1165792515.25243.20.camel@linux-q667.site>

On Thu, 2006-12-07 at 16:24 -0600, Steve Wise wrote:
> On Thu, 2006-12-07 at 14:21 -0800, Woodruff, Robert J wrote:
> > Steve wrote,
> > >Yea maybe.  For now, you get everything I need to make cxgb3 run on
> > >2.6.19.  I'll think about the multiple branch approach. 
> > 
> > The issue is this. I am working on putting together an OFA integration
> > tree that integrates several components from several different
> > developers.
> > The same will be true when we start to integrate code into OFED 1.2.
> > Most code will come from Linus's tree, but some code will need to
> > come directly from the developer's git trees and we will need 
> > a way to generate a patch for only your code, as we will get things like
> > the local_sa cache code directly from Sean's. 
> > 
> > So if you can make a branch that only contains the cxgb3 code, it makes
> > generating a patch with only your code easier, and this will be needed
> > both for my early OFA integration work and also for OFED 1.2. 
> > Once your code is upstream, life is easier as we will get it from
> > linus, until then we'd like a way to patch the existing released kernel
> > (2.6.19 in this case) with your code. 
> > 
> > make sense ?
> 
> I understand.

I've updated the tree and it now includes 2 branches:  cxgb3 and
cxgb3_prereqs.  To see only the Chelsio T3 drivers (with needed
infiniband/core changes):

git-diff --patch-with-stat cxgb3_prereqs cxgb3

The cxgb3_prereqs branch includes anything I want in my tree for testing
the chelsio code.  Currently that includes krping and Sean's ucma code.
BTW: the IWCM core fixes are now in Linus's tree so I no longer need
them explicitly.  The cxgb3 branch includes everything from the
cxgb3_prereqs branch plus all the T3 drivers under review now.
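
The two-branch layout can be tried out in a throwaway repository. This is
only a sketch: the branch names come from this message, while the helper
names (branch_patch, demo_repo), file names and commit messages are made
up for illustration.

```shell
#!/bin/sh
# Print the patch (with diffstat) that turns branch $2 into branch $3
# in repository $1.
branch_patch() {
    git -C "$1" diff --patch-with-stat "$2" "$3"
}

# Build a scratch repository with a base branch and a branch on top of
# it, then print its path. All content here is hypothetical.
demo_repo() {
    repo=$(mktemp -d)
    git -C "$repo" init -q
    git -C "$repo" checkout -q -b cxgb3_prereqs
    echo "prereq" > "$repo/prereq.txt"
    git -C "$repo" add prereq.txt
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "prerequisites"
    git -C "$repo" checkout -q -b cxgb3
    echo "driver" > "$repo/driver.txt"
    git -C "$repo" add driver.txt
    git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m "driver"
    echo "$repo"
}
```

Calling branch_patch "$(demo_repo)" cxgb3_prereqs cxgb3 shows only the
file added on the second branch, which is the property an integrator
needs when picking up just the driver code.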

NOTE:  This git tree is based on Linus's tree and I merged up to
his latest on 12/8.  So it's past 2.6.19 and now depends on changes that
are post 2.6.19 (the workqueue changes).


Steve.


From sashak at voltaire.com  Sun Dec 10 15:28:44 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 01:28:44 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061210231004.GK9205@mellanox.co.il>
References: <20061210221805.GD21155@sashak.voltaire.com>
	<20061210231004.GK9205@mellanox.co.il>
Message-ID: <20061210232844.GA32199@sashak.voltaire.com>

On 01:10 Mon 11 Dec     , Michael S. Tsirkin wrote:
> > On 23:39 Sun 10 Dec     , Michael S. Tsirkin wrote:
> > > > Sean, you can do
> > > > 
> > > >   chmod 755 hooks/post-update
> > > > 
> > > > This hook runs git-server-update-info after each push.
> > > 
> > > It seems we really want this as default.
> > > Sasha, could you please
> > > chmod 755 /usr/share/git-core/templates/hooks/pre-commit
> > > so that this will be the default for all new users?
> > 
> > Would prefer to not do this. All hooks are "off" is reasonable default
> > IMO and this should be tree maintainer's decision to enable specific
> > hook or not.
> > 
> > If somebody needs help with setup, we can help, or we could write sort
> > of 'howto' if there are common problems. But I think we cannot take
> > "ownership" there.
> 
> Defaults should be sane and help people.
> Everyone can still override the template or disable
> the hook.

Right, and everyone can enable this, if _he_ wants.

Sasha


From sashak at voltaire.com  Sun Dec 10 15:36:57 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 01:36:57 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210230515.GJ9205@mellanox.co.il>
References: <20061210215033.GC21155@sashak.voltaire.com>
	<20061210215956.GI9205@mellanox.co.il>
	<20061210225613.GF21155@sashak.voltaire.com>
	<20061210230515.GJ9205@mellanox.co.il>
Message-ID: <20061210233657.GB32199@sashak.voltaire.com>

On 01:05 Mon 11 Dec     , Michael S. Tsirkin wrote:
> > > > Recently I found this OFA 'Userspace Git Trees' downloading howto:
> > > > 
> > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> > > > 
> > > > and thought that we could make it simpler for end-user to choose the
> > > > "right" git tree just by adding one more series of symbolic links under
> > > > /pub/scm. This links will point to the maintainer's "official" trees, and
> > > > we will have only one such link per project.
> > > > 
> > > > So typical downloading howto for end-users will looks like:
> > > > 
> > > >   git clone git://staging.openfabrics.org/dapl
> > > >   git clone git://staging.openfabrics.org/ibutils
> > > >   git clone git://staging.openfabrics.org/imgen
> > > >   ...
> > > > 
> > > > instead of
> > > > 
> > > >   git clone git://staging.openfabrics.org/~ardavis/dapl
> > > >   git clone git://staging.openfabrics.org/~eitan/ibutils
> > > >   git clone git://staging.openfabrics.org/~mst/imgen
> > > >   ...
> > > > 
> > > > as it is now.
> > > 
> > > NACK, please remove this. These soft links are messy, and
> > > the fact that one needs root just to add a tree shows just how the approach
> > > is broken.
> > 
> > No, it is not instead, but in addition to ~user/ links, so root is _not_
> > required to add tree.
> 
> right but suddenly root is needed to make it "official".
> Let's avoid the whole policy-setting-by-softlinks.
> "I have root" should not equal, or be required for "I say what's official".

What are you trying to avoid? That only the sysadmin will decide which git
tree will be "official" for OFED and which will not?

> 
> > > If you have some temporary tree, just mention this in description,
> > 
> > And when it is not temporary tree?
> 
> Say what it is in the description.
> Put a link in wiki.
> 
> > > and gitweb will show this. And won't the problem basically go away
> > > if you move ~sashak temporary trees out of ~/scm?
> > 
> > For me it is unclear yet how long we may need this - 1.1 still be in
> > SVN yet, and 1.1 git branch is updated there.
> 
> So ~sashak/scm things track the 1.1 branch in git?

All active SVN branches.

> Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly?
> 
> > > It seems we don't
> > > have a lot of duplicates besides that.
> > 
> > But we will have - we are running git hosting only week or so and already
> > talking about pre-trunk trees for some projects. :)
> 
> These should be branches, not separate trees.

Why not?

Sasha


From rdreier at cisco.com  Sun Dec 10 20:02:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 20:02:20 -0800
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
Message-ID: <adafybn2i7n.fsf@cisco.com>

I haven't seen any evidence of the corresponding ethernet NIC driver
being merged for 2.6.20 (which is a prerequisite, right?).

What's the status of that?

 - R.


From rdreier at cisco.com  Sun Dec 10 21:02:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 21:02:20 -0800
Subject: [openib-general] [PATCH  v3 13/13] Kconfig/Makefile
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<20061210223916.27166.82130.stgit@dell3.ogc.int>
	<20061210145602.d2a8bb98.randy.dunlap@oracle.com>
Message-ID: <adaac1v2ffn.fsf@cisco.com>

 > > +++ b/drivers/infiniband/hw/cxgb3/locking.txt

 > Should be in Documentation/infiniband/.
 > Docs go in the Documentation/ dir, not in drivers/ dir.

Or put it in a comment in the appropriate header, if you want to keep
it close to the driver source...


From rdreier at cisco.com  Sun Dec 10 21:27:20 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Sun, 10 Dec 2006 21:27:20 -0800
Subject: [openib-general] cannot clone librdmacm
References: <20061210221805.GD21155@sashak.voltaire.com>
	<20061210231004.GK9205@mellanox.co.il>
	<20061210232844.GA32199@sashak.voltaire.com>
Message-ID: <ada4ps32e9z.fsf@cisco.com>

 > Right, and everyone can enable this, if _he_ wants.

I think the point is that in the OFA environment, there's no obvious
reason to disable the hook, since without the hook http:// transport
is broken.

So it makes sense to help people who aren't necessarily git experts,
and pick a default that makes things work smoothly.  Experts can
disable the hook if there's some reason to do so (although to be
honest I don't see any reason).
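
For a single repository, enabling the hook might look like the sketch
below. The function name is hypothetical, and the hook body is an
assumption modelled on the stock post-update hook, which runs
git update-server-info so that dumb transports such as http:// can find
the refs and packs.

```shell
#!/bin/sh
# Hypothetical helper: make sure a bare repository ($1) has an
# executable post-update hook, and refresh the server info files once.
enable_post_update() {
    repo=$1
    hook="$repo/hooks/post-update"
    mkdir -p "$repo/hooks"
    if [ ! -e "$hook" ]; then
        # No hook present: create a minimal one (assumed content).
        printf '#!/bin/sh\nexec git update-server-info\n' > "$hook"
    fi
    chmod 755 "$hook"
    # Regenerate the info files for the current state of the repository.
    git --git-dir="$repo" update-server-info
}
```

Making the template hook executable instead would give every newly
created repository the same behaviour without this per-tree step.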

 - R.


From mst at mellanox.co.il  Sun Dec 10 21:48:08 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 07:48:08 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210233657.GB32199@sashak.voltaire.com>
References: <20061210233657.GB32199@sashak.voltaire.com>
Message-ID: <20061211054539.GL9205@mellanox.co.il>

> > > > > Recently I found this OFA 'Userspace Git Trees' downloading howto:
> > > > > 
> > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> > > > > 
> > > > > and thought that we could make it simpler for end-user to choose the
> > > > > "right" git tree just by adding one more series of symbolic links under
> > > > > /pub/scm. This links will point to the maintainer's "official" trees, and
> > > > > we will have only one such link per project.
> > > > > 
> > > > > So typical downloading howto for end-users will looks like:
> > > > > 
> > > > >   git clone git://staging.openfabrics.org/dapl
> > > > >   git clone git://staging.openfabrics.org/ibutils
> > > > >   git clone git://staging.openfabrics.org/imgen
> > > > >   ...
> > > > > 
> > > > > instead of
> > > > > 
> > > > >   git clone git://staging.openfabrics.org/~ardavis/dapl
> > > > >   git clone git://staging.openfabrics.org/~eitan/ibutils
> > > > >   git clone git://staging.openfabrics.org/~mst/imgen
> > > > >   ...
> > > > > 
> > > > > as it is now.
> > > > 
> > > > NACK, please remove this. These soft links are messy, and
> > > > the fact that one needs root just to add a tree shows just how the approach
> > > > is broken.
> > > 
> > > No, it is not instead, but in addition to ~user/ links, so root is _not_
> > > required to add tree.
> > 
> > right but suddenly root is needed to make it "official".
> > Let's avoid the whole policy-setting-by-softlinks.
> > "I have root" should not equal, or be required for "I say what's official".
> 
> What are you trying to avoid? That only sysadmin will decide which git
> tree will be "official" for OFED and which will not?

Yes. Another point is that I should not need sysadmin privileges to create
a new project and declare my tree the official source.

But not only that - staging is used to develop more than just OFED.  Read
the rant part in the original mail if you like for more detail - development
trees should all be equal. Only releases should be official.  And a release
has an immutable name, so it does not *matter* which tree you get it from.

> > 
> > > > If you have some temporary tree, just mention this in description,
> > > 
> > > And when it is not temporary tree?
> > 
> > Say what it is in the description.
> > Put a link in wiki.
> > 
> > > > and gitweb will show this. And won't the problem basically go away
> > > > if you move ~sashak temporary trees out of ~/scm?
> > > 
> > > For me it is unclear yet how long we may need this - 1.1 still be in
> > > SVN yet, and 1.1 git branch is updated there.
> > 
> > So ~sashak/scm things track the 1.1 branch in git?
> 
> All active SVN branches.

But there *shouldn't* be any active SVN branches now besides the 1.1 branch.
So the rest can be killed off.

> > Move it to ~sashak/scm/ofed-1.1 then, and set the description accordingly?
> > 
> > > > It seems we don't
> > > > have a lot of duplicates besides that.
> > > 
> > > But we will have - we are running git hosting only week or so and already
> > > talking about pre-trunk trees for some projects. :)
> > 
> > These should be branches, not separate trees.
> 
> Why not?

You seem to have a fear of branches :). Many trees do not buy you anything;
I tried this with ofed 1.1 in the beginning.

You can have many trees. But a single project maintained by a single person
belongs in a single public tree; scattering it across multiple trees
just makes it messy for people to track, and messy to figure out the delta
between branches. Finally, it wastes space.

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 00:24:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 10:24:10 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061210134137.GL29174@mellanox.co.il>
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
	<20061210134137.GL29174@mellanox.co.il>
Message-ID: <20061211082410.GB29276@mellanox.co.il>

> The following patch adds experimental support for IPoIB connected mode.
> The idea is to increase performance by increasing the MTU
> from the maximum of 2K (theoretically 4K) supported by IPoIB on top of UD.
> With this code, I'm able to get 800MByte/sec or more with netperf
> without options on a Mellanox 4x back-to-back DDR system.
> 
> Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

BTW, Roland, could you give me some indication of whether this
has a chance of getting into 2.6.20? If yes, I'll stop writing new code
and focus on polishing this.

-- 
MST


From erezz at voltaire.com  Mon Dec 11 01:20:55 2006
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 11 Dec 2006 11:20:55 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <20061127071729.GA6925@mellanox.co.il>
References: <456A8FB5.9060602@voltaire.com>
	<20061127071729.GA6925@mellanox.co.il>
Message-ID: <457D22F7.6060507@voltaire.com>

Michael S. Tsirkin wrote:
>>> More than this - since ofed really starts from kernel.org kernel,
>>> just give us the list of files and ofed scripts will check that
>>> out and build. You'll have to backport open-iscsi to distro kernels though.
>>>       
Here are the open-iscsi kernel files:

drivers/scsi/iscsi_tcp.c
drivers/scsi/iscsi_tcp.h
drivers/scsi/libiscsi.c
drivers/scsi/scsi_transport_iscsi.c
include/scsi/iscsi_if.h
include/scsi/iscsi_proto.h
include/scsi/libiscsi.h
include/scsi/scsi_transport_iscsi.h
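
One way a checkout script could pull just these files out of a kernel
tree is sketched below. This is an assumption about the mechanism, not
a description of the actual OFED scripts; export_files and its arguments
are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: export selected paths from a git tree.
# $1 = repository, $2 = ref (tag, branch or commit),
# $3 = destination directory, remaining arguments = paths to export.
export_files() {
    repo=$1; ref=$2; dest=$3
    shift 3
    mkdir -p "$dest"
    # git archive writes a tar stream containing only the named paths.
    git -C "$repo" archive "$ref" "$@" | tar -x -C "$dest" -f -
}
```

For example, export_files /path/to/linux HEAD /tmp/iscsi followed by the
eight paths above would leave only those files, with their directory
layout, under /tmp/iscsi.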

Thanks,
Erez


From mst at mellanox.co.il  Mon Dec 11 01:25:56 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 11:25:56 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <457D22F7.6060507@voltaire.com>
References: <457D22F7.6060507@voltaire.com>
Message-ID: <20061211092556.GC29276@mellanox.co.il>

> >>> More than this - since ofed really starts from kernel.org kernel,
> >>> just give us the list of files and ofed scripts will check that
> >>> out and build. You'll have to backport open-iscsi to distro kernels though.
> >>>       
> Here are the open-iscsi kernel files:
> 
> drivers/scsi/iscsi_tcp.c
> drivers/scsi/iscsi_tcp.h
> drivers/scsi/libiscsi.c
> drivers/scsi/scsi_transport_iscsi.c
> include/scsi/iscsi_if.h
> include/scsi/iscsi_proto.h
> include/scsi/libiscsi.h
> include/scsi/scsi_transport_iscsi.h

OK. So after we add that to the checkout scripts (I hope Vlad can do this
this week), the next thing you'll need is to add backport patches/addons
and update the makefile to build iscsi.

-- 
MST


From erezz at voltaire.com  Mon Dec 11 01:31:21 2006
From: erezz at voltaire.com (Erez Zilber)
Date: Mon, 11 Dec 2006 11:31:21 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <20061211092556.GC29276@mellanox.co.il>
References: <457D22F7.6060507@voltaire.com>
	<20061211092556.GC29276@mellanox.co.il>
Message-ID: <457D2569.2000805@voltaire.com>

Michael S. Tsirkin wrote:
>>>>> More than this - since ofed really starts from kernel.org kernel,
>>>>> just give us the list of files and ofed scripts will check that
>>>>> out and build. You'll have to backport open-iscsi to distro kernels though.
>>>>>       
>>>>>           
>> Here are the open-iscsi kernel files:
>>
>> drivers/scsi/iscsi_tcp.c
>> drivers/scsi/iscsi_tcp.h
>> drivers/scsi/libiscsi.c
>> drivers/scsi/scsi_transport_iscsi.c
>> include/scsi/iscsi_if.h
>> include/scsi/iscsi_proto.h
>> include/scsi/libiscsi.h
>> include/scsi/scsi_transport_iscsi.h
>>     
>
> OK. So after we'll add that to checkout scripts (hope Vlad can do this this
> week), next thing you'll need is to add backport patches/addons and update makefile
> to build iscsi.
>
>   
I understand that the kernel version that OFED 1.2 will be based on is
not yet known (or am I wrong?). In order to create backport patches for a
specific distro, I need to know where I start from (i.e. which kernel
version).

Erez


From eitan at mellanox.co.il  Mon Dec 11 01:51:29 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 11 Dec 2006 11:51:29 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
Message-ID: <457D2A21.9030804@mellanox.co.il>

Hi,

Currently libsdp.conf is installed into $prefix/etc.
This seems a little non-standard to me. Instead I would think it needs
to go into /etc/infiniband/libsdp.conf.

Any comments - please speak up.

BTW: libsdp.conf used to be overwritten on install.
I have fixed the makefile to avoid that and instead create a
new file with the install date under the same directory.

Thanks

Eitan


From mst at mellanox.co.il  Mon Dec 11 02:06:07 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 12:06:07 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <457D2A21.9030804@mellanox.co.il>
References: <457D2A21.9030804@mellanox.co.il>
Message-ID: <20061211100607.GF29276@mellanox.co.il>


Quoting r. Eitan Zahavi <eitan at mellanox.co.il>:
Subject: libsdp: RFC changing libsdp.conf location

Hi,

> Currently libsdp.conf is installed into $prefix/etc.
> This seems a little non standard to me.

There's no real standard on configuration files in Unix.
So you can do whatever you want within reason.

> Instead I would think it needs 
> to go
> into /etc/infiniband/libsdp.conf.

/etc/infiniband is an OFED thing.
I suggest keeping libsdp separate so that it is
distribution agnostic.

> 
> Any comments - please speak up.

In the past, lots of customers asked that installed files reside
under $prefix. It *is* important since it lets people
find out easily what is added to their systems.
OFED does not follow this rule 100%, but it's better not to
add more exceptions.

> BTW: libsdp.conf used to be overwritten in previous install.
> I have fixed the nakefile to avoid that and instead create a
> new file with install date under the same directory.

So the installed file lands in a different location
depending on the date and on whether I have an old library installed?
This pretty much guarantees the user won't be able to find the file you
have installed: you seem to assume that users read installation logs, but
that's typically not the case.

Why not just have libsdp.conf.example, or something like that, under $prefix/etc
and install that always, and only copy to $prefix/etc/libsdp.conf
if that does not exist?
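
As a sketch, the install step could look like this. The function name
and its arguments are hypothetical; only the file names libsdp.conf and
libsdp.conf.example come from the suggestion above.

```shell
#!/bin/sh
# Hypothetical install step.
# $1 = example configuration shipped with the sources
# $2 = configuration directory, e.g. $prefix/etc
install_libsdp_conf() {
    src=$1; etcdir=$2
    mkdir -p "$etcdir"
    # The example is always refreshed.
    cp "$src" "$etcdir/libsdp.conf.example"
    # The live configuration is created only when missing, so local
    # edits survive a reinstall.
    if [ ! -e "$etcdir/libsdp.conf" ]; then
        cp "$src" "$etcdir/libsdp.conf"
    fi
}
```

This keeps every installed file at a fixed, predictable path under
$prefix/etc regardless of install date or prior installs.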

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 02:22:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 12:22:22 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <457D2A21.9030804@mellanox.co.il>
References: <457D2A21.9030804@mellanox.co.il>
Message-ID: <20061211102222.GB5944@mellanox.co.il>

> BTW: libsdp.conf used to be overwritten in previous install.
> I have fixed the nakefile to avoid that and instead create a
> new file with install date under the same directory.

Here's a simple proposal that will address this issue:
- Make libsdp behave sanely when not libsdp.conf file is present.
  Do not install anything in default location in make install.

- in make install, copy the example configuration file into
  libsdp.conf.example. Add a line to the top of it saying
  "rename this file to libsdp.conf to make libsdp use it".

-- 
MST


From eitan at mellanox.co.il  Mon Dec 11 02:26:49 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 11 Dec 2006 12:26:49 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <20061211102222.GB5944@mellanox.co.il>
References: <457D2A21.9030804@mellanox.co.il>
	<20061211102222.GB5944@mellanox.co.il>
Message-ID: <457D3269.3070401@mellanox.co.il>

Hi Michael,

Thanks. This proposal is simple and clear to me.
Let's wait a day and see if anybody else has other ideas.

Thanks

Eitan

Michael S. Tsirkin wrote:
>> BTW: libsdp.conf used to be overwritten in previous install.
>> I have fixed the makefile to avoid that and instead create a
>> new file with install date under the same directory.
>>     
>
> Here's a simple proposal that will address this issue:
> - Make libsdp behave sanely when not libsdp.conf file is present.
>   Do not install anything in default location in make install.
>
> - in make install, copy the example configuration file into
>   libsdp.conf.example. Add a line to the top of it saying
>   "rename this file to libsdp.conf to make libsdp use it".
>
>   


From mst at mellanox.co.il  Mon Dec 11 02:26:56 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 12:26:56 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <20061211102222.GB5944@mellanox.co.il>
References: <457D2A21.9030804@mellanox.co.il>
	<20061211102222.GB5944@mellanox.co.il>
Message-ID: <20061211102656.GC5944@mellanox.co.il>

> - Make libsdp behave sanely when not libsdp.conf file is present.

This should have been "when libsdp.conf file is not present" :).

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 06:48:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 16:48:13 +0200
Subject: [openib-general] ofed backports update
Message-ID: <20061211144813.GA15870@mellanox.co.il>

Here's a small update on OFED 1.2 backports. This describes a change
I made a couple of weeks ago but never got around to documenting.
NOTE: This info is relevant only for people developing OFED kernel code,
everything is transparent for others.

NOTE: This is by *no means* a comprehensive writeup of OFED build process -
just a small update for people familiar with development in OFED 1.1.

Background:
OFED 1.1 did all backports by applying patches under
kernel_patches/backports/<kernel version>/ directory.
To back-port a package, you just stuck a patch there
and once OFED detected an appropriate kernel, it was applied before build.
In many cases - where the kernel we are back-porting to was simply
missing some macro - what the patch actually did was just add a file
under the include directory, and OFED build scripts knew to pick
these up before standard linux includes.
Managing these became somewhat of a pain as it is often hard to
see the history of a patch: try git diff on a patch that sits in git tree
and see what I mean.

Update:
So for OFED 1.2 I've created a new directory kernel_addons, and converted
all patches that created new files to plain files under the relevant
kernel directory.  OFED scripts now look there for files before standard
Linux headers.
For an example, look at how backport to 2.6.18 looks:
http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD
Unfortunately, not all patches are of this form - some really tweak source
inside the infiniband subtree - but we can strive to reduce the number of these
and in this way make maintaining backports more of a seamless process.

Bottom line
There are now 2 mechanisms for back-porting in OFED:
- if you want to add a kernel-specific file, stick it under
  kernel_addons/backport/<kernel-version>/.
- if you must change an existing file depending on kernel version, stick
  a patch in kernel_patches/backports/<kernel version>/.
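
As an illustration of the first mechanism, here is a minimal sketch of what a
kernel-specific add-on header could look like. The guard name, the
EXAMPLE_DIV_ROUND_UP macro and the example_pages_needed helper are made up for
this example and are not taken from the OFED tree; the point is only that the
header supplies a definition when the target kernel lacks one, so common code
needs no per-kernel patch.

```c
#include <assert.h>

/*
 * Hypothetical contents of a header placed under
 * kernel_addons/backport/<kernel-version>/include/linux/.
 */
#ifndef EXAMPLE_BACKPORT_H
#define EXAMPLE_BACKPORT_H

/* Define the macro only if the headers seen so far did not provide it. */
#ifndef EXAMPLE_DIV_ROUND_UP
#define EXAMPLE_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#endif

#endif /* EXAMPLE_BACKPORT_H */

/* Code shared by all kernel versions can then use the macro unconditionally. */
static inline int example_pages_needed(int bytes, int page_size)
{
	assert(page_size > 0);
	return EXAMPLE_DIV_ROUND_UP(bytes, page_size);
}
```

Because the definition lives in a plain file rather than inside a patch, its
history stays readable with ordinary git tools.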

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 07:06:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 17:06:43 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <457D2569.2000805@voltaire.com>
References: <457D2569.2000805@voltaire.com>
Message-ID: <20061211150643.GC15870@mellanox.co.il>

> >>>>> More than this - since ofed really starts from kernel.org kernel,
> >>>>> just give us the list of files and ofed scripts will check that
> >>>>> out and build. You'll have to backport open-iscsi to distro kernels though.
> >>>>>       
> >>>>>           
> >> Here are the open-iscsi kernel files:
> >>
> >> drivers/scsi/iscsi_tcp.c
> >> drivers/scsi/iscsi_tcp.h
> >> drivers/scsi/libiscsi.c
> >> drivers/scsi/scsi_transport_iscsi.c
> >> include/scsi/iscsi_if.h
> >> include/scsi/iscsi_proto.h
> >> include/scsi/libiscsi.h
> >> include/scsi/scsi_transport_iscsi.h
> >>     
> >
> > OK. So after we'll add that to checkout scripts (hope Vlad can do this this
> > week), next thing you'll need is to add backport patches/addons and update makefile
> > to build iscsi.
> >
> >   
> I understand that the kernel version that OFED 1.2 will be based on is
> unknown yet (or am I wrong?). In order to create backport patches to a
> specific distro, I need to know where I start from (i.e which kernel
> version).

Not really. Start from here:
git://staging.openfabrics.org/~vlad/ofed_1_2/.git

This is currently based on 2.6.19.
Clone this and work off ofed_1_2, test and ask for pull.

Then when there's an -rc from Linus, iser build might break and then you'll need to
fix the backports. However, from experience, if backports are done carefully enough
(separating the actual code in new header files) this is either easy or nothing
breaks. See the mail I've just sent to openib on new tricks we have in OFED 1.2
to make this easier.

-- 
MST


From sashak at voltaire.com  Mon Dec 11 07:28:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 17:28:01 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <ada4ps32e9z.fsf@cisco.com>
References: <20061210221805.GD21155@sashak.voltaire.com>
	<20061210231004.GK9205@mellanox.co.il>
	<20061210232844.GA32199@sashak.voltaire.com>
	<ada4ps32e9z.fsf@cisco.com>
Message-ID: <20061211152801.GC465@sashak.voltaire.com>

On 21:27 Sun 10 Dec     , Roland Dreier wrote:
>  > Right, and everyone can enable this, if _he_ wants.
> 
> I think the point is that in the OFA environment, there's no obvious
> reason to disable the hook, since without the hook http:// transport
> is broken.
> 
> So it makes sense to help people who aren't necessarily git experts,

IMO maintaining a public git repository requires some minimal
experience anyway, and I don't think that hiding such details under
default settings is so helpful - the "default" git defaults are a reasonable
starting point.

And I guess running 'chmod +x hooks/post-update' (after such stuff as
running git-init-db, editing description, pushing whole history, etc...)
is not a big issue. If I'm wrong about that and it is, I'm ready to help
anyone who needs such help.

> and pick a default that makes things work smoothly.  Experts can
> disable the hook if there's some reason to do so (although to be
> honest I don't see any reason).

I may not see any reason either, but this should not be my decision (or
such an aggressive "suggestion" as a hook enabled by default).

BTW we will likely want to set up email notification hooks as well; this
can be more "complicated" than just 'chmod +x'. I guess we will not want
to prepare an executable hook template with predefined email addresses...

Sasha


From jlentini at netapp.com  Mon Dec 11 07:23:11 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 11 Dec 2006 10:23:11 -0500 (EST)
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <4579C6C3.5090207@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
Message-ID: <Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>


A couple of questions Vu:

What NFS-RDMA release are you using? This looks like release 7.

Is this reproducible?

What kernel version are you using?

What hardware is this on? It looks like x86-64 to me, which is fine. I 
just want to be sure I know what I'm looking at. As many specifics as 
possible is good (number of CPUs, hyperthreading, etc.)

Could you send the output of 

objdump -Slr /path/to/kernel/mm/swap.o

Actually, just the put_page disassembly is all I want to see.

Is there any more text available? Usually there is an explanation 
given for an oops message (e.g. "Unable to handle kernel paging 
request..").

I opened a bug at the NFS-RDMA SourceForge project to track this:

http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583

Thanks for reporting this.
james

On Fri, 8 Dec 2006, Vu Pham wrote:

> Hi James,
>   I got these errors in the server's /var/log/messages, and then the server
> stopped responding to login, I/O...; however, the server is still up and ipoib
> is still working
> 
> 
> Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
> [<ffffffff8025dff7>] put_page+0x17/0x40
> Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 00010246
> Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001
> RCX: 000000000003ffff
> Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001
> RDI: ffff8102274e92f8
> Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034
> R09: 0000000000000000
> Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000
> R12: ffff81020ef96800
> Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000
> R15: ffff8102053ee890
> Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
> GS:ffff81022066eb40(0000) knlGS:0000000000000000
> Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
> Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000
> CR4: 00000000000006e0
> Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
> ffff810219dde000, task ffff81020d87f0c0)
> Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 ffff81020ef96968
> ffff81020ef96800 ffff81020ef96958
> Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
> ffffffff80424e05 0000000000000000
> Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
> ffffffff80239b90 ffff81020d87f0c0
> Dec  8 06:38:21 ibd201 kernel: Call Trace:
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
> :sunrpc:svc_rdma_put_context+0x37/0xd0
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
> schedule_timeout+0x95/0xb0
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] process_timeout+0x0/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
> wait_for_completion_timeout+0xcd/0x150
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> default_wake_function+0x0/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
> :ib_mthca:mthca_cmd_post+0x232/0x260
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> default_wake_function+0x0/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] __next_cpu+0x19/0x30
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
> find_busiest_group+0x24e/0x6d0
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] thread_return+0x0/0xde
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
> _spin_unlock_irqrestore+0x8/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
> try_to_del_timer_sync+0x51/0x60
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] del_timer_sync+0xc/0x20
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
> schedule_timeout+0x95/0xb0
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
> :sunrpc:svc_recv+0x416/0x510
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> default_wake_function+0x0/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> default_wake_function+0x0/0x10
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] :nfsd:nfsd+0x111/0x380
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] child_rip+0xa/0x12
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] child_rip+0x0/0x12
> Dec  8 06:38:21 ibd201 kernel:
> Dec  8 06:38:21 ibd201 kernel:
> Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08
> 0f 94 c0 84 c0 74
> Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] put_page+0x17/0x40
> Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
> 
> -vu
> 


From mst at mellanox.co.il  Mon Dec 11 07:25:52 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 17:25:52 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061211152801.GC465@sashak.voltaire.com>
References: <20061211152801.GC465@sashak.voltaire.com>
Message-ID: <20061211152552.GD15870@mellanox.co.il>

> On 21:27 Sun 10 Dec     , Roland Dreier wrote:
> >  > Right, and everyone can enable this, if _he_ wants.
> > 
> > I think the point is that in the OFA environment, there's no obvious
> > reason to disable the hook, since without the hook http:// transport
> > is broken.
> > 
> > So it makes sense to help people who aren't necessarily git experts,
> 
> IMO maintaining a public git repository requires some minimal
> experience anyway, and I don't think that hiding such details under
> default settings is so helpful - the "default" git defaults are a reasonable
> starting point.

They are, for repositories not exposed with http.
Since all repositories on staging are exposed with http,
the default is wrong in that case and needs to be fixed.
Agree?

-- 
MST


From swise at opengridcomputing.com  Mon Dec 11 07:36:29 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 Dec 2006 09:36:29 -0600
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <adafybn2i7n.fsf@cisco.com>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<adafybn2i7n.fsf@cisco.com>
Message-ID: <1165851389.13419.3.camel@stevo-desktop>

On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote:
> I haven't seen any evidence of the corresponding ethernet NIC driver
> being merged for 2.6.20 (which is a prerequisite, right).
> 
> What's the status of that?
> 

It is on its third or fourth round of review.  The last driver, posted on
12/7, was merged up to Linus's latest tree, probably as of 12/7.  I know
the comments said it was against 2.6.19, but it was really Linus's
latest.

Divy, can you expand on this?


Steve.


From mst at mellanox.co.il  Mon Dec 11 07:39:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 17:39:10 +0200
Subject: [openib-general] [PATCH untested] mthca: speed up memory
	registrations
Message-ID: <20061211153910.GE15870@mellanox.co.il>

Speed up memory registration by filling in MTTs directly.  This reduces the
number of FW commands needed to register an MR by at least a factor of 2.  This
applies to all memfree cards, and to tavor mode on 64 bit systems with the patch
I posted earlier.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

Roland, I'm posting this untested patch to get style comments out of the way
early while I'm testing it.

Note that this is *not* FMR - this is strictly compliant IB memory registration since
MPTs are still updated using FW command.
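
For readers who want the chunking idea without the driver context, here is a
stand-alone user-space sketch in C. It is not mthca code: write_in_chunks, the
chunk_writer callback and record_chunk are hypothetical names, and the callback
merely stands in for whatever writes one batch of MTT entries. The loop hands
the writer at most max_chunk entries at a time and advances the start index and
the buffer pointer by the amount actually written.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback standing in for a per-chunk writer. */
typedef void (*chunk_writer)(void *ctx, int start_index,
			     const uint64_t *buffer_list, int list_len);

/*
 * Split a list of list_len entries into pieces of at most max_chunk
 * entries and pass each piece to 'writer'.  Returns the number of pieces.
 */
int write_in_chunks(chunk_writer writer, void *ctx, int start_index,
		    const uint64_t *buffer_list, int list_len, int max_chunk)
{
	int chunks = 0;

	assert(writer && max_chunk > 0);

	while (list_len > 0) {
		int chunk = list_len < max_chunk ? list_len : max_chunk;

		/* Pass the chunk length, not the remaining length. */
		writer(ctx, start_index, buffer_list, chunk);

		list_len    -= chunk;
		start_index += chunk;
		buffer_list += chunk;
		chunks++;
	}
	return chunks;
}

/* Example writer: records the total entries and the largest chunk seen. */
struct chunk_stats {
	int total;
	int largest;
};

void record_chunk(void *ctx, int start_index,
		  const uint64_t *buffer_list, int list_len)
{
	struct chunk_stats *st = ctx;

	(void)start_index;
	(void)buffer_list;
	st->total += list_len;
	if (list_len > st->largest)
		st->largest = list_len;
}
```

Calling write_in_chunks(record_chunk, &st, 0, pages, 10, 4) on a ten-entry
array yields pieces of 4, 4 and 2 entries.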

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de
 int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd);
 void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd);
 
+int mthca_write_mtt_size(struct mthca_dev *dev);
+
 struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size);
 void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt);
 int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de
 	kfree(mtt);
 }
 
-int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
-		    int start_index, u64 *buffer_list, int list_len)
+static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			     int start_index, u64 *buffer_list, int list_len)
 {
 	struct mthca_mailbox *mailbox;
 	__be64 *mtt_entry;
@@ -296,6 +296,84 @@ out:
 	return err;
 }
 
+void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	u64 __iomem *mtts;
+	u32 mtt_seg;
+	int i;
+
+	mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE;
+       	mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64);
+	for (i = 0; i < list_len; ++i) {
+		__be64 mtt_entry = cpu_to_be64(buffer_list[i] |
+					       MTHCA_MTT_FLAG_PRESENT);
+		mthca_write64_raw(mtt_entry, mtts + i);
+	}
+}
+
+void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	__be64 *mtts;
+	int i;
+	int s = start_index * sizeof (u64);
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64)) / PAGE_SIZE);
+	/* Require full segments */
+	BUG_ON(s % MTHCA_MTT_SEG_SIZE);
+
+	mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg +
+				s / MTHCA_MTT_SEG_SIZE);
+
+	BUG_ON(!mtts);
+
+	for (i = 0; i < list_len; ++i)
+		mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT);
+}
+
+int mthca_write_mtt_size(struct mthca_dev *dev)
+{
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		/*
+		 * Be friendly to WRITE_MTT command
+		 * and leave two empty slots for the
+		 * index and reserved fields of the
+		 * mailbox.
+		 */
+		return PAGE_SIZE / sizeof (u64) - 2;
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ffffff;
+}
+
+int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+		    int start_index, u64 *buffer_list, int list_len)
+{
+	int size = mthca_write_mtt_size(dev);
+	int chunk;
+
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len);
+
+	while (list_len > 0) {
+		chunk = min(size, list_len);
+		if (mthca_is_memfree(dev))
+			mthca_arbel_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+		else
+			mthca_tavor_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+
+		list_len    -= chunk;
+		start_index += chunk;
+		buffer_list += chunk;
+	}
+
+	return 0;
+}
+
 static inline u32 tavor_hw_index_to_key(u32 ind)
 {
 	return ind;
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s
 	int shift, n, len;
 	int i, j, k;
 	int err = 0;
+	int write_mtt_size;
 
 	shift = ffs(region->page_size) - 1;
 
@@ -1040,6 +1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s
 
 	i = n = 0;
 
+	write_mtt_size = min(mthca_write_mtt_size(dev), PAGE_SIZE / sizeof *pages);
+
 	list_for_each_entry(chunk, &region->chunk_list, list)
 		for (j = 0; j < chunk->nmap; ++j) {
 			len = sg_dma_len(&chunk->page_list[j]) >> shift;
@@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s
 				pages[i++] = sg_dma_address(&chunk->page_list[j]) +
 					region->page_size * k;
 				/*
-				 * Be friendly to WRITE_MTT command
-				 * and leave two empty slots for the
-				 * index and reserved fields of the
-				 * mailbox.
+				 * Be friendly to write_mtt and pass it chunks
+				 * of appropriate size.
 				 */
-				if (i == PAGE_SIZE / sizeof (u64) - 2) {
-					err = mthca_write_mtt(dev, mr->mtt,
-							      n, pages, i);
+				if (i == write_mtt_size) {
+					err = mthca_write_mtt(dev, mr->mtt, n, pages, i);
 					if (err)
 						goto mtt_done;
 					n += i;
-- 
MST


From sashak at voltaire.com  Mon Dec 11 09:16:01 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 19:16:01 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061211152552.GD15870@mellanox.co.il>
References: <20061211152801.GC465@sashak.voltaire.com>
	<20061211152552.GD15870@mellanox.co.il>
Message-ID: <20061211171601.GG465@sashak.voltaire.com>

On 17:25 Mon 11 Dec     , Michael S. Tsirkin wrote:
> > On 21:27 Sun 10 Dec     , Roland Dreier wrote:
> > >  > Right, and everyone can enable this, if _he_ wants.
> > > 
> > > I think the point is that in the OFA environment, there's no obvious
> > > reason to disable the hook, since without the hook http:// transport
> > > is broken.
> > > 
> > > So it makes sense to help people who aren't necessarily git experts,
> > 
> > IMO maintaining a public git repository requires some minimal
> > experience anyway, and I don't think that hiding such details under
> > default settings is so helpful - the "default" git defaults are a reasonable
> > starting point.
> 
> They are, for repositories not exposed with http.
> Since all repositories on staging are exposed with http,

I don't know that this is true of all repositories on staging (including
ones not yet created, which the default will affect).

> the default is wrong in that case and needs to be fixed.
> Agree?

No (and already tried to explain why).

Sasha


From sweitzen at cisco.com  Mon Dec 11 09:19:45 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 11 Dec 2006 09:19:45 -0800
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FA71@xmb-sjc-216.amer.cisco.com>

It's not clear to me.

Are you changing the libsdp.conf location or not?

Can you define "sanely"?

Scott

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
> Sent: Monday, December 11, 2006 2:27 AM
> To: Michael S. Tsirkin
> Cc: Nimrod Gindi; OPENIB GENERAL
> Subject: Re: [openib-general] libsdp: RFC changing 
> libsdp.conf location
> 
> Hi Michael,
> 
> Thanks. This proposal is simple and clear to me.
> Let's wait a day and see if anybody else has other ideas.
> 
> Thanks
> 
> Eitan
> 
> Michael S. Tsirkin wrote:
> >> BTW: libsdp.conf used to be overwritten in previous install.
> >> I have fixed the makefile to avoid that and instead create a
> >> new file with install date under the same directory.
> >>     
> >
> > Here's a simple proposal that will address this issue:
> > - Make libsdp behave sanely when not libsdp.conf file is present.
> >   Do not install anything in default location in make install.
> >
> > - in make install, copy the example configuration file into
> >   libsdp.conf.example. Add a line to the top of it saying
> >   "rename this file to libsdp.conf to make libsdp use it".
> >
> >   
> 
> 


From mst at mellanox.co.il  Mon Dec 11 09:34:59 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 19:34:59 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061211171601.GG465@sashak.voltaire.com>
References: <20061211152801.GC465@sashak.voltaire.com>
	<20061211152552.GD15870@mellanox.co.il>
	<20061211171601.GG465@sashak.voltaire.com>
Message-ID: <20061211173459.GC20344@mellanox.co.il>

> > the default is wrong in that case and needs to be fixed.
> > Agree?
> 
> No (and already tried to explain why).

Can't say I get it.

-- 
MST


From tziporet at dev.mellanox.co.il  Mon Dec 11 09:40:30 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Mon, 11 Dec 2006 19:40:30 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <20061211173459.GC20344@mellanox.co.il>
References: <20061211152801.GC465@sashak.voltaire.com>
	<20061211152552.GD15870@mellanox.co.il>
	<20061211171601.GG465@sashak.voltaire.com>
	<20061211173459.GC20344@mellanox.co.il>
Message-ID: <457D980E.7030803@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>>> the default is wrong in that case and needs to be fixed.
>>> Agree?
>>>       
>> No (and already tried to explain why).
>>     
>
> Can't say I get it.
>
>   
Sasha - look at Roland's reply
Not clear why you are against something that will make the work of all 
users easier.

Tziporet


From robert.j.woodruff at intel.com  Mon Dec 11 09:45:14 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 09:45:14 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>

Steve wrote:
>NOTE:  This git tree is based on Linus's tree and I merged up to
>his latest on 12/8.  So it's past 2.6.19 and now depends on changes that
>are post 2.6.19 (the workqueue changes).

>Steve.

Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels,
or do you plan on providing backport patches for earlier kernels?

woody


From rdreier at cisco.com  Mon Dec 11 09:45:26 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 11 Dec 2006 09:45:26 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061211082410.GB29276@mellanox.co.il> (Michael S.
	Tsirkin's message of "Mon, 11 Dec 2006 10:24:10 +0200")
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
	<20061210134137.GL29174@mellanox.co.il>
	<20061211082410.GB29276@mellanox.co.il>
Message-ID: <adad56q1g3t.fsf@cisco.com>

 > BTW, Roland, could you give me some indication on whether this
 > has a chance getting into 2.6.20? If yes I'll stop writing new code
 > and focus on polishing this.

I think we could probably merge it but maybe it's better to put it in
-mm for a cycle given that it's new and not too many people have
looked at it yet.  And I still haven't gotten comfortable with the way
CM is enabled.

 - R.


From ralph.campbel at qlogic.com  Mon Dec 11 10:02:26 2006
From: ralph.campbel at qlogic.com (Ralph Campbell)
Date: Mon, 11 Dec 2006 10:02:26 -0800 (PST)
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <457BD18D.7000403@voltaire.com>
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
Message-ID: <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>

> Ralph Campbell wrote:
>> This version of the patch fixes ipath_sg_dma_address() and
>> updates the comments for ipath_dma.c as Or Gerlitz
>> suggested.
>
>> This patch implements the interposing DMA mapping functions to allow
>> support for IOMMUs and remove the dependence on phys_to_virt() and
>> bus_to_virt().
>
> Ralph,
>
> The patch seems ready modulo the resolution of whether you implement the
> addresses returned by the ipath ib_dma_map_xxx code as keys into a SW
> IOTLB (which means you return dma_address_t and not u64 but assign it
> ipath semantics) or choose a different path to follow (ie assume the
> problem exists only under the unsupported by ipath 32bit / high-mem
> config, do nothing, etc) - what ever you set with Roland.
>
> Or.

I would like to see this last set of patches integrated as is.
I would like to get more experience with the current implementation
before extending it to support other configurations.


From mst at mellanox.co.il  Mon Dec 11 10:07:46 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 20:07:46 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <adad56q1g3t.fsf@cisco.com>
References: <adad56q1g3t.fsf@cisco.com>
Message-ID: <20061211180746.GD20344@mellanox.co.il>

>  > BTW, Roland, could you give me some indication on whether this
>  > has a chance getting into 2.6.20? If yes I'll stop writing new code
>  > and focus on polishing this.
> 
> I think we could probably merge it but maybe it's better to put it in
> -mm for a cycle given that it's new and not too many people have
> looked at it yet.

Hmm. People here in openib community don't seem to look at, or run -mm kernels, so
I don't think this will buy us much - it'll just create work for me.

No?

> And I still haven't gotten comfortable with the way CM is enabled.

Are you still worried someone might turn it on by default?

I'm actively looking at fixing multicast - it's just unlikely to be ready this
week. Enabling logic is a small part of the code - maybe code can be merged, and
enabling tweaked post -rc1?

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 10:14:29 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 20:14:29 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061211180746.GD20344@mellanox.co.il>
References: <adad56q1g3t.fsf@cisco.com> <20061211180746.GD20344@mellanox.co.il>
Message-ID: <20061211181429.GE20344@mellanox.co.il>

> >  > BTW, Roland, could you give me some indication on whether this
> >  > has a chance getting into 2.6.20? If yes I'll stop writing new code
> >  > and focus on polishing this.
> > 
> > I think we could probably merge it but maybe it's better to put it in
> > -mm for a cycle given that it's new and not too many people have
> > looked at it yet.
> 
> Hmm. People here in openib community don't seem to look at, or run -mm kernels, so
> I don't think this will buy us much - it'll just create work for me.
> 
> No?

And it's not like it's such a lot of code, either, is it?
So fixing it up even in major ways will be possible later in RC cycle.

-- 
MST


From mshefty at ichips.intel.com  Mon Dec 11 09:54:18 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 11 Dec 2006 09:54:18 -0800
Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work
 with proposed 2.6.20 kernel CMA
In-Reply-To: <457BDF15.6090608@voltaire.com>
References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
	<457BDF15.6090608@voltaire.com>
Message-ID: <457D9B4A.6010507@ichips.intel.com>

> patch made over your patch, can you please queue this somewhere so it 
> will not be forgotten?

Can you just send a signed-off-by line?  I'll add the patch to the librdmacm 
multicast branch.

Thanks,
- Sean


From mshefty at ichips.intel.com  Mon Dec 11 10:20:29 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Mon, 11 Dec 2006 10:20:29 -0800
Subject: [openib-general] [PATCH] - ucma updates for miscdev changes
In-Reply-To: <1165788273.25243.8.camel@linux-q667.site>
References: <1165788273.25243.8.camel@linux-q667.site>
Message-ID: <457DA16D.3010604@ichips.intel.com>

> As part of merging up to linus's tree as of 12/8/2006, I had to change
> ucma.c to support changes in the miscdevice stuff.  Below is a patch for
> this.  In addition to this change, I had to merge your ucma patches to
> get them to apply.  Nothing functional changed, I don't think, but some
> of the changes in your tree are already in linus's tree, so those
> patches were ignored.  And one didn't apply cleanly and I had to fix it
> manually.    
> 
> You can see these changes including the patch below as a single patch in
> git://staging.openfabrics.org/~swise/cxgb3.git commit number:
> d1ac2e74680d61a5e87165e1c6b4cec44533f2bd.

Thanks - I'll take a look at this.  My intention is to follow the same process
that we had been following and keep my tree in sync with the latest kernel
release only, unless I need a more updated branch.

- Sean


From sashak at voltaire.com  Mon Dec 11 10:40:19 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 20:40:19 +0200
Subject: [openib-general] cannot clone librdmacm
In-Reply-To: <457D980E.7030803@dev.mellanox.co.il>
References: <20061211152801.GC465@sashak.voltaire.com>
	<20061211152552.GD15870@mellanox.co.il>
	<20061211171601.GG465@sashak.voltaire.com>
	<20061211173459.GC20344@mellanox.co.il>
	<457D980E.7030803@dev.mellanox.co.il>
Message-ID: <20061211184019.GK465@sashak.voltaire.com>

On 19:40 Mon 11 Dec     , Tziporet Koren wrote:
> Michael S. Tsirkin wrote:
> >>>the default is wrong in that case and need to be fixed.
> >>>Agree?
> >>>      
> >>No (and already tried to explain why).
> >>    
> >
> >Can't say I get it.
> >
> >  
> Sasha - look at Roland's reply
> Not clear why you are against something that will make the work of all
> users easier.

I'm absolutely not against something that will make the work of all
users easier.

I'm just against this specific thing - making the default post-update
hook template executable - because I don't see it as a significant
improvement for users, and OTOH it seems to me a sort of violation of
the user's ownership of the repository.

As a user I would prefer not to have such "surprises" as hooks executed
by default.
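
A minimal sketch of the opt-in alternative, assuming the hook template is
shipped non-executable and the repository owner turns it on explicitly.
The helper name enable_post_update is hypothetical, and the hooks/post-update
path assumes a bare repository layout:

```shell
#!/bin/sh
# Sketch only: enable_post_update is a hypothetical helper name.
# Make the post-update hook of ONE repository executable, as an explicit
# choice by its owner rather than a default.
enable_post_update() {
    repo=$1                          # path to a bare repository
    hook="$repo/hooks/post-update"
    if [ ! -f "$hook" ]; then
        echo "no hook at $hook" >&2
        return 1
    fi
    chmod +x "$hook"
}
```

Usage would be something like "enable_post_update /path/to/project.git",
run once by whoever owns that repository.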

Sasha


From swise at opengridcomputing.com  Mon Dec 11 10:34:28 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 Dec 2006 12:34:28 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>
Message-ID: <1165862068.4020.14.camel@stevo-desktop>

On Mon, 2006-12-11 at 09:45 -0800, Woodruff, Robert J wrote:
> Steve Wrote,
> >NOTE:  This git tree is backed against Linus's tree and I merged up to
> >his latest on 12/8.  So it's past 2.6.19 and now depends on changes that
> >are post 2.6.19 (the workqueue changes).
> 
> >Steve.
> 
> Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels
> or do you plan on providing backport patches for earlier kernels ?
> 

I was really hoping it would work for both kernels, but now with the
workqueue changes, I'll have to think about a 2.6.19 patch.  However, my
top priority is getting this tested and into kernel.org...


Steve.


From robert.j.woodruff at intel.com  Mon Dec 11 10:52:13 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 10:52:13 -0800
Subject: [openib-general] [ANNOUNCE]OFA Component Early Integration Test Tree
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB418@orsmsx418.amr.corp.intel.com>

At the SC'06 developer's summit in Tampa, we had some discussion of having
an early integration-test tree (kind of like the MM tree) for early testing
of new infiniband components. I have started to look at putting together
such a tree that contains Sean's latest uCMA, local sa cache, multicast
code, and the IPoIB_CM code, at least for my own testing. If others are
interested in trying this stuff out before it gets into OFED or the
kernel.org tree, they can clone my ofa-integration tree at,

~woody/scm/ofa-integration

The merged code is in the integration-test branch.

My merge script also creates a single patch from the integration-test
branch using

git-diff linux-2.6.19 integration-test > ./infiniband-ofa-mmddyy-for-linux-2.6.19.patch

that people could take and just apply against a 
stock 2.6.19 kernel. These are in my top level directory,
~woody

Not sure if these patches would be useful to anyone else, but if they 
are, we could figure out a way to publish them for use by the
greater community, or people could always just generate the patch
themselves using git-diff. 
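
A minimal sketch of that patch-generation step, for anyone who wants to
produce the file themselves. The helper names patch_name and make_patch are
hypothetical, the linux-2.6.19 ref is assumed to exist in the clone, and
newer git releases spell the command "git diff" rather than "git-diff":

```shell
#!/bin/sh
# Sketch only: patch_name and make_patch are hypothetical helper names.

# Build the patch file name from a date stamp (mmddyy) and a kernel version.
patch_name() {
    stamp=$1
    version=$2
    printf 'infiniband-ofa-%s-for-linux-%s.patch\n' "$stamp" "$version"
}

# Write the diff between a base ref and a branch into a dated patch file.
# Must be run inside a clone of the tree; prints the file name it wrote.
make_patch() {
    base_ref=$1     # e.g. linux-2.6.19 (assumed to exist in the clone)
    branch=$2       # e.g. integration-test
    out=$(patch_name "$(date +%m%d%y)" "${base_ref#linux-}")
    git diff "$base_ref" "$branch" > "./$out"
    echo "$out"
}
```

For example, "make_patch linux-2.6.19 integration-test" run today would
write a file named after today's date in the current directory.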

So far I have tested the 
~woody/infiniband-ofa-1207006-for-linux-2.6.19.patch 
which contains Sean's local_sa cache, uCMA, and the IPoIB_CM code.

The ~woody/infiniband-ofa-1211006-for-linux-2.6.19.patch has the local_sa
cache, uCMA, IPoIB_CM, and Sean's multicast code. I just created this one;
the merge went ok, but I have yet to test it.

If other people would like to add some new kernel code to this tree to
test with other components that are under development, I can add your code.
What I need to do this is for you to publish a git tree based on 2.6.19
(the release kernel) that has your code (and only your code) in a branch,
such that one could create a self contained patch that would apply to a
stock 2.6.19 kernel using git-diff. e.g.,
git-diff linux-2.6.19 mybranch
would create a self contained patch of your code.
This should allow me to easily merge the code into my tree.
If your git tree contains changes to other core infiniband code or is
based on something other than the release 2.6.19 kernel.org kernel, I
cannot use it, since I cannot easily merge it with the other experimental
components that are based on 2.6.19.

If people think that this would be of value to the wider community,
I will add something to the wiki to explain how to get the tree and the
userspace code that matches this kernel code.

woody

From venkatesh.babu at 3leafnetworks.com  Mon Dec 11 11:03:28 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 11 Dec 2006 11:03:28 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165672098.26559.43885.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
	<1165628315.26559.12385.camel@hal.voltaire.com>
	<457A1E90.5040606@3leafnetworks.com>
	<1165666352.26559.39788.camel@hal.voltaire.com>
	<1165672098.26559.43885.camel@hal.voltaire.com>
Message-ID: <457DAB80.7010501@3leafnetworks.com>


 Yes, the problem is noticed on port 1 also. It is random. Sometimes 
with port 1 and sometimes with port 2.

 I will try with only one "port 1" subnet.

 VBabu

Hal Rosenstock wrote:

>On Sat, 2006-12-09 at 07:12, Hal Rosenstock wrote:
>  
>
>>One more thing:
>>
>>When you upgraded to OFED 1.2, did you build and install the management
>>libraries (libibcommon, libibumad are important here and libibmad for
>>diags) ?
>>    
>>
>
>Does the problem always occur on the "second" subnet (port 2's subnet)
>or does it ever occur on port 1's subnet ?
>
>Can you totally not configure the "port 1" subnet on all machines (and
>OpenSM on the port 1's where that runs) and see if it is reproducible ?
>
>Thanks.
>
>-- Hal
>
>  
>


From venkatesh.babu at 3leafnetworks.com  Mon Dec 11 11:14:00 2006
From: venkatesh.babu at 3leafnetworks.com (Venkatesh Babu)
Date: Mon, 11 Dec 2006 11:14:00 -0800
Subject: [openib-general] Unreliable OpemSM failover
In-Reply-To: <1165666352.26559.39788.camel@hal.voltaire.com>
References: <1164117837.4381.48210.camel@hal.voltaire.com>
	<456B7CC8.5060806@3leafnetworks.com>
	<1164674885.11808.760.camel@hal.voltaire.com>
	<4579E333.4000901@3leafnetworks.com>
	<1165617878.26559.4952.camel@hal.voltaire.com>
	<4579F8E6.3040604@3leafnetworks.com>
	<1165622233.26559.8108.camel@hal.voltaire.com>
	<457A0389.7030103@3leafnetworks.com>
	<1165625283.26559.10270.camel@hal.voltaire.com>
	<457A0B62.2060501@3leafnetworks.com>
	<1165628315.26559.12385.camel@hal.voltaire.com>
	<457A1E90.5040606@3leafnetworks.com>
	<1165666352.26559.39788.camel@hal.voltaire.com>
Message-ID: <457DADF8.7010002@3leafnetworks.com>

Hal Rosenstock wrote:

>I was interested in the one on Node1 when it appeared to be trying to
>exit (which it shouldn't be but is) and the other threads don't seem to
>terminate.
>  
>
  Let me see if I can reproduce it again. First thing I will capture the
core file, so that it can be investigated later.

>  
>
>>  How do I findout the thread_state value ?
>>    
>>
>
>It's a variable in the SM structure (in the SM thread).
>  
>
  I found this variable in osm_vl15intf.h:osm_vl15_t. I will get this 
thread_state value next time.

>One more thing:
>
>When you upgraded to OFED 1.2, did you build and install the management
>libraries (libibcommon, libibumad are important here and libibmad for
>diags) ?
>  
>
  I upgraded from OFED 1.0 to OFED 1.1 (not OFED 1.2). I built all these
libraries and installed them.

 VBabu


From robert.j.woodruff at intel.com  Mon Dec 11 11:11:27 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 11:11:27 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB4B5@orsmsx418.amr.corp.intel.com>

Steve Wrote. 

>I was really hoping it would work for both kernels, but now with the
>workqueue changes, I'll have to think about a 2.6.19 patch.  However, my
>top priority is getting this tested and into kernel.org...


>Steve.

Understood. I would like to try to include this driver in my OFA early
integration-test tree, but to do so, I would need you to publish a
branch based on 2.6.19, not linus's tree, since all of the rest of the
components are based on 2.6.19, rather than linus's current tree.
If you want this code in my early integration test tree so that others
in the OFA community can give it a try before it goes to kernel.org, I
would be willing to try to include it. If you don't have the time right
now, I understand.


From swise at opengridcomputing.com  Mon Dec 11 11:13:06 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Mon, 11 Dec 2006 13:13:06 -0600
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>
Message-ID: <1165864386.6867.2.camel@stevo-desktop>

On Mon, 2006-12-11 at 09:45 -0800, Woodruff, Robert J wrote:
> Steve Wrote,
> >NOTE:  This git tree is backed against Linus's tree and I merged up to
> >his latest on 12/8.  So it's past 2.6.19 and now depends on changes that
> >are post 2.6.19 (the workqueue changes).
> 
> >Steve.
> 
> Do you plan on only supporting the Chelsio driver for 2.6.20+ kernels
> or do you plan on providing backport patches for earlier kernels ?
> 
> woody

Hey Roland, is there a preferred way to handle this?  I.e., what's the best
way of keeping a 2.6.19 based patch set while also trying to merge your
patches into the latest from linus's tree?

I guess I can create a branch with a HEAD at 2.6.19 and back-port my
latest patch set.  Is that the best way?  Maybe a for-ofed branch?

Steve.


From robert.j.woodruff at intel.com  Mon Dec 11 11:19:00 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 11:19:00 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB4E2@orsmsx418.amr.corp.intel.com>

Michael wrote,
 >> BTW, Roland, could you give me some indication on whether this
 >> has a chance getting into 2.6.20? If yes I'll stop writing new code
 >> and focus on polishing this.

>I think we could probably merge it but maybe it's better to put it in
>-mm for a cycle given that it's new and not too many people have
>looked at it yet.  And I still haven't gotten comfortable with the way
>CM is enabled.

>- R.

I think it might be good for others in the OFA community to try this out
before we decide it is ready for the kernel. I tried it out over the
weekend, running Intel MPI over IPoIB_CM, and with default MTU settings,
it did not cause any problems on my small 2 node cluster. Might be good
however for someone to load this up on a larger cluster and test it. I did
notice that unless I made the MTU really big (16K), there was not much
benefit (if any) with the default MTU size.
I also noticed that when I set the MTU to 16K and ran some stressful MPI
tests, my system seemed to become unresponsive, as if IPoIB was taking up
too much kernel memory. Thus, I think it best for others to play with this
a bit before it is submitted upstream.

my 2 cents,
woody


From divy at chelsio.com  Mon Dec 11 11:25:00 2006
From: divy at chelsio.com (Divy Le Ray)
Date: Mon, 11 Dec 2006 11:25:00 -0800
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <1165851389.13419.3.camel@stevo-desktop>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<adafybn2i7n.fsf@cisco.com> <1165851389.13419.3.camel@stevo-desktop>
Message-ID: <457DB08C.8070709@chelsio.com>

Steve Wise wrote:
> On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote:
>   
>> I haven't seen any evidence of the corresponding ethernet NIC driver
>> being merged for 2.6.20 (which is a prerequisite, right).
>>
>> What's the status of that?
>>
>>     
>
> It is on its third or fourth round of review.  The last driver posted on
> 12/7, was merged up to linus's latest tree probably as of 12/7.  I know
> the comments said it was against 2.6.19, but it was really linus's
> latest.
>
> Divy, can you expand on this?
>   
Steve, the patch for the Chelsio T3 driver was posted against
Linus' tree indeed.

-bash-3.00$ cat .git/refs/heads/origin
0215ffb08ce99e2bb59eca114a99499a4d06e704

It incorporated Stephen's feedback.
The comments I received since then concern minor coding style glitches.
I will fix them, the driver functionality should remain unchanged however.

Cheers,
Divy


>
> Steve.
>
>   


From robert.j.woodruff at intel.com  Mon Dec 11 11:40:35 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 11:40:35 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB543@orsmsx418.amr.corp.intel.com>

Woody wrote,
>I also noticed that when I set the MTU to 16K and ran some stressful MPI
>tests, that my system seemed to get un-responsive like IPoIB was taking up
>too much kernel memory.

Correction, I saw the strange behavior when I had the MTU set to 64K,
not 16K, and I cannot be sure that it was IPoIB_CM that was causing the
problem, so I think it would be good for others to give this some airtime
and report their experiences to the list.

woody


From mst at mellanox.co.il  Mon Dec 11 11:41:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 21:41:11 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014EB4E2@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014EB4E2@orsmsx418.amr.corp.intel.com>
Message-ID: <20061211194111.GB27010@mellanox.co.il>

>  >> BTW, Roland, could you give me some indication on whether this
>  >> has a chance getting into 2.6.20? If yes I'll stop writing new code
>  >> and focus on polishing this.
> 
> >I think we could probably merge it but maybe it's better to put it in
> >-mm for a cycle given that it's new and not too many people have
> >looked at it yet.  And I still haven't gotten comfortable with the way
> >CM is enabled.
> 
> >- R.
> 
> I think it might be good for others in the OFA community to try this out before
> we decide it is ready for the kernel. I tried it out over the weekend, running
> Intel MPI over IPoIB_CM, and with default MTU settings, it did not cause any
> problems on my small 2 node cluster. Might be good however for someone to load
> this up on a larger cluster and test it.

IMO, we have time after -rc1 to fix any bugs.
The feature *is* marked experimental after all, and has 0 impact
on code when disabled at compile time.
So if you want rock-stable, just turn it off.


> I did notice that unless I made the MTU
> really big (16K), there was not much benefit (if any) with the default MTU size.

Right. My observation too. The whole point of IPoIB CM
is to enable high MTU values. 64K is what works really well.

> I also noticed that when I set the MTU to 16K and ran some stressful MPI tests,
> that my system seemed to get un-responsive like IPoIB was taking up too much
> kernel memory.

Could you enable debug and try again? Maybe you have send errors.

My guess would be you are getting RQ underruns and QPs are getting closed and
reopened (and if DREQs are lost for some reason, which shouldn't happen on back
to back but seems to due to some issue in our MAD layer, we could be
getting stale connections which aren't currently cleaned up - it's on
my TODO).

I have a couple of ideas on how to fix it better - e.g. detect RNR NACK
and cycle the QP through RTS/INIT/RTR/RTS -
but the simplest workaround for now would be just to have a high MTU
or increase the RX ring size via IPoIB module option.

Can you try this too, and let me know?

> Thus, I think it best for others to play with this a bit before
> it is submitted upstream.
> 
> my 2 cents,
> woody

I don't know, really - it's an option after all.
Given that it doesn't cause problems for people that don't enable it,
keeping code out of kernel until it's totally robust seems wrong -
instead of debugging/fixing issues I'll have to spend time
keeping the code up to date with upstream.

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 11:44:43 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 21:44:43 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C014EB543@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C014EB543@orsmsx418.amr.corp.intel.com>
Message-ID: <20061211194443.GC27010@mellanox.co.il>

> >I also noticed that when I set the MTU to 16K and ran some stressful MPI tests,
> >that my system seemed to get un-responsive like IPoIB was taking up too
> >much kernel memory. 
> 
> Correction, I saw the strange behavior when I had the MTU set to 64K,
> not 16K, and I cannot be sure that it was IPoIB_CM that was causing the
> problem, so I think it would be good for others to give this some airtime
> and report their experiences to the list.

That's the setup I'm mostly testing with, and I haven't seen this yet.

Are you running this together with Sean's multicast patches and the sa cache?

Are you seeing something in the log? What about when you
set debug_level to 1? Does increasing the RQ size help?

-- 
MST


From divy at chelsio.com  Mon Dec 11 11:49:53 2006
From: divy at chelsio.com (Divy Le Ray)
Date: Mon, 11 Dec 2006 11:49:53 -0800
Subject: [openib-general] [PATCH v3 00/13] 2.6.20 Chelsio T3 RDMA Driver
In-Reply-To: <457DB08C.8070709@chelsio.com>
References: <20061210223244.27166.36192.stgit@dell3.ogc.int>
	<adafybn2i7n.fsf@cisco.com> <1165851389.13419.3.camel@stevo-desktop>
	<457DB08C.8070709@chelsio.com>
Message-ID: <457DB661.6060102@chelsio.com>

Divy Le Ray wrote:
> Steve Wise wrote:
>> On Sun, 2006-12-10 at 20:02 -0800, Roland Dreier wrote:
>>  
>>> I haven't seen any evidence of the corresponding ethernet NIC driver
>>> being merged for 2.6.20 (which is a prerequisite, right).
>>>
>>> What's the status of that?
>>>
>>>     
>>
>> It is on its third or fourth round of review.  The last driver posted on
>> 12/7, was merged up to linus's latest tree probably as of 12/7.  I know
>> the comments said it was against 2.6.19, but it was really linus's
>> latest.
>>
>> Divy, can you expand on this?
>>   
> Steve, the patch for the Chelsio T3 driver was posted against
> Linus' tree indeed.
>
> -bash-3.00$ cat .git/refs/heads/origin
> 0215ffb08ce99e2bb59eca114a99499a4d06e704
I meant
-bash-3.00$ cat .git/refs/heads/master
9eba2b0ba067ce9745e575e5ea2e97a5d7d61bef

>
> It incorporated Stephen's feedback.
> The comments I received since then concern minor coding style glitches.
> I will fix them, the driver functionality should remain unchanged 
> however.
>
> Cheers,
> Divy
>
>
>>
>> Steve.
>>
>>   
>


From eitan at mellanox.co.il  Mon Dec 11 12:31:17 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 11 Dec 2006 22:31:17 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FA71@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FA71@xmb-sjc-216.amer.cisco.com>
Message-ID: <457DC015.9050207@mellanox.co.il>

Hi Scott,

Scott Weitzenkamp (sweitzen) wrote:
> It's not clear to me.
>
> Are you changing the libsdp.conf location or not?
>   
Currently the only feedback I got says that I need not install
libsdp.conf at all.
I only need to install an example somewhere (I do not know where - maybe
docs?)
Instead I am going to change the default libsdp behavior to that of the
default config.

Do you have some insight into this issue? Any preferences?

Thanks

EZ
> Can you define "sanely"?
>
> Scott
>
>   
>> -----Original Message-----
>> From: openib-general-bounces at openib.org 
>> [mailto:openib-general-bounces at openib.org] On Behalf Of Eitan Zahavi
>> Sent: Monday, December 11, 2006 2:27 AM
>> To: Michael S. Tsirkin
>> Cc: Nimrod Gindi; OPENIB GENERAL
>> Subject: Re: [openib-general] libsdp: RFC changing 
>> libsdp.conf location
>>
>> Hi Michael,
>>
>> Thanks. This proposal is simple and clear to me.
>> Let's wait a day and see if anybody else have other ideas.
>>
>> Thanks
>>
>> Eitan
>>
>> Michael S. Tsirkin wrote:
>>     
>>>> BTW: libsdp.conf used to be overwritten in previous install.
>>>> I have fixed the makefile to avoid that and instead create a
>>>> new file with install date under the same directory.
>>>>     
>>>>         
>>> Here's a simple proposal that will address this issue:
>>> - Make libsdp behave sanely when no libsdp.conf file is present.
>>>   Do not install anything in default location in make install.
>>>
>>> - in make install, copy the example configuration file into
>>>   libsdp.conf.example. Add a line to the top of it saying
>>>   "rename this file to libsdp.conf to make libsdp use it".
>>>
>>>   
>>>       


From adit.262 at gmail.com  Mon Dec 11 12:55:34 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Mon, 11 Dec 2006 15:55:34 -0500
Subject: [openib-general] Configuring Guest VMs to use Infiniband interfaces
In-Reply-To: <d2ad857f0612111255p32890e40l1a73022db473d45a@mail.gmail.com>
References: <d2ad857f0612111255p32890e40l1a73022db473d45a@mail.gmail.com>
Message-ID: <d2ad857f0612111255p7fc5ff5dt93ef95809ec3353d@mail.gmail.com>

Hi,

Has anyone worked with the xen-smartio repository?

I was using it and had a few questions about the configuration of the
guest VM:

1) Is there a special config line to assign an IB virtual interface to guests?
If I say vif= [' '] in the guest config, the interface shows up as eth0
and not ib0 in the guest.
I've changed the network script in network-bridge to use the IB
interface (ib1) as the bridge.
2) How does xen mux/demux over the IB interface? Does it use the same
ethernet bridging? If so, how does one get it to work?

Thanks,
Adit Ranadive


From sashak at voltaire.com  Mon Dec 11 13:07:08 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 11 Dec 2006 23:07:08 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210215033.GC21155@sashak.voltaire.com>
References: <20061210215033.GC21155@sashak.voltaire.com>
Message-ID: <20061211210708.GA25052@sashak.voltaire.com>

On 23:50 Sun 10 Dec     , Sasha Khapyorsky wrote:
> Hi,
> 
> Recently I found this OFA 'Userspace Git Trees' downloading howto:
> 
> https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> 
> and thought that we could make it simpler for end-user to choose the
> "right" git tree just by adding one more series of symbolic links under
> /pub/scm. This links will point to the maintainer's "official" trees, and
> we will have only one such link per project.
> 
> So typical downloading howto for end-users will looks like:
> 
>   git clone git://staging.openfabrics.org/dapl
>   git clone git://staging.openfabrics.org/ibutils
>   git clone git://staging.openfabrics.org/imgen
>   ...
> 
> instead of
> 
>   git clone git://staging.openfabrics.org/~ardavis/dapl
>   git clone git://staging.openfabrics.org/~eitan/ibutils
>   git clone git://staging.openfabrics.org/~mst/imgen
>   ...
> 
> as it is now.
> 
> 
> To illustrate this I've added already couple of such symbolic links
> under /pub/scm and it is visible now via gitweb:
> 
>   http://staging.openfabrics.org/git
> 
> Comments, objections?

Don't see many supporters up to now so I'm going to remove this "demo"
soon. If anybody cares - this is the last call!
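
For reference, the symbolic-link scheme quoted above could be sketched as
follows. The helper name link_official is hypothetical, /pub/scm is the
root named in the proposal, and how each maintainer's tree is laid out
under that root is an assumption here:

```shell
#!/bin/sh
# Sketch only: link_official is a hypothetical helper name.

# Create ROOT/PROJECT as a symbolic link to the maintainer's official tree.
# TARGET is interpreted relative to ROOT unless it is an absolute path.
link_official() {
    root=$1       # e.g. /pub/scm
    target=$2     # maintainer's tree (layout under ROOT is assumed)
    project=$3    # short public name, e.g. dapl
    ln -s "$target" "$root/$project"
}
```

One link per project keeps the short name pointing at a single "official"
tree, while the per-maintainer URLs keep working unchanged.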

Sasha


From halr at voltaire.com  Mon Dec 11 12:59:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Dec 2006 15:59:23 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457AC99E.8050402@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<1165617195.26559.4435.camel@hal.voltaire.com>
	<457AC99E.8050402@mellanox.co.il>
Message-ID: <1165870759.21606.18477.camel@hal.voltaire.com>

On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> Hi Eitan,
> >>>
> >>> Just wanted to close the loop on the OpenSM issues of the last couple
> >>> days.
> >>>
> >>> 1. When can you supply an OpenSM verbose log for the InformInfo
> >>> subscribe problem you reported earlier today ? Failing that, I don't
> >>> know how to reproduce this.
> >>>   
> >>>       
> >> Attached
> >>     
> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure.

Any idea on when you will have a chance to look into this ?

> >>> 4. I encourage you to look at and comment on the OpenSM patches rather
> >>> than waiting for them to be in the tree.
> >>>   
> >>>       
> >> I am sure you did not mean to, but now I have to admit my limited skills 
> >> in catching bugs by reading patches :-( .
> >>     
> >
> > Not just read, but they are there to try out as well.
> >   
> I will need an automatic flow for that sake. I can not keep up with the 
> amount of patches manually.
> But I do not know how to automatically convert the mails into patches 
> into a tree.
> > You could try out the patches and do the same thing before they are
> > committed.
> >
> >   
> I have automation based on the committed tree that pull it (git trem) , 
> compile and run regression.
> Actually this is how all other code is handled too.

Are you referring to OFED ?

In the case of OFED, where do those "special" trees/branches come from ?

-- Hal


From jlentini at netapp.com  Mon Dec 11 13:09:47 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 11 Dec 2006 16:09:47 -0500 (EST)
Subject: [openib-general] Configuring Guest VMs to use Infiniband
 interfaces
In-Reply-To: <d2ad857f0612111255p7fc5ff5dt93ef95809ec3353d@mail.gmail.com>
References: <d2ad857f0612111255p32890e40l1a73022db473d45a@mail.gmail.com>
	<d2ad857f0612111255p7fc5ff5dt93ef95809ec3353d@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0612111607590.20796@jlentini-linux.nane.netapp.com>


On Mon, 11 Dec 2006, Adit Ranadive wrote:

> Has anyone worked with the xen-smartio repository?

Novell has made substantial improvements to the xen-smartio code. They 
made a presentation at the last workshop:

http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf


From robert.j.woodruff at intel.com  Mon Dec 11 13:11:21 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Mon, 11 Dec 2006 13:11:21 -0800
Subject: [openib-general] userspace git trees
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C014EB6AA@orsmsx418.amr.corp.intel.com>

Sasha wrote,
>> Comments, objections?

>Don't see many supporters up to now so I'm going to remove this "demo"
>soon. If anybody cares - this is the last call!

>Sasha

I don't have any preference; either way is fine.

woody


From sweitzen at cisco.com  Mon Dec 11 13:42:36 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 11 Dec 2006 13:42:36 -0800
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>

> > Are you changing the libsdp.conf location or not?
> >   
> Currently the only feedback I got say that I need not install 
> libsdp.conf at all.
> I only need to install an example somewhere (I do not know 
> where - maybe 
> docs?)
> Instead I am going to change the default libsdp behavior to 
> that of the 
> default config.
> 
> Do you have some insight into this issue? Any preferences?

I strongly disagree with not installing libsdp.conf at all.  On my RHEL4
system I count 57 /etc/*.conf files.  Most of these I have never
changed, yet they are useful references.

I'm OK with leaving libsdp.conf in /usr/local/ofed/etc.

How do other RPM packages with .conf file handle upgrading the .conf
file?

Scott


From mst at mellanox.co.il  Mon Dec 11 13:51:41 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 11 Dec 2006 23:51:41 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
Message-ID: <20061211215141.GB4235@mellanox.co.il>

> I strongly disagree with not installing libsdp.conf at all.

Just saying "I strongly disagree" does not make for a strong argument :)

Why do you (strongly) want it installed if libsdp will work fine without,
in a way identical to what it is doing with default libsdp.conf today?

-- 
MST


From rdreier at cisco.com  Mon Dec 11 14:02:34 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 11 Dec 2006 14:02:34 -0800
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
	(Scott Weitzenkamp's message of "Mon, 11 Dec 2006 13:42:36 -0800")
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
Message-ID: <ada8xhe1479.fsf@cisco.com>

 > How do other RPM packages with .conf file handle upgrading the .conf
 > file?

You mark the config file with %config or %config(noreplace) in the
spec file.  With %config, RPM will move the old config to .rpmsave (if
the old config was edited) and with %config(noreplace), RPM will put
the new config file in .rpmnew (if the old file was edited).
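
A toy model of the two behaviours just described, only to make the
difference concrete. This is not RPM itself; the function name
upgrade_config and its arguments are hypothetical, and the real mechanism
is the %config / %config(noreplace) marking in the spec file:

```shell
#!/bin/sh
# Toy model only: upgrade_config is a hypothetical name, not an RPM command.
# upgrade_config MODE LIVE NEW EDITED
#   MODE   "replace" (plain %config) or "noreplace" (%config(noreplace))
#   LIVE   config file currently on disk
#   NEW    config file shipped by the new package
#   EDITED "yes" if the admin changed LIVE since the last install
upgrade_config() {
    mode=$1
    live=$2
    new=$3
    edited=$4
    if [ "$edited" != yes ]; then
        cp "$new" "$live"              # untouched config: simply updated
    elif [ "$mode" = noreplace ]; then
        cp "$new" "$live.rpmnew"       # keep the edited file, park the new one
    else
        mv "$live" "$live.rpmsave"     # park the edited file, install the new one
        cp "$new" "$live"
    fi
}
```

Either way an edited file is never silently lost; the choice is only which
copy ends up under the live name.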

I definitely think RPM packages should install sane defaults into
their /etc/*.conf files.

As a side note it doesn't make any sense to me for OFED RPMs to put
stuff in /usr/local/ofed rather than the standard prefix.

 - R.


From sweitzen at cisco.com  Mon Dec 11 14:04:29 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 11 Dec 2006 14:04:29 -0800
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC66@xmb-sjc-216.amer.cisco.com>


> > I strongly disagree with not installing libsdp.conf at all.
> 
> Just saying "I strongly disagree" does not make for a strong 
> argument :)
> 
> Why do you (strongly) want it installed if libsdp will work 
> fine without,
> in a way identical to what it is doing with default libsdp.conf today?

On my RHEL4 system I count 57 /etc/*.conf files.  Most of these I have
never changed, yet they are useful references.  This is more intuitive to
me than having to guess or search for how to configure libsdp.

We install libsdp.conf today, and I don't see any good reason to not
keep doing so.

Scott


From mst at mellanox.co.il  Mon Dec 11 14:03:12 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 00:03:12 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC40@xmb-sjc-216.amer.cisco.com>
Message-ID: <20061211220312.GB8725@mellanox.co.il>

> On my RHEL4
> system I count 57 /etc/*.conf files.  Most of these I have never
> changed, yet they are useful references.

We can have a file named libsdp.conf.example, with the first line:

# this is an example libsdp configuration file.
# to make it active, rename it libsdp.conf: mv libsdp.conf.example libsdp.conf

-- 
MST


From sweitzen at cisco.com  Mon Dec 11 14:07:42 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 11 Dec 2006 14:07:42 -0800
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC6D@xmb-sjc-216.amer.cisco.com>


> > On my RHEL4
> > system I count 57 /etc/*.conf files.  Most of these I have never
> > changed, yet they are useful references.
> 
> We can have a file named libsdp.conf.example, with the first line:
> 
> # this is an example libsdp configuration file.
> # to make it active, rename it libsdp.conf: mv 
> libsdp.conf.example libsdp.conf
> 
> -- 
> MST
> 

I think this is less useful than just having the .conf file there, and I
only see one example of this on RHEL4.

Scott


From halr at voltaire.com  Mon Dec 11 14:07:06 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 11 Dec 2006 17:07:06 -0500
Subject: [openib-general] openib-commits and git
Message-ID: <1165874816.21606.21357.camel@hal.voltaire.com>

Hi,

Some have requested the equivalent of what we had with svn with
openib-commits. 

The first question is what capabilities in this are desired. We don't
want to spend a lot of engineering time on this but it would be good to
know. Is a notification of the commit/push with the log sufficient or
does it need to look more like what svn provided (and include the changes
too) ?
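
As a rough illustration of the two options, the sketch below shows the
core of a notification hook in Python. It assumes the hook is fed lines
of the form "<old-sha> <new-sha> <refname>" on stdin, as git's
post-receive hook does; whether the git version on the server offers
that hook is an assumption, and the function names are hypothetical.
Mail delivery and newly created branches (where the old sha is all
zeros) are deliberately left out.

```python
def parse_updates(hook_input):
    """Split hook stdin into (old, new, refname) tuples, skipping bad lines."""
    updates = []
    for line in hook_input.splitlines():
        parts = line.split()
        if len(parts) == 3:
            updates.append(tuple(parts))
    return updates


def log_command(old, new, with_diff):
    """Build the git command whose output would become the mail body."""
    cmd = ["git", "log", "--stat"]
    if with_diff:
        # -p adds the full patch, which is closer to what svn commit mails had.
        cmd.append("-p")
    cmd.append("%s..%s" % (old, new))
    return cmd
```

With with_diff=False the mail carries only the log and a diffstat; with
with_diff=True it also carries the changes themselves.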

The other question is a policy one: Is it a reasonable default to enable
this for all the developers ? Do any of the developers object to this
policy ?

-- Hal


From sweitzen at cisco.com  Mon Dec 11 14:20:55 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Mon, 11 Dec 2006 14:20:55 -0800
Subject: [openib-general] openib-commits and git
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC83@xmb-sjc-216.amer.cisco.com>

I would like to see diffs, either inline in the commit email or via a
URL I can click on.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> Sent: Monday, December 11, 2006 2:07 PM
> To: openib-general at openib.org
> Cc: OpenFabricsEWG
> Subject: [openib-general] openib-commits and git
> 
> Hi,
> 
> Some have requested the equivalent of what we had with svn with
> openib-commits. 
> 
> The first question is what capabilities in this are desired. We don't
> want to spend a lot of engineering time on this but it would 
> be good to
> know. Is a notification of the commit/push with the log sufficient or
> does it need to look more what svn provided (and include the changes
> too) ?
> 
> The other question is a policy one: Is it a reasonable 
> default to enable
> this for all the developers ? Do any of the developers object to this
> policy ?
> 
> -- Hal
> 
> 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 


From adit.262 at gmail.com  Mon Dec 11 14:24:21 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Mon, 11 Dec 2006 17:24:21 -0500
Subject: [openib-general] Configuring Guest VMs to use Infiniband
 interfaces
In-Reply-To: <Pine.LNX.4.64.0612111607590.20796@jlentini-linux.nane.netapp.com>
References: <d2ad857f0612111255p32890e40l1a73022db473d45a@mail.gmail.com>
	<d2ad857f0612111255p7fc5ff5dt93ef95809ec3353d@mail.gmail.com>
	<Pine.LNX.4.64.0612111607590.20796@jlentini-linux.nane.netapp.com>
Message-ID: <d2ad857f0612111424h5ec2f261y2c094bad6bf18b56@mail.gmail.com>

Novell is planning those changes. Unfortunately the source tree at
http://xenbits.xensource.com/ext/xen-smartio.hg is still about 8 months
old.

Also, are the Mellanox 25208 HCAs compatible with the 23208 ones? I
know that the guest VMs use the HCA driver only for the 23208.
Unfortunately I have the 25208 ones. Will they still work in domU?

Thanks,
Adit

On 12/11/06, James Lentini <jlentini at netapp.com> wrote:
>
>
> On Mon, 11 Dec 2006, Adit Ranadive wrote:
>
> > Has anyone worked with the xen-smartio repository?
>
> Novell has made substantial improvements to the xen-smartio code. They
> made a presentation at the last workshop:
>
> http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf
>


From jlentini at netapp.com  Mon Dec 11 14:51:00 2006
From: jlentini at netapp.com (James Lentini)
Date: Mon, 11 Dec 2006 17:51:00 -0500 (EST)
Subject: [openib-general] Configuring Guest VMs to use Infiniband
 interfaces
In-Reply-To: <d2ad857f0612111424h5ec2f261y2c094bad6bf18b56@mail.gmail.com>
References: <d2ad857f0612111255p32890e40l1a73022db473d45a@mail.gmail.com>
	<d2ad857f0612111255p7fc5ff5dt93ef95809ec3353d@mail.gmail.com>
	<Pine.LNX.4.64.0612111607590.20796@jlentini-linux.nane.netapp.com>
	<d2ad857f0612111424h5ec2f261y2c094bad6bf18b56@mail.gmail.com>
Message-ID: <Pine.LNX.4.64.0612111748530.20796@jlentini-linux.nane.netapp.com>


On Mon, 11 Dec 2006, Adit Ranadive wrote:

> Novell is planning those changes unfortunately the source tree at
> http://xenbits.xensource.com/ext/xen-smartio.hg is still abt 8 months
> old..
> 
> Also are the mellanox 25208 hcas compatible with the 23208 ones? 

They are compatible in "Tavor" compatibility mode.

> I know that the guestVMs use the hca driver only for 23208.. 
> Unfortunately I have the 25208 ones will they still work in domU?

I'm not sure if the xen-smartio tree supports this.

> Thanks,
> Adit
> 
> On 12/11/06, James Lentini <jlentini at netapp.com> wrote:
> > 
> > 
> > On Mon, 11 Dec 2006, Adit Ranadive wrote:
> > 
> > > Has anyone worked with the xen-smartio repository?
> > 
> > Novell has made substantial improvements to the xen-smartio code. They
> > made a presentation at the last workshop:
> > 
> > http://openfabrics.org/conference/nov2006sc/xen-ib-presentation.pdf


From mlleinin at hpcn.ca.sandia.gov  Mon Dec 11 15:20:56 2006
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Mon, 11 Dec 2006 15:20:56 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>
References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>
Message-ID: <1165879256.19459.379.camel@localhost>

On Tue, 2006-12-05 at 12:22 +0000, Eric Barton wrote:
> Hi,
> 
> We'd dearly like some help to understand why we seem to be having
> performance issues with OFED.  When we run a lustre network bandwidth
> benchmark, we find significant performance degradation on OFED versus
> Voltaire...
> 
>              Premap (256 RDMA frags)     Map on demand (1 RDMA frag)
>              Voltaire  OFED  Ratio       Voltaire  OFED  Ratio 
> Writes MB/s  682       567   83 %        577       436   75 %
> Reads MB/s   658       554   84 %        555       432   77 %
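
For reference, the Ratio columns can be re-derived from the MB/s figures
in the table. The short Python check below is only a sketch (the
dictionary layout and names are made up); it suggests the percentages
were truncated rather than rounded.

```python
# (Voltaire MB/s, OFED MB/s) taken from the table above.
RESULTS = {
    ("writes", "premap"): (682, 567),
    ("writes", "map on demand"): (577, 436),
    ("reads", "premap"): (658, 554),
    ("reads", "map on demand"): (555, 432),
}


def ratio_percent(voltaire, ofed):
    """OFED bandwidth as a whole-number percentage of Voltaire bandwidth."""
    return 100 * ofed // voltaire  # integer division truncates


ratios = {key: ratio_percent(*pair) for key, pair in RESULTS.items()}
```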

  Were these tests run on the same hardware setup?  If so was it PCI-X
or PCIe?  HCA firmware version would also be useful.

  Roland may be able to comment on whether there are performance differences
for interrupt-driven CQs between the old VAPI stacks and OFED.

  At face value these results are troubling since we are starting to
move all of our IB clusters, that use Lustre, over to OFED.

  Thanks,

	- Matt

> 
> These tests measure the bandwidth of 1MByte transfers pipelined 8 deep.
> All hardware/software was the same, apart from the IB stack and the lustre
> network driver.
> 
> The architecture of the lustre network drivers for OFED and Voltaire are
> almost identical.  Both use RC QPs with the same control message protocol
> to set up bulk data transfers using RDMA WRITE.  Control messages use a
> credit flow protocol to ensure that they are only sent when buffers are
> posted to receive them.  Concurrent transfers over the same QP are
> supported so that lustre can pipeline bulk I/O.
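
The credit scheme described above can be sketched in a few lines of
Python. This is a generic illustration of credit-based flow control, not
the Lustre driver's code; the class, its method names and the transmit
callback are all hypothetical.

```python
from collections import deque


class CreditedSender:
    """Send control messages only while the peer has receive buffers posted."""

    def __init__(self, credits, transmit):
        self.credits = credits    # receive buffers the peer has posted for us
        self.pending = deque()    # messages waiting for a credit
        self.transmit = transmit  # callable that actually posts the send

    def send(self, msg):
        self.pending.append(msg)
        self._drain()

    def credits_returned(self, count):
        # The peer reposted buffers and told us about it.
        self.credits += count
        self._drain()

    def _drain(self):
        # Each transmitted message consumes one credit.
        while self.credits > 0 and self.pending:
            self.credits -= 1
            self.transmit(self.pending.popleft())
```

A sender created with 2 credits transmits the first two messages at once
and holds the third until credits_returned() is called.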
> 
> The only difference between the lustre network drivers is that the Voltaire
> driver has a single global CQ and the OFED driver has 1 CQ per QP.  However
> the measurement above are for a single pair of nodes - in this case both
> implementations use a single CQ.
> 
> By default, the drivers pre-map all of physical memory so each RDMA
> consists of page fragments.  However, we can also compile both drivers to
> map on demand using FMR so that RDMA is not fragmented.  The results above
> compare both methods and although both drivers perform worse when mapping,
> the OFED driver takes the bigger hit.
> 
> We'd be delighted if anyone can shed any light or can suggest any steps we
> should take to discover the reason.  We're also very willing to provide
> assistance if any of the OpenFabrics developers wants to duplicate the
> setup.
> 


From sashak at voltaire.com  Mon Dec 11 16:09:11 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 Dec 2006 02:09:11 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061211054539.GL9205@mellanox.co.il>
References: <20061210233657.GB32199@sashak.voltaire.com>
	<20061211054539.GL9205@mellanox.co.il>
Message-ID: <20061212000911.GJ25052@sashak.voltaire.com>

On 07:48 Mon 11 Dec     , Michael S. Tsirkin wrote:
> > > > > > Recently I found this OFA 'Userspace Git Trees' downloading howto:
> > > > > > 
> > > > > > https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories
> > > > > > 
> > > > > > and thought that we could make it simpler for end-user to choose the
> > > > > > "right" git tree just by adding one more series of symbolic links under
> > > > > > /pub/scm. This links will point to the maintainer's "official" trees, and
> > > > > > we will have only one such link per project.
> > > > > > 
> > > > > > So typical downloading howto for end-users will looks like:
> > > > > > 
> > > > > >   git clone git://staging.openfabrics.org/dapl
> > > > > >   git clone git://staging.openfabrics.org/ibutils
> > > > > >   git clone git://staging.openfabrics.org/imgen
> > > > > >   ...
> > > > > > 
> > > > > > instead of
> > > > > > 
> > > > > >   git clone git://staging.openfabrics.org/~ardavis/dapl
> > > > > >   git clone git://staging.openfabrics.org/~eitan/ibutils
> > > > > >   git clone git://staging.openfabrics.org/~mst/imgen
> > > > > >   ...
> > > > > > 
> > > > > > as it is now.
> > > > > 
> > > > > NACK, please remove this. These soft links are messy, and
> > > > > the fact that one needs root just to add a tree shows just how the approach
> > > > > is broken.
> > > > 
> > > > No, it is not instead, but in addition to ~user/ links, so root is _not_
> > > > required to add tree.
> > > 
> > > right but suddenly root is needed to make it "official".
> > > Let's avoid the whole policy-setting-by-softlinks.
> > > "I have root" should not equal, or be required for "I say what's official".
> > 
> > What are you trying to avoid? That only sysadmin will decide which git
> > tree will be "official" for OFED and which will not?
> 
> Yes. Another point is that I should not need sysadmin priviledges to create
> a new project and declare my tree the official source.

Nothing prevents you from doing it, no?

In "worst" case we could make /pub/scm writable for dedicated group (like
'git') and add all git users to this group. I think this should be safe -
currently we have only symbolic links in this directory.

> But not only that - staging is used to develop more than just OFED.  Read
> the rant part in the original mail if you like for more detail - development
> trees should all be equal. Only releases should be official.  And release has an
> immutable name, so it does not *matter* which tree you get it from.

I don't understand how it is related. Currently we have the list of
"official" trees anyway in Wiki (as above):

https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories

, and the goal is just to make it easier for end-users to find this.
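
For illustration only, the per-project links could be maintained by
something like the Python sketch below. The on-disk layout (maintainer
trees living in per-user directories under the same root) and the
function name are assumptions, not a description of the actual server.

```python
import os


def publish_official(scm_root, official):
    """Create or refresh one symlink per project under scm_root.

    official maps a project name to the maintainer's tree, given as a
    path relative to scm_root (e.g. "dapl" -> "ardavis/dapl").
    """
    for project, tree in sorted(official.items()):
        link = os.path.join(scm_root, project)
        if os.path.islink(link):
            os.remove(link)  # repoint an existing link
        os.symlink(tree, link)
```

Anyone with write access to the root directory (for example through a
dedicated group) could run this; the per-user trees stay untouched.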

> > > These should be branches, not separate trees.
> > 
> > Why not?
> 
> You seem to have a fear of branches :).

Of course not :). I like branches and I like trees, both can be useful.

> Many trees do not buy you anything,
> I tried this with ofed 1.1 in the beginning.

Your bad experience doesn't mean that multiple trees are a bad idea -
you may find many good experiences as well (look at kernel.org for
example).

> You can have many trees. But a single project maintained by a single person
> belongs in a single public tree, scattering it around between multiple trees
> just makes it messy for people to track, and messy to figure out the delta
> between branches.

In the "rant" above you talked about equal development trees, I guess
"multiple"? What about multiple projects maintained by single person,
and single project maintained by multiple persons, and experimental
features of some existing project maintained by yet another person...

> Finally, it wastes space.

'git-clone -s' helps to save space.


Anyway I don't think that my proposition is such a "Great Idea" (or that it
requires such a fundamental discussion as branch against tree :)); it is just
a small helpful thing, mainly end-user oriented. And since there is no
strong support for this, I'm removing this now.

Sasha


From rdreier at cisco.com  Mon Dec 11 16:16:21 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 11 Dec 2006 16:16:21 -0800
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
	(Ralph Campbell's message of "Mon, 11 Dec 2006 10:02:26 -0800 (PST)")
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
	<50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
Message-ID: <adar6v6ynmy.fsf@cisco.com>

 > I would like to see this last set of patches integrated as is.
 > I would like to get more experience with the current implementation
 > before extending it to support other configurations.

Yeah, let's go with that.  Since ipath depends on 64BIT in Kconfig
anyway I think this is OK for now.

 - R.


From rdreier at cisco.com  Mon Dec 11 16:17:50 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Mon, 11 Dec 2006 16:17:50 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <076a01c71cb0$244a7630$0281a8c0@ebpc> (Eric Barton's
	message of "Sun, 10 Dec 2006 23:08:45 -0000")
References: <076a01c71cb0$244a7630$0281a8c0@ebpc>
Message-ID: <adamz5uynkh.fsf@cisco.com>

 > > No other kernel subsystem has one, so I don't think it's realistic to
 > > expect one for IB.

 > Don't you think it would be useful?  Even if only to make API changes
 > explicit?

Sure, I admit it would be useful for out-of-tree code.  But it would
also be an unmaintainable mess to actually try and have a set of
feature flags, so I don't think we can do it.

 - R.


From poknam at gmail.com  Mon Dec 11 17:24:53 2006
From: poknam at gmail.com (PN)
Date: Tue, 12 Dec 2006 09:24:53 +0800
Subject: [openib-general] Automatically connect to SRP target
In-Reply-To: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>
References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>
Message-ID: <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com>

No one can help me? :(

PN


2006/12/7, Lai Dragonfly <poknam at gmail.com>:
>
> Hi all,
>
> i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in clients and
> IBGD-1.8.2-srpt in targets.
> i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in
> openib.conf,
> i could not found the SRP target.
> until i execute "srp_daemon -e -o", i can see all the targets appear in
> /dev/sdX.
>
> since i want to export the targets to other nodes,
> any idea so that i can connect to the targets automatically in each
> reboot.
> without typing "srp_daemon -e -o" each time?
>
> thanks in advance.
>
> PN
>

From vuhuong at mellanox.com  Mon Dec 11 17:25:42 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 11 Dec 2006 17:25:42 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
Message-ID: <457E0516.2050009@mellanox.com>

James Lentini wrote:
> A couple of questions Vu:
> 
> What NFS-RDMA release are you using? This looks like release 7.
> 

Yes. I'm using release 7

> Is this reproducible?

I ran into it twice - I think that it may correlate with an 
OpenSM restart incident. I'll double-check and confirm.


> 
> What kernel version are you using?

2.6.18.5

> 
> What hardware is this on? It looks like x86-64 to me, which is fine. I 
> just want to be sure I know what I'm looking at. As many specifics as 
> possible is good (number of CPUs, hyperthreading, etc.)
> 

Dual woodcrest xeon based CPUs

> Could you send the output of 
> 
> objdump -Slr /path/to/kernel/mm/swap.o
> 

I attached the objdump output here

> Actually, just the put_page disassembly is all I want to see.
> 
> Is there any more text available? Usually there is an explanation 
> given for an oops message (e.g. "Unable to handle kernel paging 
> request..").
> 

I did not see any oops text message. System was still 
responsive with ipoib ping or login
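
Even without an explanatory message, the "Code:" bytes in the trace can
be decoded. The Python sketch below assumes the x86-64 BUG() layout of
kernels from that period (ud2; pushq $file; ret $line); treat that
layout as an assumption to check against include/asm-x86_64/bug.h for
the kernel in question. The function name is hypothetical.

```python
import struct


def decode_bug_frame(code_hex):
    """Decode 'ud2; pushq $file; ret $line' from oops Code: bytes.

    Returns a dict with the file-name pointer and line number, or None
    if the bytes do not start with that pattern.
    """
    raw = bytes.fromhex("".join(code_hex.split()))
    if len(raw) < 10:
        return None
    if raw[0:2] != b"\x0f\x0b" or raw[2] != 0x68 or raw[7] != 0xC2:
        return None  # not ud2 (0f 0b), push imm32 (68), ret imm16 (c2)
    file_ptr = struct.unpack_from("<i", raw, 3)[0]  # sign-extended imm32
    line = struct.unpack_from("<H", raw, 8)[0]
    return {"file_ptr": file_ptr & 0xFFFFFFFFFFFFFFFF, "line": line}
```

Under that assumption, the bytes "0f 0b 68 8c 41 45 80 c2 2c 01" from
the trace would indicate a BUG() at line 300 of whichever source file
the pointer names, which would fit put_page hitting an assertion rather
than a bad memory access.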


> I opened a bug at the NFS-RDMA SourceForge project to track this:
> 
> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583

thanks for your help,

-vu

> 
> Thanks for reporting this.
> james
> 
> On Fri, 8 Dec 2006, Vu Pham wrote:
> 
>> Hi James,
>>   I got these errors in server's /var/log/messages and then the server stop
>> responding to login, I/O...; however, the server is still up, ipoib is still
>> working
>>
>>
>> Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
>> [<ffffffff8025dff7>] put_page+0x17/0x40
>> Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 00010246
>> Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001
>> RCX: 000000000003ffff
>> Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001
>> RDI: ffff8102274e92f8
>> Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034
>> R09: 0000000000000000
>> Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000
>> R12: ffff81020ef96800
>> Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000
>> R15: ffff8102053ee890
>> Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
>> GS:ffff81022066eb40(0000) knlGS:0000000000000000
>> Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>> 000000008005003b
>> Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000
>> CR4: 00000000000006e0
>> Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
>> ffff810219dde000, task ffff81020d87f0c0)
>> Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 ffff81020ef96968
>> ffff81020ef96800 ffff81020ef96958
>> Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
>> ffffffff80424e05 0000000000000000
>> Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
>> ffffffff80239b90 ffff81020d87f0c0
>> Dec  8 06:38:21 ibd201 kernel: Call Trace:
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
>> :sunrpc:svc_rdma_put_context+0x37/0xd0
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>> schedule_timeout+0x95/0xb0
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] process_timeout+0x0/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
>> wait_for_completion_timeout+0xcd/0x150
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>> default_wake_function+0x0/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
>> :ib_mthca:mthca_cmd_post+0x232/0x260
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>> default_wake_function+0x0/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] __next_cpu+0x19/0x30
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
>> find_busiest_group+0x24e/0x6d0
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] thread_return+0x0/0xde
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
>> _spin_unlock_irqrestore+0x8/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
>> try_to_del_timer_sync+0x51/0x60
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] del_timer_sync+0xc/0x20
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>> schedule_timeout+0x95/0xb0
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
>> :sunrpc:svc_recv+0x416/0x510
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>> default_wake_function+0x0/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>> default_wake_function+0x0/0x10
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] :nfsd:nfsd+0x111/0x380
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] child_rip+0xa/0x12
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] child_rip+0x0/0x12
>> Dec  8 06:38:21 ibd201 kernel:
>> Dec  8 06:38:21 ibd201 kernel:
>> Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08
>> 0f 94 c0 84 c0 74
>> Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] put_page+0x17/0x40
>> Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
>>
>> -vu
>>


From vuhuong at mellanox.com  Mon Dec 11 17:31:08 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 11 Dec 2006 17:31:08 -0800
Subject: [openib-general] Automatically connect to SRP target
In-Reply-To: <92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com>
References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>
	<92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com>
Message-ID: <457E065C.6030104@mellanox.com>

PN,
   Edit file /etc/infiniband/openib.conf and set

SRPHA_ENABLE=yes

this will start srp_daemon by default
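
If this has to be done on many nodes, the edit can be scripted. The
Python helper below is a minimal sketch (the function name is made up);
it works on the file contents as a string and assumes simple KEY=value
lines, which is how openib.conf options are written in this thread.

```python
def set_option(conf_text, key, value):
    """Return conf_text with key set to value, appending it if absent."""
    lines = conf_text.splitlines()
    assignment = "%s=%s" % (key, value)
    for i, line in enumerate(lines):
        if line.strip().startswith(key + "="):
            lines[i] = assignment  # replace the existing setting
            break
    else:
        lines.append(assignment)   # no active setting found (comments ignored)
    return "\n".join(lines) + "\n"
```

Typical use would be to read /etc/infiniband/openib.conf, pass the text
through set_option(text, "SRPHA_ENABLE", "yes"), and write it back.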

-vu

> No one can help me? :(
>  
> PN
> 
>  
> 2006/12/7, Lai Dragonfly <poknam at gmail.com <mailto:poknam at gmail.com>>:
> 
>     Hi all,
>      
>     i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in
>     clients and
>     IBGD-1.8.2-srpt in targets.
>     i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in
>     openib.conf,
>     i could not found the SRP target.
>     until i execute "srp_daemon -e -o", i can see all the targets appear
>     in /dev/sdX.
>      
>     since i want to export the targets to other nodes,
>     any idea so that i can connect to the targets automatically in each
>     reboot.
>     without typing "srp_daemon -e -o" each time?
>      
>     thanks in advance.
>      
>     PN
> 
> 
> 


From vuhuong at mellanox.com  Mon Dec 11 17:32:10 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Mon, 11 Dec 2006 17:32:10 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457E0516.2050009@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com>
Message-ID: <457E069A.4020807@mellanox.com>

Hit *send* too soon - here is the objdump of swap.o

-vu


> James Lentini wrote:
>> A couple of questions Vu:
>>
>> What NFS-RDMA release are you using? This looks like release 7.
>>
> 
> Yes. I'm using release 7
> 
>> Is this reproducible?
> 
> I ran into it twice - I think that it may co-relate to 
> openSM restart incident. I'll double check it and confirm
> 
> 
>> What kernel version are you using?
> 
> 2.6.18.5
> 
>> What hardware is this on? It looks like x86-64 to me, which is fine. I 
>> just want to be sure I know what I'm looking at. As many specifics as 
>> possible is good (number of CPUs, hyperthreading, etc.)
>>
> 
> Dual woodcrest xeon based CPUs
> 
>> Could you send the output of 
>>
>> objdump -Slr /path/to/kernel/mm/swap.o
>>
> 
> I attached the objdump output here
> 
>> Actually, just the put_page disassembly is all I want to see.
>>
>> Is there any more text available? Usually there is an explanation 
>> given for an oops message (e.g. "Unable to handle kernel paging 
>> request..").
>>
> 
> I did not see any oops text message. System was still 
> responsive with ipoib ping or login
> 
> 
>> I opened a bug at the NFS-RDMA SourceForge project to track this:
>>
>> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583
> 
> thanks for your help,
> 
> -vu
> 
>> Thanks for reporting this.
>> james
>>
>> On Fri, 8 Dec 2006, Vu Pham wrote:
>>
>>> Hi James,
>>>   I got these errors in server's /var/log/messages and then the server stop
>>> responding to login, I/O...; however, the server is still up, ipoib is still
>>> working
>>>
>>>
>>> Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
>>> [<ffffffff8025dff7>] put_page+0x17/0x40
>>> Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 00010246
>>> Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 0000000000000001
>>> RCX: 000000000003ffff
>>> Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 0000000000000001
>>> RDI: ffff8102274e92f8
>>> Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 0000000000000034
>>> R09: 0000000000000000
>>> Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 0000000000000000
>>> R12: ffff81020ef96800
>>> Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 0000000000000000
>>> R15: ffff8102053ee890
>>> Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
>>> GS:ffff81022066eb40(0000) knlGS:0000000000000000
>>> Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>> Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 000000021c22b000
>>> CR4: 00000000000006e0
>>> Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
>>> ffff810219dde000, task ffff81020d87f0c0)
>>> Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 ffff81020ef96968
>>> ffff81020ef96800 ffff81020ef96958
>>> Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
>>> ffffffff80424e05 0000000000000000
>>> Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
>>> ffffffff80239b90 ffff81020d87f0c0
>>> Dec  8 06:38:21 ibd201 kernel: Call Trace:
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
>>> :sunrpc:svc_rdma_put_context+0x37/0xd0
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
>>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>> schedule_timeout+0x95/0xb0
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] process_timeout+0x0/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
>>> wait_for_completion_timeout+0xcd/0x150
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>> default_wake_function+0x0/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
>>> :ib_mthca:mthca_cmd_post+0x232/0x260
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>> default_wake_function+0x0/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] __next_cpu+0x19/0x30
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
>>> find_busiest_group+0x24e/0x6d0
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] thread_return+0x0/0xde
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
>>> _spin_unlock_irqrestore+0x8/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
>>> try_to_del_timer_sync+0x51/0x60
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] del_timer_sync+0xc/0x20
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>> schedule_timeout+0x95/0xb0
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
>>> :sunrpc:svc_recv+0x416/0x510
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>> default_wake_function+0x0/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>> default_wake_function+0x0/0x10
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] :nfsd:nfsd+0x111/0x380
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] child_rip+0xa/0x12
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] :nfsd:nfsd+0x0/0x380
>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] child_rip+0x0/0x12
>>> Dec  8 06:38:21 ibd201 kernel:
>>> Dec  8 06:38:21 ibd201 kernel:
>>> Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 f0 ff 4f 08
>>> 0f 94 c0 84 c0 74
>>> Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] put_page+0x17/0x40
>>> Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
>>>
>>> -vu
>>>
> 
> 

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: swap.objdump
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061211/a4d86fae/attachment.ksh>

From poknam at gmail.com  Mon Dec 11 18:41:08 2006
From: poknam at gmail.com (PN)
Date: Tue, 12 Dec 2006 10:41:08 +0800
Subject: [openib-general] Automatically connect to SRP target
In-Reply-To: <457E065C.6030104@mellanox.com>
References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>
	<92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com>
	<457E065C.6030104@mellanox.com>
Message-ID: <92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com>

Hi Vu,

I have two more questions.
I now have 3 SRP targets and use LVM to form a GFS system.

After setting SRPHA_ENABLE=yes, I found that sometimes (in ~30% of reboots)
one target is missed during boot.
I then need to run "srp_daemon -e -o" manually to discover the missing target.
Is there a way to make srp_daemon keep retrying until all targets are found?

Also, there is currently only one cable connected to each dual-port client.
Is it normal to see the following messages?
Dec 12 10:18:10 storage02 run_srp_daemon[5471]: starting srp_daemon:
[HCA=mthca0] [port=2]
Dec 12 10:18:13 storage02 run_srp_daemon[5483]: failed srp_daemon:
[HCA=mthca0] [port=2] [exit status=0]
Dec 12 10:18:43 storage02 run_srp_daemon[5489]: starting srp_daemon:
[HCA=mthca0] [port=2]
Dec 12 10:18:46 storage02 run_srp_daemon[5501]: failed srp_daemon:
[HCA=mthca0] [port=2] [exit status=0]
.....[repeat infinitely]


Thanks a lot,
PN


Below is the log:

Dec 12 10:17:18 storage02 network: Setting network parameters:  succeeded
Dec 12 10:17:18 storage02 network: Bringing up loopback interface:
succeeded
Dec 12 10:17:23 storage02 network: Bringing up interface eth0:  succeeded
Dec 12 10:17:23 storage02 network: Bringing up interface ib0:  succeeded
Dec 12 10:17:26 storage02 kernel:   REJ reason 0xa
Dec 12 10:17:26 storage02 kernel: ib_srp: Connection failed
Dec 12 10:17:26 storage02 kernel: scsi3 : SRP.T10:00D0680000000578
Dec 12 10:17:26 storage02 kernel:   Vendor: Mellanox  Model:
IBSRP10-TGT       Rev: 1.46
Dec 12 10:17:26 storage02 kernel:   Type:
Direct-Access                      ANSI SCSI revision: 03
Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte hdwr
sectors (81964 MB)
Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back
Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte hdwr
sectors (81964 MB)
Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back
Dec 12 10:17:26 storage02 rpcidmapd: rpc.idmapd startup succeeded
Dec 12 10:17:26 storage02 kernel:  sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7
>
Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdb at scsi3, channel
0, id 0, lun 0
Dec 12 10:17:26 storage02 kernel: scsi4 : SRP.T10:00D06800000007B2
Dec 12 10:17:26 storage02 kernel:   Vendor: Mellanox  Model: IBSRP10-TGT
hy-b  Rev: 1.46
Dec 12 10:17:26 storage02 kernel:   Type:
Direct-Access                      ANSI SCSI revision: 03
Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte hdwr
sectors (81964 MB)
Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back
Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte hdwr
sectors (81964 MB)
Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back
Dec 12 10:17:26 storage02 kernel:  sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 >
Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdc at scsi4, channel
0, id 0, lun 0
Dec 12 10:17:26 storage02 scsi.agent[3668]: disk at
/devices/pci0000:00/0000:00:02.0/0000:01:00.0/host3/target3:0:0/3:0:0:0
Dec 12 10:17:26 storage02 scsi.agent[3705]: disk at
/devices/pci0000:00/0000:00:02.0/0000:01:00.0/host4/target4:0:0/4:0:0:0
Dec 12 10:17:26 storage02 ccsd[3769]: Starting ccsd 1.0.7:
Dec 12 10:17:26 storage02 ccsd[3769]:  Built: Aug 26 2006 15:01:49
Dec 12 10:17:26 storage02 ccsd[3769]:  Copyright (C) Red Hat, Inc.  2004
All rights reserved.
Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 10
Dec 12 10:17:26 storage02 kernel: Disabled Privacy Extensions on device
ffffffff80405540(lo)
Dec 12 10:17:26 storage02 kernel: IPv6 over IPv4 tunneling driver
Dec 12 10:17:26 storage02 ccsd:  succeeded
Dec 12 10:17:26 storage02 kernel: CMAN 2.6.9-45.4.centos4 (built Aug 26 2006
14:55:55) installed
Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 30
Dec 12 10:17:26 storage02 kernel: DLM 2.6.9-42.12.centos4 (built Aug 27 2006
05:25:40) installed
Dec 12 10:17:27 storage02 ccsd[3769]: cluster.conf (cluster name =
GFS_Cluster, version = 21) found.
Dec 12 10:17:27 storage02 ccsd[3769]: Unable to perform sendto: Cannot
assign requested address
Dec 12 10:17:27 storage02 run_srp_daemon[3845]: failed srp_daemon:
[HCA=mthca0] [port=2] [exit status=0]
Dec 12 10:17:28 storage02 run_srp_daemon[3851]: starting srp_daemon:
[HCA=mthca0] [port=2]
Dec 12 10:17:29 storage02 ccsd[3769]: Remote copy of cluster.conf is from
quorate node.
Dec 12 10:17:29 storage02 ccsd[3769]:  Local version # : 21
Dec 12 10:17:29 storage02 ccsd[3769]:  Remote version #: 21
Dec 12 10:17:29 storage02 kernel: CMAN: Waiting to join or form a
Linux-cluster
Dec 12 10:17:29 storage02 kernel: CMAN: sending membership request
Dec 12 10:17:29 storage02 ccsd[3769]: Connected to cluster infrastruture
via: CMAN/SM Plugin v1.1.7.1
Dec 12 10:17:29 storage02 ccsd[3769]: Initial status:: Inquorate
Dec 12 10:17:30 storage02 kernel: CMAN: got node storage01
Dec 12 10:17:30 storage02 kernel: CMAN: got node storage03
Dec 12 10:17:30 storage02 kernel: CMAN: quorum regained, resuming activity
Dec 12 10:17:30 storage02 ccsd[3769]: Cluster is quorate.  Allowing
connections.
Dec 12 10:17:30 storage02 cman: startup succeeded
Dec 12 10:17:30 storage02 lock_gulmd: no <gulm> section detected in
/etc/cluster/cluster.conf succeeded
Dec 12 10:17:31 storage02 fenced: startup succeeded
Dec 12 10:17:31 storage02 run_srp_daemon[4196]: failed srp_daemon:
[HCA=mthca0] [port=2] [exit status=0]
Dec 12 10:17:33 storage02 run_srp_daemon[4224]: starting srp_daemon:
[HCA=mthca0] [port=2]
Dec 12 10:17:36 storage02 run_srp_daemon[4236]: failed srp_daemon:
[HCA=mthca0] [port=2] [exit status=0]
Dec 12 10:17:40 storage02 run_srp_daemon[4242]: starting srp_daemon:
[HCA=mthca0] [port=2]
Dec 12 10:17:42 storage02 clvmd: Cluster LVM daemon started - connected to
CMAN
Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 on
node storage01
Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 on
node storage03
Dec 12 10:17:42 storage02 clvmd: clvmd startup succeeded
Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid
'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes for
volume group gfsvg.
Dec 12 10:17:42 storage02 vgchange:
Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid
'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes for
volume group gfsvg.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid
'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes for
volume group gfsvg.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid
'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes for
volume group gfsvg.
Dec 12 10:17:42 storage02 vgchange:   Volume group "gfsvg" not found
Dec 12 10:17:42 storage02 clvmd: Activating VGs: failed
Dec 12 10:17:42 storage02 netfs: Mounting other filesystems:  succeeded
Dec 12 10:17:42 storage02 kernel: Lock_Harness 2.6.9-58.2.centos4 (built Aug
27 2006 05:27:43) installed
Dec 12 10:17:42 storage02 kernel: GFS 2.6.9-58.2.centos4 (built Aug 27 2006
05:28:00) installed
Dec 12 10:17:42 storage02 mount: mount: special device /dev/gfsvg/gfslv does
not exist
Dec 12 10:17:42 storage02 gfs: Mounting GFS filesystems:  failed
Dec 12 10:17:42 storage02 kernel: i2c /dev entries driver
.....


2006/12/12, Vu Pham <vuhuong at mellanox.com>:
>
> PN,
>   Edit file /etc/infiniband/openib.conf and set
>
> SRPHA_ENABLE=yes
>
> this will start srp_daemon by default
>
> -vu
>
> > No one can help me? :(
> >
> > PN
> >
> >
> > 2006/12/7, Lai Dragonfly <poknam at gmail.com <mailto:poknam at gmail.com>>:
> >
> >     Hi all,
> >
> >     i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in
> >     clients and
> >     IBGD-1.8.2-srpt in targets.
> >     i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in
> >     openib.conf,
> >     i could not found the SRP target.
> >     until i execute "srp_daemon -e -o", i can see all the targets appear
> >     in /dev/sdX.
> >
> >     since i want to export the targets to other nodes,
> >     any idea so that i can connect to the targets automatically in each
> >     reboot.
> >     without typing "srp_daemon -e -o" each time?
> >
> >     thanks in advance.
> >
> >     PN
> >
> >
> >

From vishal at endace.com  Mon Dec 11 20:51:49 2006
From: vishal at endace.com (vishal)
Date: Tue, 12 Dec 2006 17:51:49 +1300
Subject: [openib-general] srp initiator device discovery
In-Reply-To: <mailman.374.1165886944.18259.openib-general@openib.org>
References: <mailman.374.1165886944.18259.openib-general@openib.org>
Message-ID: <1165899109.14308.9.camel@julia.et.endace.com>

Hi,

   I have srp initiator installed with OFED-1.1, and another machine
with SRP target (IBGOLD). I started the srp daemon to discover the
target devices, and then ran fdisk -l to see the list. The list (below)
shows duplicate devices :-

Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdd: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1          13      104391   83  Linux
/dev/sdd2              14       60801   488279610   8e  Linux LVM

Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sde doesn't contain a valid partition table

Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes
255 heads, 63 sectors/track, 267349 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System

Disk /dev/sdg: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdg1   *           1          13      104391   83  Linux
/dev/sdg2              14       60801   488279610   8e  Linux LVM


Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious).
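
A first-pass way to spot candidate duplicates in a listing like the one above is to group the "Disk ... bytes" lines by size. This is only a sketch: the function name is made up for illustration, and equal size does not prove two nodes are the same disk (here sdb, sdc, sde and sdf all report the same size), so the groups still need to be confirmed by other means.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches lines such as "Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes".
# Lines like "Disk /dev/sdb doesn't contain ..." have no ": " after the
# device name and are therefore skipped.
DISK_LINE = re.compile(r"^Disk (/dev/\S+): .*?(\d+) bytes", re.MULTILINE)


def group_disks_by_size(fdisk_output: str) -> Dict[int, List[str]]:
    """Group device nodes by reported size; keep only sizes seen more than once."""
    groups = defaultdict(list)
    for device, size in DISK_LINE.findall(fdisk_output):
        groups[int(size)].append(device)
    return {size: devs for size, devs in groups.items() if len(devs) > 1}
```

Feeding it the text printed by fdisk -l returns a dictionary keyed by size in bytes, with the device nodes that share that size as values.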

I also tested the device discovery after creating an md device on the
target side, and found that the initiator doesn't take into account the
presence of an md device. Is this the expected behaviour ?

Thanks for your time!

Vishal


From mst at mellanox.co.il  Mon Dec 11 21:42:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:42:23 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212000911.GJ25052@sashak.voltaire.com>
References: <20061210233657.GB32199@sashak.voltaire.com>
	<20061211054539.GL9205@mellanox.co.il>
	<20061212000911.GJ25052@sashak.voltaire.com>
Message-ID: <20061212054223.GB11064@mellanox.co.il>

Sasha, one small request: could you please fix description for your trees?
It should hopefully say something like "mirror of svn for <path>".

Thanks very much,
       MST

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 21:46:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:46:36 +0200
Subject: [openib-general] ~bos/ipathverbs
Message-ID: <20061212054636.GC11064@mellanox.co.il>

Bryan, could you please change the description for your tree?
gitweb summary page only shows first 3 words, so it now says
"Userspace Infiniband verbs ..."
and this does not make it clear it's not a generic verbs tree.

Can you make it "Qlogic ipath userspace support", or something in that style,
please?

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 21:58:41 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 07:58:41 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212000911.GJ25052@sashak.voltaire.com>
References: <20061210233657.GB32199@sashak.voltaire.com>
	<20061211054539.GL9205@mellanox.co.il>
	<20061212000911.GJ25052@sashak.voltaire.com>
Message-ID: <20061212055841.GD11064@mellanox.co.il>

> > Finally, it wastes space.
> 
> 'git-clone -s' helps to save space.

BTW, be careful with that: it seems clone -s might lose your data if the repository
you clone from removes some heads and prunes history.
So it's only safe to clone in this way from Linus who knows never to do this :)
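
For context: a clone made with -s does not copy objects; it records the source repository's object directory in .git/objects/info/alternates and borrows from it, which is why pruning in the source can hurt the clone. A minimal sketch for checking whether a working repository depends on another object store (the helper name is an illustration, not a git API, and a non-bare layout is assumed):

```python
import os


def alternates_of(repo_path: str) -> list:
    """Return the object directories a non-bare repository borrows from.

    An empty list means the repository has no alternates file and so
    does not depend on another repository's objects.
    """
    alt_file = os.path.join(repo_path, ".git", "objects", "info", "alternates")
    if not os.path.isfile(alt_file):
        return []
    with open(alt_file, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]
```

If the list is non-empty, the git-clone documentation describes running "git repack -a" in the clone to copy the borrowed objects locally.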

-- 
MST


From mst at mellanox.co.il  Mon Dec 11 22:03:34 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 08:03:34 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <1165879256.19459.379.camel@localhost>
References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>
	<1165879256.19459.379.camel@localhost>
Message-ID: <20061212060334.GE11064@mellanox.co.il>

> > Hi,
> > 
> > We'd dearly like some help to understand why we seem to be having
> > performance issues with OFED.  When we run a lustre network bandwidth
> > benchmark, we find significant performance degradation on OFED versus
> > Voltaire...
> > 
> >              Premap (256 RDMA frags)     Map on demand (1 RDMA frag)
> >              Voltaire  OFED  Ratio       Voltaire  OFED  Ratio 
> > Writes MB/s  682       567   83 %        577       436   75 %
> > Reads MB/s   658       554   84 %        555       432   77 %
> 
>   Were these tests run on the same hardware setup?  If so, was it PCI-X
> or PCIe?  HCA firmware version would also be useful.

Good point, Matt, thanks!
This gives me an idea: try loading mthca with tune_pci=1.
If this helps, this is a BIOS issue.
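
As a cross-check of the quoted table, the ratios can be recomputed from the MB/s figures. A small sketch (the dictionary keys and function name are just labels chosen here; the numbers are copied from the table):

```python
# Throughput figures (MB/s) copied from the quoted benchmark table,
# stored as (Voltaire, OFED) pairs.
RESULTS = {
    ("premap", "writes"): (682, 567),
    ("premap", "reads"): (658, 554),
    ("map_on_demand", "writes"): (577, 436),
    ("map_on_demand", "reads"): (555, 432),
}


def ofed_ratio_percent(voltaire_mbps: int, ofed_mbps: int) -> float:
    """OFED throughput as a percentage of the Voltaire figure."""
    return 100.0 * ofed_mbps / voltaire_mbps


for (mode, op), (voltaire, ofed) in RESULTS.items():
    print(f"{mode:14s} {op:6s} {ofed_ratio_percent(voltaire, ofed):5.1f} %")
```

The recomputed values are about 83.1, 84.2, 75.6 and 77.8 percent, so the table's ratios look truncated rather than rounded.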

-- 
MST


From ramachandra.kuchimanchi at qlogic.com  Mon Dec 11 22:28:05 2006
From: ramachandra.kuchimanchi at qlogic.com (Ramachandra Kuchimanchi)
Date: Tue, 12 Dec 2006 00:28:05 -0600
Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover
 from secondary path back to primary path
In-Reply-To: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>
References: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>
Message-ID: <C07C40DB2364324799506DE8FF12F8D81A125F@EPEXCH1.qlogic.org>

Roland,

Did you get a chance to look at these patches ?

Regards,
Ram

> -----Original Message-----
> From: openib-general-bounces at openib.org [mailto:openib-general-
> bounces at openib.org] On Behalf Of Ramachandra K
> Sent: Thursday, December 07, 2006 4:33 PM
> To: Roland Dreier
> Cc: Openib-General
> Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover from
> secondary path back to primary path
> 
> This fixes a bug due to which failover from secondary path back to primary
> path was not working.
> 
> Signed-off-by: Ramachandra K <ramachandra.kuchimanchi at qlogic.com>
> ---
> 
>  drivers/infiniband/ulp/vnic/vnic_ib.c   |    4 +++-
>  drivers/infiniband/ulp/vnic/vnic_main.c |    9 +++++----
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/vnic/vnic_ib.c
> b/drivers/infiniband/ulp/vnic/vnic_ib.c
> index 6196e20..56ae9f7 100644
> --- a/drivers/infiniband/ulp/vnic/vnic_ib.c
> +++ b/drivers/infiniband/ulp/vnic/vnic_ib.c
> @@ -303,10 +303,12 @@ int vnic_ib_get_path(struct netpath *net
>  			       " path record query\n",
>  			       config->path_info.status);
> 
> -		netpath_timer(netpath, vnic->config->no_path_timeout);
>  		ret = config->path_info.status;
>  	}
>  out:
> +	if (ret)
> +		netpath_timer(netpath, vnic->config->no_path_timeout);
> +
>  	return ret;
>  }
> 
> diff --git a/drivers/infiniband/ulp/vnic/vnic_main.c
> b/drivers/infiniband/ulp/vnic/vnic_main.c
> index fca2b90..e15d3f9 100644
> --- a/drivers/infiniband/ulp/vnic/vnic_main.c
> +++ b/drivers/infiniband/ulp/vnic/vnic_main.c
> @@ -710,17 +710,18 @@ static struct vnic * vnic_handle_npevent
>  	case VNIC_PRINP_TIMEREXPIRED:
>  		netpath = &vnic->primary_path;
>  		netpath->timer_state = NETPATH_TS_EXPIRED;
> -		if (netpath->carrier)
> +		if (!netpath->carrier)
>  			update_path_and_reconnect(netpath, vnic);
>  		break;
>  	case VNIC_SECNP_TIMEREXPIRED:
>  		netpath = &vnic->secondary_path;
>  		netpath->timer_state = NETPATH_TS_EXPIRED;
> -		if (netpath->carrier) {
> +		if (!netpath->carrier)
> +			update_path_and_reconnect(netpath, vnic);
> +		else {
>  			if (vnic->state == VNIC_UNINITIALIZED)
>  				vnic_npevent_register(vnic, netpath);
> -		} else
> -			update_path_and_reconnect(netpath, vnic);
> +		}
>  		break;
>  	case VNIC_PRINP_LINKUP:
>  		vnic->primary_path.carrier = 1;
> 
> 
> 


From mst at mellanox.co.il  Mon Dec 11 22:48:47 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 08:48:47 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <adad56q1g3t.fsf@cisco.com>
References: <adad56q1g3t.fsf@cisco.com>
Message-ID: <20061212064847.GB13509@mellanox.co.il>

> I think we could probably merge it but maybe it's better to put it in
> -mm for a cycle given that it's new and not too many people have
> looked at it yet.  And I still haven't gotten comfortable with the way
> CM is enabled.

Now I'm confused. Bottom line: should I try fixing up the enabling bit ASAP,
or do you not want it in 2.6.20 anyway?

-- 
MST


From yhkim93 at keti.re.kr  Mon Dec 11 23:02:05 2006
From: yhkim93 at keti.re.kr (=?ks_c_5601-1987?B?sei/tciv?=)
Date: Tue, 12 Dec 2006 16:02:05 +0900
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
Message-ID: <20061212070219.E733C3B0009@sentry-two.sandia.gov>

I am developing an InfiniBand storage system on an IBM PPC 440SPe at
667 MHz, so I cross-compiled the InfiniBand sources for ppc. However, the
following messages appeared on the console during boot. What is the problem?
I think it happens during DMA allocation. Is anybody developing the
InfiniBand driver on ppc? And is there any InfiniBand source that supports
ppc? Please help me.

Thanks, as always, to the openib members for their help.

 
============================================================================
==========================

Waiting for PHY auto negotiation to complete... done

ENET Speed is 1000 Mbps - FULL duplex connection

Using ppc_4xx_eth0 device

TFTP from server 192.168.1.1; our IP address is 192.168.1.10

Filename 'yucca/uImage'.

Load address: 0x200000

Loading: T #################################################################

         #################################################################

         #################################################################

         ###################################################

done

Bytes transferred = 1255776 (132960 hex)

## Booting image at 00200000 ...

   Image Name:   Linux-2.6.19

   Image Type:   PowerPC Linux Kernel Image (gzip compressed)

   Data Size:    1255712 Bytes =  1.2 MB

   Load Address: 00000000

   Entry Point:  00000000

   Verifying Checksum ... OK

   Uncompressing Kernel Image ... OK

Linux version 2.6.19 (root at yhkim-devpc) (gcc version 4.0.0) #2 Fri Dec 8 11:18:08 KST 2006

PCIE:1 successfully set as rootpoint

vendor-id 0xaaa1

device-id 0xbed1

Yucca port (Roland Dreier <rolandd at cisco.com>)

Zone PFN ranges:

  DMA             0 ->   196608

  Normal     196608 ->   196608

early_node_map[1] active PFN ranges

    0:        0 ->   196608

Built 1 zonelists.  Total pages: 195072

Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.250

PID hash table entries: 4096 (order: 12, 16384 bytes)

Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)

Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)

Memory: 776704k available (1900k kernel code, 592k data, 148k init, 0k highmem)

Mount-cache hash table entries: 512

NET: Registered protocol family 16

PCI: Probing PCI hardware

NET: Registered protocol family 2

IP route cache hash table entries: 32768 (order: 5, 131072 bytes)

TCP established hash table entries: 131072 (order: 7, 524288 bytes)

TCP bind hash table entries: 65536 (order: 6, 262144 bytes)

TCP: Hash tables configured (established 131072 bind 65536)

TCP reno registered

io scheduler noop registered

io scheduler anticipatory registered (default)

io scheduler deadline registered

io scheduler cfq registered

Generic RTC Driver v1.07

Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled

serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A

serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A

serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A

RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize

PPC 4xx OCP EMAC driver, version 3.54

mal0: initialized, 1 TX channels, 1 RX channels

eth0: emac0, MAC 00:04:ac:01:ca:fe

eth0: found CIS8201 Gigabit Ethernet PHY (0x01)

ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)

ib_mthca: Initializing 0001:01:01.0

kernel BUG in __dma_alloc_coherent at arch/ppc/kernel/dma-mapping.c:233!

Oops: Exception in kernel mode, sig: 5 [#1]

NIP: C0004904 LR: C00048D0 CTR: 00000000

REGS: c0981c90 TRAP: 0700   Not tainted  (2.6.19)

MSR: 00029000 <EE,ME>  CR: 88FF4F82  XER: 00000000

TASK = c096db70[1] 'swapper' THREAD: c0980000

GPR00: 00000001 C0981D40 C096DB70 C0885840 00000000 0000001F EF4BAFFC 00029000

GPR08: C021E410 00000000 C097B828 00000000 28FF4F88 00000000 3FFE6500 00000001

GPR16: 007FFF93 00000000 00800000 FFFFFFFF 007FFF00 C0280000 C0220000 00000000

GPR24: EF48F3E0 C021E410 FF2FF000 C0981D9C C0885860 C09A3000 C0885840 00001000

NIP [C0004904] __dma_alloc_coherent+0x20c/0x2d8

LR [C00048D0] __dma_alloc_coherent+0x1d8/0x2d8

Call Trace:

[C0981D40] [C0004828] __dma_alloc_coherent+0x130/0x2d8 (unreliable)

[C0981D80] [C0273404] mthca_create_eq+0x338/0x438

[C0981DE0] [C0273668] mthca_init_eq_table+0x164/0x6c0

[C0981E20] [C0146A44] __mthca_init_one+0x924/0xbf4

[C0981E70] [C0272F08] mthca_init_one+0x74/0xbc

[C0981E90] [C00F6FE4] pci_device_probe+0x7c/0xa0

[C0981EB0] [C010FB58] really_probe+0x54/0x13c

[C0981ED0] [C011004C] __driver_attach+0xcc/0xf8

[C0981EF0] [C010EE7C] bus_for_each_dev+0x54/0x90

[C0981F20] [C010F958] driver_attach+0x24/0x34

[C0981F30] [C010F4B0] bus_add_driver+0x84/0x168

[C0981F50] [C011034C] driver_register+0x68/0xb0

[C0981F60] [C00F6C64] __pci_register_driver+0x98/0xa8

[C0981F70] [C02720D0] mthca_init+0x60/0x8c

[C0981F80] [C0001124] init+0x98/0x2a4

[C0981FF0] [C0003DA0] kernel_thread+0x44/0x60

Instruction dump:

3d20c028 8169d0e0 7c00f050 54003826 7c005a14 901b0000 815d0004 39200000

7d205379 38000000 41820008 38000001 <0f000000> 38000400 7d60f028 7d6b0378

Kernel panic - not syncing: Attempted to kill init!

 <0>Rebooting in 1 seconds..

 
U-Boot 1.1.6 (Dec  7 2006 - 16:36:13)

 
CPU:   AMCC PowerPC 440SPe Rev. B at 533.328 MHz (PLB=133, OPB=66, EBC=66 MHz)

       I2C boot EEPROM enabled

       Bootstrap Option D - Boot ROM Location I2C (Addr 0x50)

       Internal PCI arbiter enabled

       32 kB I-Cache 32 kB D-Cache

Board: Yucca - AMCC 440SPe Evaluation Board

I2C:   ready

DRAM:  1024 MB

FLASH:  5 MB

PCI:   Bus Dev VenId DevId Class Int

PCIE:1 successfully set as rootpoint

        01  01  15b3  6282  0c06  00

In:    serial

Out:   serial

Err:   serial

Net:   ppc_4xx_eth0

 

From eitan at mellanox.co.il  Mon Dec 11 23:10:15 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 12 Dec 2006 09:10:15 +0200
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <1165870759.21606.18477.camel@hal.voltaire.com>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<1165617195.26559.4435.camel@hal.voltaire.com>
	<457AC99E.8050402@mellanox.co.il>
	<1165870759.21606.18477.camel@hal.voltaire.com>
Message-ID: <457E55D7.5070603@mellanox.co.il>

Hal Rosenstock wrote:
> On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote:
>   
>> Hal Rosenstock wrote:
>>     
>>> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
>>>   
>>>       
>>>> Hal Rosenstock wrote:
>>>>     
>>>>         
>>>>> Hi Eitan,
>>>>>
>>>>> Just wanted to close the loop on the OpenSM issues of the last couple
>>>>> days.
>>>>>
>>>>> 1. When can you supply an OpenSM verbose log for the InformInfo
>>>>> subscribe problem you reported earlier today ? Failing that, I don't
>>>>> know how to reproduce this.
>>>>>   
>>>>>       
>>>>>           
>>>> Attached
>>>>     
>>>>         
>> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure.
>>     
>
> Any idea on when you will have a chance to look into this ?
>   
Maybe by the weekend.
>   
>>>>> 4. I encourage you to look at and comment on the OpenSM patches rather
>>>>> than waiting for them to be in the tree.
>>>>>   
>>>>>       
>>>>>           
>>>> I am sure you did not mean to, but now I have to admit my limited skills 
>>>> in catching bugs by reading patches :-( .
>>>>     
>>>>         
>>> Not just read, but they are there to try out as well.
>>>   
>>>       
>> I will need an automatic flow for that sake. I can not keep up with the 
>> amount of patches manually.
>> But I do not know how to automatically convert the mails into patches 
>> into a tree.
>>     
>>> You could try out the patches and do the same thing before they are
>>> committed.
>>>
>>>   
>>>       
>> I have automation based on the committed tree that pull it (git trem) , 
>> compile and run regression.
>> Actually this is how all other code is handled too.
>>     
>
> Are you referring to OFED ?
>   
No, the current Git tree under
git://staging.openfabrics.org/~halr/management.git
> In the case of OFED, where do those "special" trees/branches come from ?
>   
No. I think we have a misunderstanding:
I am not proposing to use a pre-commit branch.
But if there is no such branch, I cannot do pre-commit testing.
I think it is fine to have post-commit bug reports. No big deal.
We branch when we go to an OFED release.
Then I have two regressions running every night: one on the trunk and one on
the OFED branch.
This is how things were for OFED1.1 and OFED1.0.

It is your call whether we need a "stable" trunk and an experimental
branch so that I can test pre-trunk patches.

What I will not be able to do is build an automatic system that selects
which patches to include in the regression, etc.

Eitan
> -- Hal
>
>
>


From ogerlitz at voltaire.com  Tue Dec 12 00:51:58 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 12 Dec 2006 10:51:58 +0200
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <adar6v6ynmy.fsf@cisco.com>
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
	<50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
	<adar6v6ynmy.fsf@cisco.com>
Message-ID: <457E6DAE.3040206@voltaire.com>

Roland Dreier wrote:
>  > I would like to see this last set of patches integrated as is.
>  > I would like to get more experience with the current implementation
>  > before extending it to support other configurations.
> 
> Yeah, let's go with that.  Since ipath depends on 64BIT in Kconfig
> anyway I think this is OK for now.

Having ib_dma_map_single, ib_sg_dma_address, etc. return u64
instead of dma_addr_t makes the resulting patch to the IB ULPs
quite big.

Have you tested this code with any single-mapping consumer (e.g. IPoIB)
and any sg consumer (e.g. SRP or iSER)?

Or.


From ogerlitz at voltaire.com  Tue Dec 12 00:57:32 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 12 Dec 2006 10:57:32 +0200
Subject: [openib-general] [PATCH] librdmacm: updated librdmacm to work
 with proposed 2.6.20 kernel CMA
In-Reply-To: <457D9B4A.6010507@ichips.intel.com>
References: <000001c6f877$23298c80$52fc070a@amr.corp.intel.com>
	<457BDF15.6090608@voltaire.com> <457D9B4A.6010507@ichips.intel.com>
Message-ID: <457E6EFC.6030601@voltaire.com>

Sean Hefty wrote:
> Can you just send a signed-off-by line?  I'll add the patch to the 
> librdmacm multicast branch.

> fix rdma_leave_multicast return code on the success path
> 
> Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
> 
> --- librdmacm/src/cma.c 2006-12-10 12:55:03.000000000 +0200
> +++ librdmacm-multicast/src/cma.c       2006-12-10 13:15:12.000000000 +0200
> @@ -1015,6 +1015,8 @@ int rdma_leave_multicast(struct rdma_cm_
>         ret = write(id->channel->fd, msg, size);
>         if (ret != size)
>                 ret = (ret > 0) ? -ENODATA : ret;
> +       else
> +               ret = 0;
> 
>         pthread_mutex_lock(&id_priv->mut);
>         while (mc->events_completed < resp->events_reported)
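
The return-code handling in the quoted hunk can be modelled outside the library. The sketch below is a Python model of that C logic only, not librdmacm code; the function name is made up for illustration, and ret and size stand for the value returned by write() and the number of bytes requested.

```python
import errno


def normalize_write_result(ret: int, size: int) -> int:
    """Map a write() result to 0 on success or a negative error value.

    Mirrors the patched logic: a full write becomes 0, a short positive
    write becomes -ENODATA, and anything else is passed through as-is.
    """
    if ret != size:
        return -errno.ENODATA if ret > 0 else ret
    return 0
```

Before the patch, a full write left ret equal to size, so the success path handed back a positive byte count instead of 0. Note that a write() result of 0 is passed through unchanged by this logic.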


From vuhuong at mellanox.com  Tue Dec 12 00:58:01 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 00:58:01 -0800
Subject: [openib-general] Automatically connect to SRP target
In-Reply-To: <92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com>
References: <92daa7bf0612070202m2712f971t18477d2ef50a9618@mail.gmail.com>
	<92daa7bf0612111724p16124f17r208849124ca7ec64@mail.gmail.com>
	<457E065C.6030104@mellanox.com>
	<92daa7bf0612111841l70e4a653ked1d93ec1dc9f91@mail.gmail.com>
Message-ID: <457E6F19.90103@mellanox.com>

PN wrote:
> Hi Vu,
>  
> i have 2 more questions,
> now i have 3 srp targets and use LVM to form a GFS system.
>  
> after setting SRPHA_ENABLE=yes, i found that sometimes (~30%) it will 
> miss a target during reboot.
> i need to manually type "srp_daemon -e -o" to discover the missing target.
> is there any method such that the srp_daemon will repeat to try to 
> ensure all targets were found?
>  

Probably you didn't have a clean shutdown, so the srp target
still had the previous connection around (it has no mechanism
for cleaning up dead connections by itself), and on the next
login the srp target rejected the login request.

However, srp_daemon rescans the fabric every 60 seconds and
should pick up a target that was missed by the previous scan.
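
If a boot script has to wait until every expected target is visible before activating LVM, the waiting can be sketched as a bounded retry loop. This is only an illustration under stated assumptions: discover is a hypothetical callable that triggers a scan (for example by running "srp_daemon -e -o") and returns the set of target names currently visible, and the attempt count and delay are arbitrary.

```python
import time
from typing import Callable, Set


def wait_for_targets(
    discover: Callable[[], Set[str]],
    expected: Set[str],
    attempts: int = 5,
    delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Call discover() until every expected target is seen or attempts run out.

    Returns the set of targets that are still missing (empty on success).
    """
    missing = set(expected)
    for attempt in range(attempts):
        missing -= discover()
        if not missing:
            break
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return missing
```

A caller would check the returned set and only start clvmd and mount GFS when it is empty.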


> also, currently there is only 1 cable connect to each dual ports client.
> is it normal to have the following messages? 
> Dec 12 10:18:10 storage02 run_srp_daemon[5471]: starting srp_daemon: 
> [HCA=mthca0] [port=2]
> Dec 12 10:18:13 storage02 run_srp_daemon[5483]: failed srp_daemon: 
> [HCA=mthca0] [port=2] [exit status=0]
> Dec 12 10:18:43 storage02 run_srp_daemon[5489]: starting srp_daemon: 
> [HCA=mthca0] [port=2]
> Dec 12 10:18:46 storage02 run_srp_daemon[5501]: failed srp_daemon: 
> [HCA=mthca0] [port=2] [exit status=0]
> .....[repeat infinitely]


This is fine. The srp_daemon for port 2 keeps running and 
will detect any target on the fabric once you plug the cable 
in; otherwise, there is no ill effect apart from these 
annoying error messages.

-vu

> 
>  
> Thanks a lot,
> PN
>  
> 
> Below is the log:
>  
> Dec 12 10:17:18 storage02 network: Setting network parameters:  succeeded
> Dec 12 10:17:18 storage02 network: Bringing up loopback interface:  
> succeeded
> Dec 12 10:17:23 storage02 network: Bringing up interface eth0:  succeeded
> Dec 12 10:17:23 storage02 network: Bringing up interface ib0:  succeeded
> Dec 12 10:17:26 storage02 kernel:   REJ reason 0xa
> Dec 12 10:17:26 storage02 kernel: ib_srp: Connection failed
> Dec 12 10:17:26 storage02 kernel: scsi3 : SRP.T10:00D0680000000578
> Dec 12 10:17:26 storage02 kernel:   Vendor: Mellanox  Model: 
> IBSRP10-TGT       Rev: 1.46
> Dec 12 10:17:26 storage02 kernel:   Type:   
> Direct-Access                      ANSI SCSI revision: 03
> Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte 
> hdwr sectors (81964 MB)
> Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back
> Dec 12 10:17:26 storage02 kernel: SCSI device sdb: 160086528 512-byte 
> hdwr sectors (81964 MB)
> Dec 12 10:17:26 storage02 kernel: SCSI device sdb: drive cache: write back
> Dec 12 10:17:26 storage02 rpcidmapd: rpc.idmapd startup succeeded
> Dec 12 10:17:26 storage02 kernel:  sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 
> sdb7 >
> Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdb at scsi3, 
> channel 0, id 0, lun 0
> Dec 12 10:17:26 storage02 kernel: scsi4 : SRP.T10:00D06800000007B2
> Dec 12 10:17:26 storage02 kernel:   Vendor: Mellanox  Model: IBSRP10-TGT 
> hy-b  Rev: 1.46
> Dec 12 10:17:26 storage02 kernel:   Type:   
> Direct-Access                      ANSI SCSI revision: 03
> Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte 
> hdwr sectors (81964 MB)
> Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back
> Dec 12 10:17:26 storage02 kernel: SCSI device sdc: 160086528 512-byte 
> hdwr sectors (81964 MB)
> Dec 12 10:17:26 storage02 kernel: SCSI device sdc: drive cache: write back
> Dec 12 10:17:26 storage02 kernel:  sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 sdc6 >
> Dec 12 10:17:26 storage02 kernel: Attached scsi disk sdc at scsi4, 
> channel 0, id 0, lun 0
> Dec 12 10:17:26 storage02 scsi.agent[3668]: disk at 
> /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host3/target3:0:0/3:0:0:0
> Dec 12 10:17:26 storage02 scsi.agent[3705]: disk at 
> /devices/pci0000:00/0000:00:02.0/0000:01:00.0/host4/target4:0:0/4:0:0:0
> Dec 12 10:17:26 storage02 ccsd[3769]: Starting ccsd 1.0.7:
> Dec 12 10:17:26 storage02 ccsd[3769]:  Built: Aug 26 2006 15:01:49
> Dec 12 10:17:26 storage02 ccsd[3769]:  Copyright (C) Red Hat, Inc.  
> 2004  All rights reserved.
> Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 10
> Dec 12 10:17:26 storage02 kernel: Disabled Privacy Extensions on device 
> ffffffff80405540(lo)
> Dec 12 10:17:26 storage02 kernel: IPv6 over IPv4 tunneling driver
> Dec 12 10:17:26 storage02 ccsd:  succeeded
> Dec 12 10:17:26 storage02 kernel: CMAN 2.6.9-45.4.centos4 (built Aug 26 
> 2006 14:55:55) installed
> Dec 12 10:17:26 storage02 kernel: NET: Registered protocol family 30
> Dec 12 10:17:26 storage02 kernel: DLM 2.6.9-42.12.centos4 (built Aug 27 
> 2006 05:25:40) installed
> Dec 12 10:17:27 storage02 ccsd[3769]: cluster.conf (cluster name = 
> GFS_Cluster, version = 21) found.
> Dec 12 10:17:27 storage02 ccsd[3769]: Unable to perform sendto: Cannot 
> assign requested address
> Dec 12 10:17:27 storage02 run_srp_daemon[3845]: failed srp_daemon: 
> [HCA=mthca0] [port=2] [exit status=0]
> Dec 12 10:17:28 storage02 run_srp_daemon[3851]: starting srp_daemon: 
> [HCA=mthca0] [port=2]
> Dec 12 10:17:29 storage02 ccsd[3769]: Remote copy of cluster.conf is 
> from quorate node.
> Dec 12 10:17:29 storage02 ccsd[3769]:  Local version # : 21
> Dec 12 10:17:29 storage02 ccsd[3769]:  Remote version #: 21
> Dec 12 10:17:29 storage02 kernel: CMAN: Waiting to join or form a 
> Linux-cluster
> Dec 12 10:17:29 storage02 kernel: CMAN: sending membership request
> Dec 12 10:17:29 storage02 ccsd[3769]: Connected to cluster infrastruture 
> via: CMAN/SM Plugin v1.1.7.1
> Dec 12 10:17:29 storage02 ccsd[3769]: Initial status:: Inquorate
> Dec 12 10:17:30 storage02 kernel: CMAN: got node storage01
> Dec 12 10:17:30 storage02 kernel: CMAN: got node storage03
> Dec 12 10:17:30 storage02 kernel: CMAN: quorum regained, resuming activity
> Dec 12 10:17:30 storage02 ccsd[3769]: Cluster is quorate.  Allowing 
> connections.
> Dec 12 10:17:30 storage02 cman: startup succeeded
> Dec 12 10:17:30 storage02 lock_gulmd: no <gulm> section detected in 
> /etc/cluster/cluster.conf succeeded
> Dec 12 10:17:31 storage02 fenced: startup succeeded
> Dec 12 10:17:31 storage02 run_srp_daemon[4196]: failed srp_daemon: 
> [HCA=mthca0] [port=2] [exit status=0]
> Dec 12 10:17:33 storage02 run_srp_daemon[4224]: starting srp_daemon: 
> [HCA=mthca0] [port=2]
> Dec 12 10:17:36 storage02 run_srp_daemon[4236]: failed srp_daemon: 
> [HCA=mthca0] [port=2] [exit status=0]
> Dec 12 10:17:40 storage02 run_srp_daemon[4242]: starting srp_daemon: 
> [HCA=mthca0] [port=2]
> Dec 12 10:17:42 storage02 clvmd: Cluster LVM daemon started - connected 
> to CMAN
> Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 
> on node storage01
> Dec 12 10:17:42 storage02 kernel: CMAN: WARNING no listener for port 11 
> on node storage03
> Dec 12 10:17:42 storage02 clvmd: clvmd startup succeeded
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid 
> 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes 
> for volume group gfsvg.
> Dec 12 10:17:42 storage02 vgchange:
> Dec 12 10:17:42 storage02 vgchange: Couldn't find device with uuid 
> 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes 
> for volume group gfsvg.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid 
> 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes 
> for volume group gfsvg.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find device with uuid 
> 'U8viRP-K6Ev-0HlZ-5pwK-09co-tXgh-sJJKXT'.
> Dec 12 10:17:42 storage02 vgchange:   Couldn't find all physical volumes 
> for volume group gfsvg.
> Dec 12 10:17:42 storage02 vgchange:   Volume group "gfsvg" not found
> Dec 12 10:17:42 storage02 clvmd: Activating VGs: failed
> Dec 12 10:17:42 storage02 netfs: Mounting other filesystems:  succeeded
> Dec 12 10:17:42 storage02 kernel: Lock_Harness 2.6.9-58.2.centos4 (built 
> Aug 27 2006 05:27:43) installed
> Dec 12 10:17:42 storage02 kernel: GFS 2.6.9-58.2.centos4 (built Aug 27 
> 2006 05:28:00) installed
> Dec 12 10:17:42 storage02 mount: mount: special device /dev/gfsvg/gfslv 
> does not exist
> Dec 12 10:17:42 storage02 gfs: Mounting GFS filesystems:  failed
> Dec 12 10:17:42 storage02 kernel: i2c /dev entries driver
> .....
>  
>  
>  
>  
>  
>  
> 2006/12/12, Vu Pham <vuhuong at mellanox.com <mailto:vuhuong at mellanox.com>>:
> 
>     PN,
>       Edit file /etc/infiniband/openib.conf and set
> 
>     SRPHA_ENABLE=yes
> 
>     this will start srp_daemon by default
> 
>     -vu
> 
>      > No one can help me? :(
>      >
>      > PN
>      >
>      >
>      > 2006/12/7, Lai Dragonfly <poknam at gmail.com
>     <mailto:poknam at gmail.com> <mailto:poknam at gmail.com
>     <mailto:poknam at gmail.com>>>:
>      >
>      >     Hi all,
>      >
>      >     i'm using CentOS 4.4 (kernel 2.6.9-42.ELsmp) with OFED-1.1 in
>      >     clients and
>      >     IBGD-1.8.2-srpt in targets.
>      >     i found that even i use "modprobe ib_srp" or set SRP_LOAD=yes in
>      >     openib.conf,
>      >     i could not found the SRP target.
>      >     until i execute "srp_daemon -e -o", i can see all the targets
>     appear
>      >     in /dev/sdX.
>      >
>      >     since i want to export the targets to other nodes,
>      >     any idea so that i can connect to the targets automatically
>     in each
>      >     reboot.
>      >     without typing "srp_daemon -e -o" each time?
>      >
>      >     thanks in advance.
>      >
>      >     PN
>      >
>      >
>      >
>      >
>     ------------------------------------------------------------------------
>      >
>      > _______________________________________________
>      > openib-general mailing list
>      > openib-general at openib.org <mailto:openib-general at openib.org>
>      > http://openib.org/mailman/listinfo/openib-general
>      >
>      > To unsubscribe, please visit
>     http://openib.org/mailman/listinfo/openib-general
> 
> 


From vuhuong at mellanox.com  Tue Dec 12 01:03:54 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 01:03:54 -0800
Subject: [openib-general] srp initiator device discovery
In-Reply-To: <1165899109.14308.9.camel@julia.et.endace.com>
References: <mailman.374.1165886944.18259.openib-general@openib.org>
	<1165899109.14308.9.camel@julia.et.endace.com>
Message-ID: <457E707A.4040802@mellanox.com>

How many cables did you connect from your host to the fabric?

If you have two cables connected (2 ports of the same hca, 
or one port on each of 2 hcas), then you have two paths to 
the same srp target. Each path will see the same number of 
luns of the srp target. You can use dm-multipath/multipath 
and access the luns/devices through /dev/mapper; this gives 
you fail-over/fail-back capability.

IBGD's srp target only works with scsi devices. It does not 
work with block devices (hdX, md, lvm volumes ...)

-vu

> Hi,
> 
>    I have srp initiator installed with OFED-1.1, and another machine
> with SRP target (IBGOLD). I started the srp daemon to discover the
> target devices, and then ran fdisk -l to see the list. The list (below)
> shows duplicate devices :-
> 
> Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes
> 255 heads, 63 sectors/track, 267349 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
> Disk /dev/sdb doesn't contain a valid partition table
> 
> Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes
> 255 heads, 63 sectors/track, 267349 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>    Device Boot      Start         End      Blocks   Id  System
> 
> Disk /dev/sdd: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdd1   *           1          13      104391   83  Linux
> /dev/sdd2              14       60801   488279610   8e  Linux LVM
> 
> Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
> 255 heads, 63 sectors/track, 267349 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
> Disk /dev/sde doesn't contain a valid partition table
> 
> Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes
> 255 heads, 63 sectors/track, 267349 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>    Device Boot      Start         End      Blocks   Id  System
> 
> Disk /dev/sdg: 500.1 GB, 500107862016 bytes
> 255 heads, 63 sectors/track, 60801 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sdg1   *           1          13      104391   83  Linux
> /dev/sdg2              14       60801   488279610   8e  Linux LVM
> 
> 
> 
> Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious).
> 
> I also tested the device discovery after creating an md device on the
> target side, and found that the initiator doesn't take into account the
> presence of an md device. Is this the expected behaviour ?
> 
> Thanks for your time!
> 
> Vishal
> 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From vuhuong at mellanox.com  Tue Dec 12 01:19:16 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 01:19:16 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457E069A.4020807@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com>
Message-ID: <457E7414.6040802@mellanox.com>

James,
   I hit another variation of the put_page problem. I just ran 
iozone with a 9 GB file size (both client and server machines 
have 8 GB of memory, dual woodcrest xeon cpus, 2.6.18.5 
kernel, nfsrdma release 7).

After this happened, other nfsrdma clients could still do I/O 
to the server.

-vu

> Hit *send* too soon - here is the objdump of swap.o
> 
> -vu
> 
> 
>> James Lentini wrote:
>>> A couple of questions Vu:
>>>
>>> What NFS-RDMA release are you using? This looks like release 7.
>>>
>>
>> Yes. I'm using release 7
>>
>>> Is this reproducible?
>>
>> I ran into it twice - I think that it may co-relate to openSM restart 
>> incident. I'll double check it and confirm
>>
>>
>>> What kernel version are you using?
>>
>> 2.6.18.5
>>
>>> What hardware is this on? It looks like x86-64 to me, which is fine. 
>>> I just want to be sure I know what I'm looking at. As many specifics 
>>> as possible is good (number of CPUs, hyperthreading, etc.)
>>>
>>
>> Dual woodcrest xeon based CPUs
>>
>>> Could you send the output of
>>> objdump -Slr /path/to/kernel/mm/swap.o
>>>
>>
>> I attached the objdump output here
>>
>>> Actually, just the put_page disassembly is all I want to see.
>>>
>>> Is there any more text available? Usually there is an explanation 
>>> given for an oops message (e.g. "Unable to handle kernel paging 
>>> request..").
>>>
>>
>> I did not see any oops text message. System was still responsive with 
>> ipoib ping or login
>>
>>
>>> I opened a bug at the NFS-RDMA SourceForge project to track this:
>>>
>>> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 
>>>
>>
>> thanks for your help,
>>
>> -vu
>>
>>> Thanks for reporting this.
>>> james
>>>
>>> On Fri, 8 Dec 2006, Vu Pham wrote:
>>>
>>>> Hi James,
>>>>   I got these errors in server's /var/log/messages and then the 
>>>> server stop
>>>> responding to login, I/O...; however, the server is still up, ipoib 
>>>> is still
>>>> working
>>>>
>>>>
>>>> Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
>>>> [<ffffffff8025dff7>] put_page+0x17/0x40
>>>> Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 
>>>> 00010246
>>>> Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 
>>>> 0000000000000001
>>>> RCX: 000000000003ffff
>>>> Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 
>>>> 0000000000000001
>>>> RDI: ffff8102274e92f8
>>>> Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 
>>>> 0000000000000034
>>>> R09: 0000000000000000
>>>> Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 
>>>> 0000000000000000
>>>> R12: ffff81020ef96800
>>>> Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 
>>>> 0000000000000000
>>>> R15: ffff8102053ee890
>>>> Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
>>>> GS:ffff81022066eb40(0000) knlGS:0000000000000000
>>>> Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>>> 000000008005003b
>>>> Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 
>>>> 000000021c22b000
>>>> CR4: 00000000000006e0
>>>> Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
>>>> ffff810219dde000, task ffff81020d87f0c0)
>>>> Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 
>>>> ffff81020ef96968
>>>> ffff81020ef96800 ffff81020ef96958
>>>> Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
>>>> ffffffff80424e05 0000000000000000
>>>> Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
>>>> ffffffff80239b90 ffff81020d87f0c0
>>>> Dec  8 06:38:21 ibd201 kernel: Call Trace:
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
>>>> :sunrpc:svc_rdma_put_context+0x37/0xd0
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
>>>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>>> schedule_timeout+0x95/0xb0
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] 
>>>> process_timeout+0x0/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
>>>> wait_for_completion_timeout+0xcd/0x150
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>> default_wake_function+0x0/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
>>>> :ib_mthca:mthca_cmd_post+0x232/0x260
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>> default_wake_function+0x0/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] 
>>>> __next_cpu+0x19/0x30
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
>>>> find_busiest_group+0x24e/0x6d0
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] 
>>>> thread_return+0x0/0xde
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
>>>> _spin_unlock_irqrestore+0x8/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
>>>> try_to_del_timer_sync+0x51/0x60
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] 
>>>> del_timer_sync+0xc/0x20
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>>> schedule_timeout+0x95/0xb0
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
>>>> :sunrpc:svc_recv+0x416/0x510
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>> default_wake_function+0x0/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>> default_wake_function+0x0/0x10
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>> :nfsd:nfsd+0x0/0x380
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] 
>>>> :nfsd:nfsd+0x111/0x380
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] child_rip+0xa/0x12
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>> :nfsd:nfsd+0x0/0x380
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>> :nfsd:nfsd+0x0/0x380
>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] child_rip+0x0/0x12
>>>> Dec  8 06:38:21 ibd201 kernel:
>>>> Dec  8 06:38:21 ibd201 kernel:
>>>> Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 
>>>> f0 ff 4f 08
>>>> 0f 94 c0 84 c0 74
>>>> Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] 
>>>> put_page+0x17/0x40
>>>> Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
>>>>
>>>> -vu
>>>>
>>
>>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages.202
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061212/2dbd0ff8/attachment.ksh>

From eeb at bartonsoftware.com  Tue Dec 12 01:37:47 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Tue, 12 Dec 2006 09:37:47 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <adamz5uynkh.fsf@cisco.com>
Message-ID: <099601c71dd1$2ed415d0$0281a8c0@ebpc>

>  > > No other kernel subsystem has one, so I don't think it's 
>  > > realistic to expect one for IB.
> 
>  > Don't you think it would be useful?  Even if only to make 
>  > API changes explicit?
> 
> Sure, I admit it would be useful for out-of-tree code.  But it would
> also be an unmaintainable mess to actually try and have a set of
> feature flags, so I don't think we can do it.

At the risk of flogging a dead horse - I was only thinking of a very simple
version number that incremented on change - something like
LINUX_VERSION_CODE?
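
To sketch the idea: an out-of-tree driver could then pick a code path at
compile time, much as it does today with LINUX_VERSION_CODE. The macro
IB_API_VERSION_CODE below is hypothetical (no such macro exists in the
kernel) and the value 3 is arbitrary.

```c
/*
 * Hypothetical: a plain counter, bumped by one whenever an exported
 * IB kernel API changes. Not a real kernel macro.
 */
#define IB_API_VERSION_CODE 3

/*
 * What an out-of-tree module might do: select the calling convention
 * that matches the API revision it is being built against.
 * Returns 1 when the newer code path is compiled in, 0 otherwise.
 */
static int ib_uses_new_api(void)
{
#if IB_API_VERSION_CODE >= 3
	return 1;	/* build against the changed API */
#else
	return 0;	/* fall back to the older API */
#endif
}
```

A single counter like this makes API changes explicit without the set of
per-feature flags that was objected to above.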

                Cheers,
                        Eric


From eitan at sw053.yok.mtl.com  Tue Dec 12 01:45:01 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Tue, 12 Dec 2006 11:45:01 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-12:normal completion
Message-ID: <200612120945.kBC9j1RK024188@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Mon_Dec_11_12:18:47_2006 a12f32 
ibutils rev = Mon_Dec_11_12:42:28_2006 2ba86a 
Total=242 Pass=241 Fail=1

Pass:
33 Stability IS1-16.topo
33 Pkey IS1-16.topo
33 OsmStress IS1-16.topo
33 Multicast IS1-16.topo
33 LidMgr IS1-16.topo
11 Stability IS3-loop.topo
11 Stability IS3-128.topo
11 Pkey IS3-128.topo
11 Multicast IS3-loop.topo
11 Multicast IS3-128.topo
11 LidMgr IS3-128.topo
10 OsmStress IS3-128.topo

Failures:
1 OsmStress IS3-128.topo


From mst at mellanox.co.il  Tue Dec 12 04:29:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 14:29:57 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061210225613.GF21155@sashak.voltaire.com>
References: <20061210225613.GF21155@sashak.voltaire.com>
Message-ID: <20061212122957.GC14622@mellanox.co.il>

> For me it is unclear yet how long we may need this - 1.1 still be in
> SVN yet, and 1.1 git branch is updated there.

By the way, one can't actually build OFED 1.1 userspace from git
because OFED also applies some patches after checking things out
from svn. They are here:
https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes

-- 
MST


From mst at mellanox.co.il  Tue Dec 12 05:42:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 15:42:35 +0200
Subject: [openib-general] openib-commits and git
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC83@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B2FC83@xmb-sjc-216.amer.cisco.com>
Message-ID: <20061212134235.GB26613@mellanox.co.il>

> > -----Original Message-----
> > From: openib-general-bounces at openib.org 
> > [mailto:openib-general-bounces at openib.org] On Behalf Of Hal Rosenstock
> > Sent: Monday, December 11, 2006 2:07 PM
> > To: openib-general at openib.org
> > Cc: OpenFabricsEWG
> > Subject: [openib-general] openib-commits and git
> > 
> > Hi,
> > 
> > Some have requested the equivalent of what we had with svn with
> > openib-commits. 
> > 
> > The first question is what capabilities in this are desired. We don't
> > want to spend a lot of engineering time on this but it would 
> > be good to
> > know. Is a notification of the commit/push with the log sufficient or
> > does it need to look more what svn provided (and include the changes
> > too) ?
> > 
> > The other question is a policy one: Is it a reasonable 
> > default to enable
> > this for all the developers ? Do any of the developers object to this
> > policy ?
> 
> Quoting r. Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>:
> Subject: Re: openib-commits and git
> 
> I would like to see diffs, either inline in the commit email or via a
> URL I can click on.

In that case, why bother with email at all?
gitweb already has RSS support, which Sasha has activated.

Look at any git tree in gitweb (e.g. http://staging.openfabrics.org/git/)
and you'll see an RSS feed URL.

This can be fed to any RSS aggregator, including the firefox live bookmarks one.

-- 
MST


From halr at voltaire.com  Tue Dec 12 05:53:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Dec 2006 08:53:08 -0500
Subject: [openib-general] OpenSM Issues of the last couple days
In-Reply-To: <457E55D7.5070603@mellanox.co.il>
References: <1165531651.25587.204056.camel@hal.voltaire.com>
	<457995E5.40303@mellanox.co.il>
	<1165617195.26559.4435.camel@hal.voltaire.com>
	<457AC99E.8050402@mellanox.co.il>
	<1165870759.21606.18477.camel@hal.voltaire.com>
	<457E55D7.5070603@mellanox.co.il>
Message-ID: <1165931584.28709.4614.camel@hal.voltaire.com>

On Tue, 2006-12-12 at 02:10, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Sat, 2006-12-09 at 09:35, Eitan Zahavi wrote:
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> On Fri, 2006-12-08 at 11:42, Eitan Zahavi wrote: 
> >>>   
> >>>       
> >>>> Hal Rosenstock wrote:
> >>>>     
> >>>>         
> >>>>> Hi Eitan,
> >>>>>
> >>>>> Just wanted to close the loop on the OpenSM issues of the last couple
> >>>>> days.
> >>>>>
> >>>>> 1. When can you supply an OpenSM verbose log for the InformInfo
> >>>>> subscribe problem you reported earlier today ? Failing that, I don't
> >>>>> know how to reproduce this.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> Attached
> >>>>     
> >>>>         
> >> I will need to look into it in greater details. Might be a simulator flow issue. But I am not sure.
> >>     
> >
> > Any idea on when you will have a chance to look into this ?
> >   
> Maybe by the weekend.
> >   
> >>>>> 4. I encourage you to look at and comment on the OpenSM patches rather
> >>>>> than waiting for them to be in the tree.
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> I am sure you did not mean to, but now I have to admit my limited skills 
> >>>> in catching bugs by reading patches :-( .
> >>>>     
> >>>>         
> >>> Not just read, but they are there to try out as well.
> >>>   
> >>>       
> >> I will need an automatic flow for that sake. I can not keep up with the 
> >> amount of patches manually.
> >> But I do not know how to automatically convert the mails into patches 
> >> into a tree.
> >>     
> >>> You could try out the patches and do the same thing before they are
> >>> committed.
> >>>
> >>>   
> >>>       
> >> I have automation based on the committed tree that pull it (git trem) , 
> >> compile and run regression.
> >> Actually this is how all other code is handled too.
> >>     
> >
> > Are you referring to OFED ?
> >   
> No the current GIT tree under 
> git://staging.openfabrics.org/~halr/management.git

OK but I was commenting on what you said about "all other code" being
handled this way.

> > In the case of OFED, where do those "special" trees/branches come from ?
> >   
> No. I think we are having some miss-understanding:
> I am not proposing using a pre-commit branch.
> But if there is no such branch I can not do pre-commit testing.

Understood.

> I think it is fine to have post-commit bug reports. No big deal.

Right; rather than "pre trunk commit" ones. If it breaks, we try to fix
it as fast as possible or perhaps even back out the change if there is
some critical reason to do so.

> We branch when we go to an OFED release.

Yes.

> Then I have two regressions run every night. One on the trunk and one on 
> the OFED branch.
> This is how things were for OFED1.1 and OFED1.0.

That would be great.

> It is your call if we need to have a "stable" trunk and experimental  
> branch such that I will be able to test pre-trunk patches.

I'll consider this based on how stable or unstable the trunk is as we go
forward but still prefer to not have to maintain another branch (for
obvious reasons).

> What I will not be able to do is to have an automatic system to select 
> which patches to include in the regression, etc etc.

OK.

-- Hal

> Eitan
> > -- Hal
> >
> >
> >
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> >
> > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >   
> 


From sashak at voltaire.com  Tue Dec 12 06:50:31 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 Dec 2006 16:50:31 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212054223.GB11064@mellanox.co.il>
References: <20061210233657.GB32199@sashak.voltaire.com>
	<20061211054539.GL9205@mellanox.co.il>
	<20061212000911.GJ25052@sashak.voltaire.com>
	<20061212054223.GB11064@mellanox.co.il>
Message-ID: <20061212145031.GE10901@sashak.voltaire.com>

On 07:42 Tue 12 Dec     , Michael S. Tsirkin wrote:
> Sasha, one small request: could you please fix description for your trees?
> It should hopefully say something like "mirror of svn for <path>".

Yes, sure.

Sasha


From sashak at voltaire.com  Tue Dec 12 07:07:03 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 12 Dec 2006 17:07:03 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212055841.GD11064@mellanox.co.il>
References: <20061210233657.GB32199@sashak.voltaire.com>
	<20061211054539.GL9205@mellanox.co.il>
	<20061212000911.GJ25052@sashak.voltaire.com>
	<20061212055841.GD11064@mellanox.co.il>
Message-ID: <20061212150703.GF10901@sashak.voltaire.com>

On 07:58 Tue 12 Dec     , Michael S. Tsirkin wrote:
> > > Finally, it wastes space.
> > 
> > 'git-clone -s' helps to save space.
> 
> BTW, be careful with that: it seems clone -s might lose your data if the repository
> you clone from removes some heads and prunes history.

It is hard to lose data fatally this way. That happens only when the
origin repo is removed completely (then you can lose the old part of the
history). Use 'git-clone -l' if unsure.

And this is still a theoretical discussion - the largest userspace tree on
OFA takes 10MB of disk space.

Sasha


From mst at mellanox.co.il  Tue Dec 12 07:10:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 17:10:16 +0200
Subject: [openib-general] [PATCH] mthca: make all MRs accessible for FMR
 mapping on 64 bit kernels
Message-ID: <20061212151016.GI26613@mellanox.co.il>

For Tavor, we currently reserve separate MPT and MTT space for FMRs to avoid
abusing the vmalloc space on 32 bit kernels. No such problem exists
on 64 bit kernels so let's not do it there.

This way we have a shared pool for MR and FMR resources, used on demand.
This will also make it possible to write MTTs for regular regions directly
from the driver.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

This patch passed verbs and SRP testing here. Please consider this for 2.6.20.
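
For readers skimming the diff, the policy change in mthca_profile.c boils
down to the sketch below. It is a standalone illustration, not driver code:
the function name and its parameters are made up here, standing in for
mthca_is_memfree(dev), BITS_PER_LONG and request->fmr_reserved_mtts.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the decision made in mthca_make_profile():
 * how many MTT segments to set aside exclusively for FMRs.
 */
static int fmr_reserved_mtts(bool memfree, int bits_per_long, int requested)
{
	/* Mem-free HCAs and 64 bit kernels use one shared MR/FMR pool. */
	if (memfree || bits_per_long == 64)
		return 0;

	/* Tavor mode on a 32 bit kernel keeps the separate reservation. */
	return requested;
}
```

With a reservation of 0 the whole MPT/MTT range is mapped, which is what
lets regular MRs use the same entries as FMRs.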

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -762,7 +762,7 @@ void mthca_arbel_fmr_unmap(struct mthca_
 int __devinit mthca_init_mr_table(struct mthca_dev *dev)
 {
 	unsigned long addr;
-	int err, i;
+	int mpts, mtts, err, i;
 
 	err = mthca_alloc_init(&dev->mr_table.mpt_alloc,
 			       dev->limits.num_mpts,
@@ -796,13 +796,21 @@ int __devinit mthca_init_mr_table(struct
 			err = -EINVAL;
 			goto err_fmr_mpt;
 		}
+		mpts = mtts = 1 << i;
+	} else {
+		mtts = dev->limits.num_mtt_segs;
+		mpts = dev->limits.num_mpts;
+	}
+
+	if (!mthca_is_memfree(dev) &&
+	    (dev->mthca_flags & MTHCA_FLAG_FMR)) {
 
 		addr = pci_resource_start(dev->pdev, 4) +
 			((pci_resource_len(dev->pdev, 4) - 1) &
 			 dev->mr_table.mpt_base);
 
 		dev->mr_table.tavor_fmr.mpt_base =
-			ioremap(addr, (1 << i) * sizeof(struct mthca_mpt_entry));
+			ioremap(addr, mpts * sizeof(struct mthca_mpt_entry));
 
 		if (!dev->mr_table.tavor_fmr.mpt_base) {
 			mthca_warn(dev, "MPT ioremap for FMR failed.\n");
@@ -815,19 +823,21 @@ int __devinit mthca_init_mr_table(struct
 			 dev->mr_table.mtt_base);
 
 		dev->mr_table.tavor_fmr.mtt_base =
-			ioremap(addr, (1 << i) * MTHCA_MTT_SEG_SIZE);
+			ioremap(addr, mtts * MTHCA_MTT_SEG_SIZE);
 		if (!dev->mr_table.tavor_fmr.mtt_base) {
 			mthca_warn(dev, "MTT ioremap for FMR failed.\n");
 			err = -ENOMEM;
 			goto err_fmr_mtt;
 		}
+	}
 
-		err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, i);
+	if (dev->limits.fmr_reserved_mtts) {
+		err = mthca_buddy_init(&dev->mr_table.tavor_fmr.mtt_buddy, fls(mtts - 1));
 		if (err)
 			goto err_fmr_mtt_buddy;
 
 		/* Prevent regular MRs from using FMR keys */
-		err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, i);
+		err = mthca_buddy_alloc(&dev->mr_table.mtt_buddy, fls(mtts - 1));
 		if (err)
 			goto err_reserve_fmr;
 
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_profile.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_profile.c
@@ -277,7 +277,7 @@ u64 mthca_make_profile(struct mthca_dev 
 	 * out of the MR pool. They don't use additional memory, but
 	 * we assign them as part of the HCA profile anyway.
 	 */
-	if (mthca_is_memfree(dev))
+	if (mthca_is_memfree(dev) || BITS_PER_LONG == 64)
 		dev->limits.fmr_reserved_mtts = 0;
 	else
 		dev->limits.fmr_reserved_mtts = request->fmr_reserved_mtts;
-- 
MST


From mst at mellanox.co.il  Tue Dec 12 07:10:39 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 17:10:39 +0200
Subject: [openib-general] [PATCH] mthca: speed up memory registration by
 filling MTTs directly
Message-ID: <20061212151039.GJ26613@mellanox.co.il>

Speed up memory registration by filling in MTTs directly.  This reduces the
number of FW commands needed to register an MR by at least a factor of 2.  This
applies to all memfree cards, and to Tavor mode on 64-bit systems with the patch
I posted earlier.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

This passed verbs testing here, please consider for 2.6.20.

Note that this is *not* FMR - this is regular IB memory registration, since
MPTs are still updated using a FW command.
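
For readers unfamiliar with the idea, here is a minimal userspace sketch of
what "filling MTTs directly" amounts to: each entry is a DMA address with a
"present" flag OR-ed in, stored big-endian. This is only an illustration, not
the driver code. The names fill_mtts, to_be64 and MTT_FLAG_PRESENT are made up
for the example, the flag value 1 is an assumption, and plain memory stands in
for the real MTT table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed value for illustration; the driver has its own MTHCA_MTT_FLAG_PRESENT. */
#define MTT_FLAG_PRESENT 1ULL

/* Portable stand-in for the kernel's cpu_to_be64(). */
static uint64_t to_be64(uint64_t v)
{
	unsigned char b[8];
	uint64_t out;
	int i;

	for (i = 0; i < 8; ++i)
		b[i] = (unsigned char)(v >> (56 - 8 * i));
	memcpy(&out, b, sizeof out);
	return out;
}

/* Fill table[0..n-1] with big-endian "address | present" entries. */
static void fill_mtts(uint64_t *table, const uint64_t *dma_addrs, size_t n)
{
	size_t i;

	assert(table && dma_addrs);
	for (i = 0; i < n; ++i)
		table[i] = to_be64(dma_addrs[i] | MTT_FLAG_PRESENT);
}
```

Each entry here costs one memory store rather than a share of a mailbox
command, which is where the saving described above would come from.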

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de
 int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd);
 void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd);
 
+int mthca_write_mtt_size(struct mthca_dev *dev);
+
 struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size);
 void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt);
 int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de
 	kfree(mtt);
 }
 
-int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
-		    int start_index, u64 *buffer_list, int list_len)
+static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			     int start_index, u64 *buffer_list, int list_len)
 {
 	struct mthca_mailbox *mailbox;
 	__be64 *mtt_entry;
@@ -296,6 +296,84 @@ out:
 	return err;
 }
 
+void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	u64 __iomem *mtts;
+	u32 mtt_seg;
+	int i;
+
+	mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE;
+	mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64);
+	for (i = 0; i < list_len; ++i) {
+		__be64 mtt_entry = cpu_to_be64(buffer_list[i] |
+					       MTHCA_MTT_FLAG_PRESENT);
+		mthca_write64_raw(mtt_entry, mtts + i);
+	}
+}
+
+void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	__be64 *mtts;
+	int i;
+	int s = start_index * sizeof (u64);
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE);
+	/* Require full segments */
+	BUG_ON(s % MTHCA_MTT_SEG_SIZE);
+
+	mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg +
+				s / MTHCA_MTT_SEG_SIZE);
+
+	BUG_ON(!mtts);
+
+	for (i = 0; i < list_len; ++i)
+		mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT);
+}
+
+int mthca_write_mtt_size(struct mthca_dev *dev)
+{
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		/*
+		 * Be friendly to WRITE_MTT command
+		 * and leave two empty slots for the
+		 * index and reserved fields of the
+		 * mailbox.
+		 */
+		return PAGE_SIZE / sizeof (u64) - 2;
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ffffff;
+}
+
+int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+		    int start_index, u64 *buffer_list, int list_len)
+{
+	int size = mthca_write_mtt_size(dev);
+	int chunk;
+
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len);
+
+	while (list_len > 0) {
+		chunk = min(size, list_len);
+		if (mthca_is_memfree(dev))
+			mthca_arbel_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+		else
+			mthca_tavor_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+
+		list_len    -= chunk;
+		start_index += chunk;
+		buffer_list += chunk;
+	}
+
+	return 0;
+}
+
 static inline u32 tavor_hw_index_to_key(u32 ind)
 {
 	return ind;
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s
 	int shift, n, len;
 	int i, j, k;
 	int err = 0;
+	int write_mtt_size;
 
 	shift = ffs(region->page_size) - 1;
 
@@ -1040,6 +1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s
 
 	i = n = 0;
 
+	write_mtt_size = min(mthca_write_mtt_size(dev), (int) (PAGE_SIZE / sizeof *pages));
+
 	list_for_each_entry(chunk, &region->chunk_list, list)
 		for (j = 0; j < chunk->nmap; ++j) {
 			len = sg_dma_len(&chunk->page_list[j]) >> shift;
@@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s
 				pages[i++] = sg_dma_address(&chunk->page_list[j]) +
 					region->page_size * k;
 				/*
-				 * Be friendly to WRITE_MTT command
-				 * and leave two empty slots for the
-				 * index and reserved fields of the
-				 * mailbox.
+				 * Be friendly to write_mtt and pass it chunks
+				 * of appropriate size.
 				 */
-				if (i == PAGE_SIZE / sizeof (u64) - 2) {
-					err = mthca_write_mtt(dev, mr->mtt,
-							      n, pages, i);
+				if (i == write_mtt_size) {
+					err = mthca_write_mtt(dev, mr->mtt, n, pages, i);
 					if (err)
 						goto mtt_done;
 					n += i;

-- 
MST
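
The chunking done by mthca_write_mtt() and the "all MTTs must fit in the same
page" requirement for Arbel can be sketched in plain userspace C as follows.
This is a rough illustration only: SKETCH_PAGE_SIZE, same_page,
write_in_chunks and count_entries are hypothetical names, and the 4096-byte
page size is an assumption. The page check counts the last byte actually
written, so a chunk that ends exactly on a page boundary is accepted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

/* Nonzero if 8-byte slots [start, start + len) all lie in one page. */
static int same_page(size_t start, size_t len)
{
	size_t first = start * sizeof(uint64_t);
	size_t last;

	assert(len > 0);
	last = first + len * sizeof(uint64_t) - 1;	/* last byte written */
	return first / SKETCH_PAGE_SIZE == last / SKETCH_PAGE_SIZE;
}

/* Hypothetical callback standing in for the per-chunk write routine. */
typedef void (*write_chunk_fn)(size_t start, const uint64_t *list,
			       size_t len, void *ctx);

/* Example callback: just counts how many entries it was handed. */
static void count_entries(size_t start, const uint64_t *list,
			  size_t len, void *ctx)
{
	(void)start;
	(void)list;
	*(size_t *)ctx += len;
}

/* Split a list into pieces of at most max_chunk entries; returns the call count. */
static size_t write_in_chunks(size_t start, const uint64_t *list, size_t len,
			      size_t max_chunk, write_chunk_fn fn, void *ctx)
{
	size_t calls = 0;

	assert(max_chunk > 0);
	while (len > 0) {
		size_t chunk = len < max_chunk ? len : max_chunk;

		/* Pass the chunk length, not the remaining length. */
		fn(start, list, chunk, ctx);
		start += chunk;
		list  += chunk;
		len   -= chunk;
		++calls;
	}
	return calls;
}
```

With a 512-entry limit, a 1200-entry list would be written as chunks of 512,
512 and 176 entries, and the callback sees each entry exactly once.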


From vlad at dev.mellanox.co.il  Tue Dec 12 07:53:58 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 12 Dec 2006 17:53:58 +0200
Subject: [openib-general] Daily build of userspace and kernel packages for
	OFED-1.2
Message-ID: <457ED096.1020703@dev.mellanox.co.il>

Hi,
The userspace and kernel space packages for OFED-1.2 developers can be 
downloaded from: http://staging.openfabrics.org/builds.
User: http://staging.openfabrics.org/builds/ofa_1_2_user/
Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/

The last_stable.tgz link points to the latest package that passed 
compilation on the build machine (staging.openfabrics.org, OS Ubuntu 
6.06.1 with kernel 2.6.15-23-server).

To install user/kernel:
Download and unpack the tgz file
Run
    ./configure PARAMETERS (see configure --help)
    make
    make install

User space packages from git:

    libibverbs_git="git://git.kernel.org/pub/scm/libs/infiniband/libibverbs.git"
    libmthca_git="git://git.kernel.org/pub/scm/libs/infiniband/libmthca.git"
    libehca_git="git://staging.openfabrics.org/~hnguyen/libehca.git"
    libipathverbs_git="git://staging.openfabrics.org/~bos/libipathverbs.git"
    tvflash_git="git://staging.openfabrics.org/~rdreier/tvflash.git"
    libibcm_git="git://staging.openfabrics.org/~shefty/libibcm.git"
    libsdp_git="git://staging.openfabrics.org/~eitan/libsdp.git"
    mstflint_git="git://staging.openfabrics.org/~mst/mstflint.git"
    perftest_git="git://staging.openfabrics.org/~mst/perftest.git"
    srptools_git="git://staging.openfabrics.org/~ishai/srptools.git"
    ipoibtools_git="git://staging.openfabrics.org/~vlad/ipoibtools.git"
    librdmacm_git="git://staging.openfabrics.org/~shefty/librdmacm.git"
    dapl_git="git://staging.openfabrics.org/~ardavis/dapl.git"
    imgen_git="git://staging.openfabrics.org/~mst/imgen.git"
    management_git="git://staging.openfabrics.org/~halr/management.git"
    scripts_git="git://staging.openfabrics.org/~vlad/ofascripts.git"

Kernel space:
       git://staging.openfabrics.org/~vlad/ofed_1_2

I'd be glad to get comments.

Regards,
Vladimir


From rdreier at cisco.com  Tue Dec 12 08:42:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 08:42:58 -0800
Subject: [openib-general] version #defines for the kernel
References: <099601c71dd1$2ed415d0$0281a8c0@ebpc>
Message-ID: <adaejr5xdyl.fsf@cisco.com>

 > At the risk of flogging a dead horse - I was only thinking of a very simple
 > version number that incremented on change - something like
 > LINUX_VERSION_CODE?

In that case what do you expect to see in a kernel with backported
drivers, that has backported some changes but not others?

 - R.
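
To make the problem concrete, here is a small sketch of how a
LINUX_VERSION_CODE-style number behaves. SKETCH_VERSION uses the same packing
as the kernel's KERNEL_VERSION(a, b, c) macro in kernels of this era; the name
itself and the at_least helper are made up for the example.

```c
#include <assert.h>

/* Same packing as KERNEL_VERSION(a, b, c): one integer that only ever grows. */
#define SKETCH_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Nonzero if the running code is at least version a.b.c. */
static int at_least(unsigned int code, int a, int b, int c)
{
	return code >= (unsigned int)SKETCH_VERSION(a, b, c);
}
```

A single comparison like this answers "has everything up to version X been
merged?". A kernel that has backported some changes but not others has no
single value of the number that gives the right answer for every change.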


From rdreier at cisco.com  Tue Dec 12 09:07:58 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 09:07:58 -0800
Subject: [openib-general] [PATCH 1/2 vex branch] IB/VNIC Fix failover
 from secondary path back to primary path
References: <45784230.28135.250C4227@ramachandra.kuchimanchi.qlogic.com>
	<C07C40DB2364324799506DE8FF12F8D81A125F@EPEXCH1.qlogic.org>
Message-ID: <ada8xhdxcsx.fsf@cisco.com>

 > Did you get a chance to look at these patches ?

Not yet ... I will just apply them to the vex branch though.

 - R.


From vuhuong at mellanox.com  Tue Dec 12 09:46:47 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 09:46:47 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457E7414.6040802@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com>
	<457E7414.6040802@mellanox.com>
Message-ID: <457EEB07.8040904@mellanox.com>

James,
   Here is another variation of the put_page problem. I stopped 
doing I/O and accessing the mounted directory last 
night. This morning I just tried to *ls* the mounted 
directory and got this error.

-vu

> James,
>   I hit another variation of the put_page problem. I just ran iozone with a 9 
> GB file size (both client and server machines have 8 GB of memory, dual 
> Woodcrest Xeon CPUs, 2.6.18.5 kernel, nfsrdma release 7).
> 
> After this happened, other nfsrdma clients could still do I/O to the server.
> 
> -vu
> 
>> Hit *send* too soon - here is the objdump of swap.o
>>
>> -vu
>>
>>
>>> James Lentini wrote:
>>>> A couple of questions Vu:
>>>>
>>>> What NFS-RDMA release are you using? This looks like release 7.
>>>>
>>>
>>> Yes. I'm using release 7
>>>
>>>> Is this reproducible?
>>>
>>> I ran into it twice. I think it may correlate with an OpenSM restart 
>>> incident. I'll double-check and confirm.
>>>
>>>
>>>> What kernel version are you using?
>>>
>>> 2.6.18.5
>>>
>>>> What hardware is this on? It looks like x86-64 to me, which is fine. 
>>>> I just want to be sure I know what I'm looking at. As many specifics 
>>>> as possible is good (number of CPUs, hyperthreading, etc.)
>>>>
>>>
>>> Dual woodcrest xeon based CPUs
>>>
>>>> Could you send the output of
>>>> objdump -Slr /path/to/kernel/mm/swap.o
>>>>
>>>
>>> I attached the objdump output here
>>>
>>>> Actually, just the put_page disassembly is all I want to see.
>>>>
>>>> Is there any more text available? Usually there is an explanation 
>>>> given for an oops message (e.g. "Unable to handle kernel paging 
>>>> request..").
>>>>
>>>
>>> I did not see any oops text message. System was still responsive with 
>>> ipoib ping or login
>>>
>>>
>>>> I opened a bug at the NFS-RDMA SourceForge project to track this:
>>>>
>>>> http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 
>>>>
>>>
>>> thanks for your help,
>>>
>>> -vu
>>>
>>>> Thanks for reporting this.
>>>> james
>>>>
>>>> On Fri, 8 Dec 2006, Vu Pham wrote:
>>>>
>>>>> Hi James,
>>>>>   I got these errors in the server's /var/log/messages, and then the 
>>>>> server stopped responding to login and I/O; however, the server is 
>>>>> still up and ipoib is still working.
>>>>>
>>>>>
>>>>> Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
>>>>> [<ffffffff8025dff7>] put_page+0x17/0x40
>>>>> Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS: 
>>>>> 00010246
>>>>> Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX: 
>>>>> 0000000000000001
>>>>> RCX: 000000000003ffff
>>>>> Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI: 
>>>>> 0000000000000001
>>>>> RDI: ffff8102274e92f8
>>>>> Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08: 
>>>>> 0000000000000034
>>>>> R09: 0000000000000000
>>>>> Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11: 
>>>>> 0000000000000000
>>>>> R12: ffff81020ef96800
>>>>> Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14: 
>>>>> 0000000000000000
>>>>> R15: ffff8102053ee890
>>>>> Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
>>>>> GS:ffff81022066eb40(0000) knlGS:0000000000000000
>>>>> Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>>>> 000000008005003b
>>>>> Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3: 
>>>>> 000000021c22b000
>>>>> CR4: 00000000000006e0
>>>>> Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
>>>>> ffff810219dde000, task ffff81020d87f0c0)
>>>>> Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547 
>>>>> ffff81020ef96968
>>>>> ffff81020ef96800 ffff81020ef96958
>>>>> Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
>>>>> ffffffff80424e05 0000000000000000
>>>>> Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
>>>>> ffffffff80239b90 ffff81020d87f0c0
>>>>> Dec  8 06:38:21 ibd201 kernel: Call Trace:
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
>>>>> :sunrpc:svc_rdma_put_context+0x37/0xd0
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
>>>>> :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>>>> schedule_timeout+0x95/0xb0
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>] 
>>>>> process_timeout+0x0/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
>>>>> wait_for_completion_timeout+0xcd/0x150
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>>> default_wake_function+0x0/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
>>>>> :ib_mthca:mthca_cmd_post+0x232/0x260
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>>> default_wake_function+0x0/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>] 
>>>>> __next_cpu+0x19/0x30
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
>>>>> find_busiest_group+0x24e/0x6d0
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>] 
>>>>> thread_return+0x0/0xde
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
>>>>> _spin_unlock_irqrestore+0x8/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
>>>>> try_to_del_timer_sync+0x51/0x60
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>] 
>>>>> del_timer_sync+0xc/0x20
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
>>>>> schedule_timeout+0x95/0xb0
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
>>>>> :sunrpc:svc_recv+0x416/0x510
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>>> default_wake_function+0x0/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
>>>>> default_wake_function+0x0/0x10
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>>> :nfsd:nfsd+0x0/0x380
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>] 
>>>>> :nfsd:nfsd+0x111/0x380
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>] 
>>>>> child_rip+0xa/0x12
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>>> :nfsd:nfsd+0x0/0x380
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>] 
>>>>> :nfsd:nfsd+0x0/0x380
>>>>> Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>] 
>>>>> child_rip+0x0/0x12
>>>>> Dec  8 06:38:21 ibd201 kernel:
>>>>> Dec  8 06:38:21 ibd201 kernel:
>>>>> Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01 
>>>>> f0 ff 4f 08
>>>>> 0f 94 c0 84 c0 74
>>>>> Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>] 
>>>>> put_page+0x17/0x40
>>>>> Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
>>>>>
>>>>> -vu
>>>>>
>>>
>>>
> 
> ------------------------------------------------------------------------
> 
> <snip>
> 
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596b800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596bc00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17c00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7dec00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39800, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39c00, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf000, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please bite here ] ---------
> Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300
> Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP 
> Dec 12 01:09:30 ibd202 kernel: CPU 1 
> Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca shpchp ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 floppy ext3 jbd megaraid_sas sd_mod scsi_mod
> Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1
> Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[<ffffffff8025892b>]  [<ffffffff8025892b>] put_page+0x13/0x2e
> Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08  EFLAGS: 00010246
> Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000006a53
> Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001 RDI: ffff81024fc3dec0
> Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001 R09: 0000000000000000
> Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8 R12: ffff810240fb3800
> Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400 R15: 00000000000dbba0
> Dec 12 01:09:30 ibd202 kernel: FS:  00002ad030296b00(0000) GS:ffff81024688eac0(0000) knlGS:0000000000000000
> Dec 12 01:09:30 ibd202 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000 CR4: 00000000000006e0
> Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo ffff81023fd10000, task ffff810246562840)
> Dec 12 01:09:30 ibd202 kernel: Stack:  ffffffff8817b2fb ffff810240fb39b8 0000000000000000 ffff81024172c5b0
> Dec 12 01:09:30 ibd202 kernel:  ffffffff8817ec67 ffff81023cda7000 ffffffff8817b2a8 0000000000000000
> Dec 12 01:09:30 ibd202 kernel:  ffff81023fd11ca0 ffff81023fd11b80 0000000000000001 ffff81023cda7000
> Dec 12 01:09:30 ibd202 kernel: Call Trace:
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2fb>] :sunrpc:svc_rdma_put_context+0x37/0xb5
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817ec67>] :sunrpc:svc_rdma_recvfrom+0x58f/0x1150
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2a8>] :sunrpc:svc_rdma_get_context+0x10c/0x128
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817d5b8>] :sunrpc:send_write+0x200/0x22c
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80254954>] generic_file_readv+0x8e/0xa7
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8025ba92>] zone_statistics+0x40/0x70
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80224401>] find_busiest_group+0x21f/0x66f
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a2e9>] _spin_unlock_irq+0x6/0xa
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff804285a3>] thread_return+0x64/0xec
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a259>] _spin_lock_irqsave+0x9/0xe
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233574>] lock_timer_base+0x1b/0x3c
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233776>] try_to_del_timer_sync+0x4a/0x51
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233789>] del_timer_sync+0xc/0x16
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80428f6a>] schedule_timeout+0x92/0xad
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88174070>] :sunrpc:svc_recv+0x3c5/0x4be
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>] default_wake_function+0x0/0xe
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>] default_wake_function+0x0/0xe
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88204407>] :nfsd:nfsd+0x10d/0x359
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4ac>] child_rip+0xa/0x12
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4a2>] child_rip+0x0/0x12
> Dec 12 01:09:30 ibd202 kernel: 
> Dec 12 01:09:30 ibd202 kernel: 
> Dec 12 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f 94 c0 84 c0 74 
> Dec 12 01:09:30 ibd202 kernel: RIP  [<ffffffff8025892b>] put_page+0x13/0x2e
> Dec 12 01:09:30 ibd202 kernel:  RSP <ffff81023fd11b08>
> Dec 12 01:09:30 ibd202 kernel:  <4>nfsd: terminating on error 22
> 
> 

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: messages.202.1
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061212/a4e6e983/attachment.ksh>

From ralph.campbell at qlogic.com  Tue Dec 12 10:22:58 2006
From: ralph.campbell at qlogic.com (Ralph Campbell)
Date: Tue, 12 Dec 2006 10:22:58 -0800
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <457E6DAE.3040206@voltaire.com>
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
	<50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
	<adar6v6ynmy.fsf@cisco.com> <457E6DAE.3040206@voltaire.com>
Message-ID: <1165947778.14800.315.camel@brick.pathscale.com>

On Tue, 2006-12-12 at 10:51 +0200, Or Gerlitz wrote:
> Roland Dreier wrote:
> >  > I would like to see this last set of patches integrated as is.
> >  > I would like to get more experience with the current implementation
> >  > before extending it to support other configurations.
> > 
> > Yeah, let's go with that.  Since ipath depends on 64BIT in Kconfig
> > anyway I think this is OK for now.
> 
> This design of ib_dma_map_single, ib_sg_dma_address etc returning u64 
> instead of dma_addr_t causes the resulted patch to the IB ULPs to be 
> quite big.

I think it was you who pointed out that dma_addr_t is
32 bits on sparc64.  Did you have a different solution
in mind?
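
The width concern can be illustrated with a small sketch. It assumes only what is said above (a 32-bit dma_addr_t on sparc64) and uses a hypothetical 64-bit address chosen for illustration; truncate_to_width is not a kernel function.

```python
def truncate_to_width(addr: int, bits: int) -> int:
    """Keep only the low `bits` bits, as storing into a narrower integer type would."""
    return addr & ((1 << bits) - 1)


# Hypothetical 64-bit address, chosen only for illustration.
addr = 0xFFFF81023E4CF400

as_u64 = truncate_to_width(addr, 64)  # what a u64 return value can carry
as_u32 = truncate_to_width(addr, 32)  # what a 32-bit type would keep

print(hex(as_u64))
print(hex(as_u32))
print(as_u64 == addr, as_u32 == addr)
```

If the mapping functions ever need to hand back a value wider than 32 bits, a u64 return type preserves it while a 32-bit type silently drops the upper half.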

> Have you tested any dma_map single (eg IPoIB) and sg (eg SRP or iSER) 
> consumer with this code?

Yes.


From michael.arndt at informatik.tu-chemnitz.de  Tue Dec 12 10:21:00 2006
From: michael.arndt at informatik.tu-chemnitz.de (Michael Arndt)
Date: Tue, 12 Dec 2006 19:21:00 +0100
Subject: [openib-general] mad_agents
Message-ID: <000e01c71e1a$46939ad0$21606d86@one7>

Hi,

The following statements about functions and modules refer to the files
mad.c, agent.c and user_mad.c.

During the initialisation of the mad module, the function ib_agent_port_open
is called (ib_mad_init_device -> ib_mad_port_open). At this point an agent is
registered (ib_register_mad_agent) without a MAD registration request being
supplied. So my question is: what is this agent for?

And is it right that the agent registered by the umad module
(ib_umad_ioctl -> ib_umad_reg_agent -> ib_register_mad_agent) gets all the
SMP packets from the device and passes them to the SM (via read and the
file descriptor)?

What about the SMA? Where are the SMPs filtered between the SMA and the SM?

I would also like to say that it would be really nice if there were some
papers, diagrams, graphics or anything else that explain how the whole
openib system works. The source code as the only reference isn't really
helpful for new developers.

Thanks Michael 


From rdreier at cisco.com  Tue Dec 12 10:30:49 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 10:30:49 -0800
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <457E6DAE.3040206@voltaire.com> (Or Gerlitz's message of
	"Tue, 12 Dec 2006 10:51:58 +0200")
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
	<50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
	<adar6v6ynmy.fsf@cisco.com> <457E6DAE.3040206@voltaire.com>
Message-ID: <adavekhufty.fsf@cisco.com>

 > This design of ib_dma_map_single, ib_sg_dma_address etc returning u64
 > instead of dma_addr_t causes the resulted patch to the IB ULPs to be
 > quite big.

Yes, there are actually some bugs introduced (basically pci_unmap_addr
et al can no longer be used).

I'll fix it up and test before merging.

 - R.


From adit.262 at gmail.com  Tue Dec 12 10:45:31 2006
From: adit.262 at gmail.com (Adit Ranadive)
Date: Tue, 12 Dec 2006 13:45:31 -0500
Subject: [openib-general] QoS configuration using opensm
Message-ID: <d2ad857f0612121045p272124e0kebe9a16a60d711ae@mail.gmail.com>

Hi,

I'm trying to establish some QoS parameters for allowing apps to
communicate using different service levels.

Currently my opensm.opts looks like this:

# QoS default options
qos_max_vls 15
qos_high_limit 0
qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_vlarb_low 0:4,1:100,2:100,3:100,4:100,5:100,6:100,7:100,8:100,9:100,10:100,11:100,12:100,13:4,14:4
qos_sl2vl 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7

# QoS CA options
qos_ca_max_vls 15
qos_ca_high_limit 0
qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_ca_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

I'm not sure which options to modify: QoS default or QoS CA? Should both
tables have the same values?
My setup includes no switch, just two machines connected back to back
using an IB cable.

Since I'm mapping service level 7 to all VLs, should all apps using sl=7
receive equal bandwidth? Is that also the case given that the weight for
SL=7 in the VLarb_high table is 255?
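
To make the tables above easier to read, here is a minimal sketch that parses the option strings. It assumes that qos_sl2vl is a 16-entry list indexed by SL whose values are VLs, and that each vlarb entry has the form VL:weight; the helper names are hypothetical and not part of opensm.

```python
def parse_sl2vl(value):
    """Return a list where index = SL and element = VL (assumed 16 entries)."""
    vls = [int(v) for v in value.split(",")]
    if len(vls) != 16:
        raise ValueError("expected 16 SL-to-VL entries, got %d" % len(vls))
    return vls


def parse_vlarb(value):
    """Parse 'vl:weight,...' into a list of (vl, weight) tuples in table order."""
    entries = []
    for item in value.split(","):
        vl, weight = item.split(":")
        entries.append((int(vl), int(weight)))
    return entries


def vls_in_use(sl2vl):
    """Sorted list of the distinct VLs that at least one SL maps to."""
    return sorted(set(sl2vl))


default_sl2vl = parse_sl2vl("7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7")
ca_sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
default_high = dict(parse_vlarb(
    "0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0"))

print(vls_in_use(default_sl2vl))  # VLs reachable through the default table
print(vls_in_use(ca_sl2vl))       # VLs reachable through the CA table
print(default_high[7])            # high-priority weight listed for VL 7
```

Under those assumptions, the default qos_sl2vl line as posted reads as mapping every SL to VL 7, and the weights in the vlarb tables are attached to VLs rather than to SLs. The sketch says nothing about how bandwidth is actually shared.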

Thanks,
Regards,
Adit


From jlentini at netapp.com  Tue Dec 12 10:51:23 2006
From: jlentini at netapp.com (James Lentini)
Date: Tue, 12 Dec 2006 13:51:23 -0500 (EST)
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457EEB07.8040904@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com>
	<457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com>
Message-ID: <Pine.LNX.4.64.0612121346510.20796@jlentini-linux.nane.netapp.com>


It appears that one or more of the receive work requests is completing 
in error. The crash occurs when the server attempts to clean up the 
buffer associated with the work request.

I'd like to know why receives are failing. What is the error? Do your 
logs contain the printk on net/sunrpc/svc_rdma_recvfrom.c:522 
"svcrdma: bad WR completion..."? If they do not, you can turn on 
SVCRDMA_DEBUG (echo 4096 > /proc/sys/sunrpc/rpc_debug).
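
For going through the "bad WR completion" records quoted below, a small script can help. This is a minimal sketch, assuming the record layout seen in the quoted log (ctxt, count, xprt, rqstp, status) and assuming that status 5 corresponds to IB_WC_WR_FLUSH_ERR in enum ib_wc_status; that mapping should be checked against the ib_verbs.h of the tree in use. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumed subset of enum ib_wc_status; verify against ib_verbs.h.
WC_STATUS_NAMES = {0: "IB_WC_SUCCESS", 5: "IB_WC_WR_FLUSH_ERR"}

CTXT_RE = re.compile(
    r"ctxt=([0-9a-f]+), count=(\d+) on xprt=([0-9a-f]+), "
    r"rqstp=([0-9a-f]+), status=(\d+)")


def parse_bad_completions(log_text):
    """Extract completion records from (possibly quoted and re-wrapped) log text."""
    # Drop mail quote prefixes such as "> > " at the start of each line.
    unquoted = re.sub(r"^(?:>[ \t]?)+", "", log_text, flags=re.MULTILINE)
    # Mailers wrap one record over two lines, so collapse all whitespace.
    flat = " ".join(unquoted.split())
    return [
        {"ctxt": m.group(1), "count": int(m.group(2)), "xprt": m.group(3),
         "rqstp": m.group(4), "status": int(m.group(5))}
        for m in CTXT_RE.finditer(flat)
    ]


def repeated_contexts(records):
    """Return the ctxt values that show up in more than one record."""
    counts = Counter(r["ctxt"] for r in records)
    return sorted(c for c, n in counts.items() if n > 1)


SAMPLE = """\
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel:  ctxt=ffff81023e4cf000, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel:  ctxt=ffff81023e4cf400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel:  ctxt=ffff81023e4cf400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
"""

records = parse_bad_completions(SAMPLE)
for rec in records:
    name = WC_STATUS_NAMES.get(rec["status"], "unknown")
    print(rec["ctxt"], rec["status"], name)
print("repeated:", repeated_contexts(records))
```

In the quoted log the last context, ffff81023e4cf400, is reported twice just before the BUG, and the same value appears in RBP in the register dump. That is only an observation about the posted log, not a diagnosis.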

james

On Tue, 12 Dec 2006, Vu Pham wrote:

> James,
>   Another variation of put_page problem. I have stopped doing I/O or 
> accessing the mounted directory since last night. This morning I 
> just try to do *ls* the mounted directory and get this error
> 
> -vu
> 
> > James,
> >   I hit another variation of put_page problem. I just ran iozone with 9 GB
> > file size (both client and server machines have 8 GB of memory, dual
> > woodcrest xeon cpus, 2.6.18.5 kernel, nfsrdma release 7)
> > 
> > After this happened other nfsrdma clients can still do I/O to the server
> > 
> > -vu
> > 
> > > Hit *send* too soon - here is the objdump of swap.o
> > > 
> > > -vu
> > > 
> > > 
> > > > James Lentini wrote:
> > > > > A couple of questions Vu:
> > > > > 
> > > > > What NFS-RDMA release are you using? This looks like release 7.
> > > > > 
> > > > 
> > > > Yes. I'm using release 7
> > > > 
> > > > > Is this reproducible?
> > > > 
> > > > I ran into it twice - I think that it may co-relate to openSM restart
> > > > incident. I'll double check it and confirm
> > > > 
> > > > 
> > > > > What kernel version are you using?
> > > > 
> > > > 2.6.18.5
> > > > 
> > > > > What hardware is this on? It looks like x86-64 to me, which is fine. I
> > > > > just want to be sure I know what I'm looking at. As many specifics as
> > > > > possible is good (number of CPUs, hyperthreading, etc.)
> > > > > 
> > > > 
> > > > Dual woodcrest xeon based CPUs
> > > > 
> > > > > Could you send the output of
> > > > > objdump -Slr /path/to/kernel/mm/swap.o
> > > > > 
> > > > 
> > > > I attached the objdump output here
> > > > 
> > > > > Actually, just the put_page disassembly is all I want to see.
> > > > > 
> > > > > Is there any more text available? Usually there is an explanation
> > > > > given for an oops message (e.g. "Unable to handle kernel paging
> > > > > request..").
> > > > > 
> > > > 
> > > > I did not see any oops text message. System was still responsive with
> > > > ipoib ping or login
> > > > 
> > > > 
> > > > > I opened a bug at the NFS-RDMA SourceForge project to track this:
> > > > > 
> > > > > http://sourceforge.net/tracker/index.php?func=detail&aid=1613201&group_id=97628&atid=618583 
> > > > 
> > > > thanks for your help,
> > > > 
> > > > -vu
> > > > 
> > > > > Thanks for reporting this.
> > > > > james
> > > > > 
> > > > > On Fri, 8 Dec 2006, Vu Pham wrote:
> > > > > 
> > > > > > Hi James,
> > > > > >   I got these errors in server's /var/log/messages and then the
> > > > > > server stop
> > > > > > responding to login, I/O...; however, the server is still up, ipoib
> > > > > > is still
> > > > > > working
> > > > > > 
> > > > > > 
> > > > > > Dec  8 06:38:21 ibd201 kernel: RIP: 0010:[<ffffffff8025dff7>]
> > > > > > [<ffffffff8025dff7>] put_page+0x17/0x40
> > > > > > Dec  8 06:38:21 ibd201 kernel: RSP: 0018:ffff810219ddfb08  EFLAGS:
> > > > > > 00010246
> > > > > > Dec  8 06:38:21 ibd201 kernel: RAX: 0000000000000000 RBX:
> > > > > > 0000000000000001
> > > > > > RCX: 000000000003ffff
> > > > > > Dec  8 06:38:21 ibd201 kernel: RDX: 0000000000000000 RSI:
> > > > > > 0000000000000001
> > > > > > RDI: ffff8102274e92f8
> > > > > > Dec  8 06:38:21 ibd201 kernel: RBP: ffff8101ab785000 R08:
> > > > > > 0000000000000034
> > > > > > R09: 0000000000000000
> > > > > > Dec  8 06:38:21 ibd201 kernel: R10: 0000000000000000 R11:
> > > > > > 0000000000000000
> > > > > > R12: ffff81020ef96800
> > > > > > Dec  8 06:38:21 ibd201 kernel: R13: ffff8101ab785000 R14:
> > > > > > 0000000000000000
> > > > > > R15: ffff8102053ee890
> > > > > > Dec  8 06:38:21 ibd201 kernel: FS:  00002ad76b8acb00(0000)
> > > > > > GS:ffff81022066eb40(0000) knlGS:0000000000000000
> > > > > > Dec  8 06:38:21 ibd201 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> > > > > > 000000008005003b
> > > > > > Dec  8 06:38:21 ibd201 kernel: CR2: 00002aaaaabf1000 CR3:
> > > > > > 000000021c22b000
> > > > > > CR4: 00000000000006e0
> > > > > > Dec  8 06:38:21 ibd201 kernel: Process nfsd (pid: 15038, threadinfo
> > > > > > ffff810219dde000, task ffff81020d87f0c0)
> > > > > > Dec  8 06:38:21 ibd201 kernel: Stack:  ffffffff8835e547
> > > > > > ffff81020ef96968
> > > > > > ffff81020ef96800 ffff81020ef96958
> > > > > > Dec  8 06:38:21 ibd201 kernel:  ffffffff88360c72 000000010395dc90
> > > > > > ffffffff80424e05 0000000000000000
> > > > > > Dec  8 06:38:21 ibd201 kernel:  0000000000200200 000000010395dc90
> > > > > > ffffffff80239b90 ffff81020d87f0c0
> > > > > > Dec  8 06:38:21 ibd201 kernel: Call Trace:
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8835e547>]
> > > > > > :sunrpc:svc_rdma_put_context+0x37/0xd0
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff88360c72>]
> > > > > > :sunrpc:svc_rdma_recvfrom+0x5a2/0x11e0
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
> > > > > > schedule_timeout+0x95/0xb0
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80239b90>]
> > > > > > process_timeout+0x0/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80423c2d>]
> > > > > > wait_for_completion_timeout+0xcd/0x150
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> > > > > > default_wake_function+0x0/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff881c1402>]
> > > > > > :ib_mthca:mthca_cmd_post+0x232/0x260
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> > > > > > default_wake_function+0x0/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff802fac39>]
> > > > > > __next_cpu+0x19/0x30
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80227dae>]
> > > > > > find_busiest_group+0x24e/0x6d0
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424772>]
> > > > > > thread_return+0x0/0xde
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff804263f8>]
> > > > > > _spin_unlock_irqrestore+0x8/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a331>]
> > > > > > try_to_del_timer_sync+0x51/0x60
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8023a34c>]
> > > > > > del_timer_sync+0xc/0x20
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80424e05>]
> > > > > > schedule_timeout+0x95/0xb0
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883559e6>]
> > > > > > :sunrpc:svc_recv+0x416/0x510
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> > > > > > default_wake_function+0x0/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff80228db0>]
> > > > > > default_wake_function+0x0/0x10
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>]
> > > > > > :nfsd:nfsd+0x0/0x380
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9651>]
> > > > > > :nfsd:nfsd+0x111/0x380
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab9c>]
> > > > > > child_rip+0xa/0x12
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>]
> > > > > > :nfsd:nfsd+0x0/0x380
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff883a9540>]
> > > > > > :nfsd:nfsd+0x0/0x380
> > > > > > Dec  8 06:38:21 ibd201 kernel:  [<ffffffff8020ab92>]
> > > > > > child_rip+0x0/0x12
> > > > > > Dec  8 06:38:21 ibd201 kernel:
> > > > > > Dec  8 06:38:21 ibd201 kernel:
> > > > > > Dec  8 06:38:21 ibd201 kernel: Code: 0f 0b 68 8c 41 45 80 c2 2c 01
> > > > > > f0 ff 4f 08
> > > > > > 0f 94 c0 84 c0 74
> > > > > > Dec  8 06:38:21 ibd201 kernel: RIP  [<ffffffff8025dff7>]
> > > > > > put_page+0x17/0x40
> > > > > > Dec  8 06:38:21 ibd201 kernel:  RSP <ffff810219ddfb08>
> > > > > > 
> > > > > > -vu
> > > > > > 
> > > > 
> > > > 
> > 
> > ------------------------------------------------------------------------
> > 
> > <snip>
> > 
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596b800, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596bc00, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17000, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17800, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17c00, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de000, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de800, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7dec00, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39000, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39800, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39c00, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf000, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> > Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
> > xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> > Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please
> > bite here ] ---------
> > Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300
> > Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP
> > Dec 12 01:09:30 ibd202 kernel: CPU 1
> > Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd exportfs lockd
> > nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod button battery
> > asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca shpchp ib_ipoib
> > ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000 floppy ext3 jbd
> > megaraid_sas sd_mod scsi_mod
> > Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1
> > Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[<ffffffff8025892b>]
> > [<ffffffff8025892b>] put_page+0x13/0x2e
> > Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08  EFLAGS: 00010246
> > Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001
> > RCX: 0000000000006a53
> > Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001
> > RDI: ffff81024fc3dec0
> > Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001
> > R09: 0000000000000000
> > Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8
> > R12: ffff810240fb3800
> > Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400
> > R15: 00000000000dbba0
> > Dec 12 01:09:30 ibd202 kernel: FS:  00002ad030296b00(0000)
> > GS:ffff81024688eac0(0000) knlGS:0000000000000000
> > Dec 12 01:09:30 ibd202 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> > 000000008005003b
> > Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000
> > CR4: 00000000000006e0
> > Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo
> > ffff81023fd10000, task ffff810246562840)
> > Dec 12 01:09:30 ibd202 kernel: Stack:  ffffffff8817b2fb ffff810240fb39b8
> > 0000000000000000 ffff81024172c5b0
> > Dec 12 01:09:30 ibd202 kernel:  ffffffff8817ec67 ffff81023cda7000
> > ffffffff8817b2a8 0000000000000000
> > Dec 12 01:09:30 ibd202 kernel:  ffff81023fd11ca0 ffff81023fd11b80
> > 0000000000000001 ffff81023cda7000
> > Dec 12 01:09:30 ibd202 kernel: Call Trace:
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2fb>]
> > :sunrpc:svc_rdma_put_context+0x37/0xb5
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817ec67>]
> > :sunrpc:svc_rdma_recvfrom+0x58f/0x1150
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2a8>]
> > :sunrpc:svc_rdma_get_context+0x10c/0x128
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817d5b8>]
> > :sunrpc:send_write+0x200/0x22c
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80254954>]
> > generic_file_readv+0x8e/0xa7
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8025ba92>]
> > zone_statistics+0x40/0x70
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80224401>]
> > find_busiest_group+0x21f/0x66f
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a2e9>]
> > _spin_unlock_irq+0x6/0xa
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff804285a3>] thread_return+0x64/0xec
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a259>]
> > _spin_lock_irqsave+0x9/0xe
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233574>]
> > lock_timer_base+0x1b/0x3c
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233776>]
> > try_to_del_timer_sync+0x4a/0x51
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233789>] del_timer_sync+0xc/0x16
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80428f6a>]
> > schedule_timeout+0x92/0xad
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88174070>]
> > :sunrpc:svc_recv+0x3c5/0x4be
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>]
> > default_wake_function+0x0/0xe
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>]
> > default_wake_function+0x0/0xe
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88204407>] :nfsd:nfsd+0x10d/0x359
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4ac>] child_rip+0xa/0x12
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
> > Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4a2>] child_rip+0x0/0x12
> > Dec 12 01:09:30 ibd202 kernel:
> > Dec 12 01:09:30 ibd202 kernel:
> > Dec 12 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f 94 c0 84 c0 74
> > Dec 12 01:09:30 ibd202 kernel: RIP  [<ffffffff8025892b>] put_page+0x13/0x2e
> > Dec 12 01:09:30 ibd202 kernel:  RSP <ffff81023fd11b08>
> > Dec 12 01:09:30 ibd202 kernel:  <4>nfsd: terminating on error 22
> > 
> > 
> 
> 


From eeb at bartonsoftware.com  Tue Dec 12 10:53:09 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Tue, 12 Dec 2006 18:53:09 -0000
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <adaejr5xdyl.fsf@cisco.com>
Message-ID: <0a3901c71e1e$c431f910$0281a8c0@ebpc>

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> Sent: 12 December 2006 4:43 PM
> To: Eric Barton
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] version #defines for the kernel
> 
>  > At the risk of flogging a dead horse - I was only thinking 
> of a very simple
>  > version number that incremented on change - something like
>  > LINUX_VERSION_CODE?
> 
> In that case what do you expect to see in a kernel with backported
> drivers, that has backported some changes but not others?

Blood on the floor somewhere, I'd hope :)

Or maybe just no #define for the version, since the person doing the
backport clearly isn't worried about compatibility with out-of-tree
code.
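
For reference, the kind of single incrementing number meant here can be sketched as follows. It assumes the usual KERNEL_VERSION(a,b,c) packing of (a << 16) + (b << 8) + c used for LINUX_VERSION_CODE; the Python function below is only a model of that macro.

```python
def kernel_version(a, b, c):
    """Model of the KERNEL_VERSION(a, b, c) macro: pack three fields into one int."""
    return (a << 16) + (b << 8) + c


running = kernel_version(2, 6, 18)
wanted = kernel_version(2, 6, 20)

print(running, hex(running))
print(running >= wanted)  # out-of-tree code would test this in an #if
```

A single number like this orders releases cleanly, but it cannot describe a tree that has picked up some later changes and not others, which is the backport case raised above.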

                Cheers,
                        Eric


From mst at mellanox.co.il  Tue Dec 12 11:02:01 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 21:02:01 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212064847.GB13509@mellanox.co.il>
References: <adad56q1g3t.fsf@cisco.com> <20061212064847.GB13509@mellanox.co.il>
Message-ID: <20061212190200.GE382@mellanox.co.il>

> Quoting r. Michael S. Tsirkin <mst at mellanox.co.il>:
> Subject: Re: [PATCHv2] IPoIB CM Experimental support
> 
> > I think we could probably merge it but maybe it's better to put it in
> > -mm for a cycle given that it's new and not too many people have
> > looked at it yet.  And I still haven't gotten comfortable with the way
> > CM is enabled.
> 
> Now I'm confused. Bottom line, should I try fixing up the enabling bit ASAP,
> or you don't want it in 2.6.20 anyway?

Roland, could you clarify your opinion pls?

-- 
MST


From rdreier at cisco.com  Tue Dec 12 11:12:00 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:12:00 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212190200.GE382@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 12 Dec 2006 21:02:01 +0200")
References: <adad56q1g3t.fsf@cisco.com> <20061212064847.GB13509@mellanox.co.il>
	<20061212190200.GE382@mellanox.co.il>
Message-ID: <adamz5tudxb.fsf@cisco.com>

 > > Now I'm confused. Bottom line, should I try fixing up the enabling bit ASAP,
 > > or you don't want it in 2.6.20 anyway?
 > 
 > Roland, could you clarify your opinion pls?

Sorry, I thought about this a fair amount.  I think I finally ended up
feeling that the code is just too new.  I don't think anyone other
than you has had a chance to really look at it (I certainly haven't)
so I think we're better off not merging it.

I know that you said -mm has limited value but I actually think just
the build coverage is worth it.  And it is surprising how many people
are auditing the new code that shows up in -mm so I think it will help
a fair bit.

 - R.


From halr at voltaire.com  Tue Dec 12 11:12:21 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 12 Dec 2006 14:12:21 -0500
Subject: [openib-general] QoS configuration using opensm
In-Reply-To: <d2ad857f0612121045p272124e0kebe9a16a60d711ae@mail.gmail.com>
References: <d2ad857f0612121045p272124e0kebe9a16a60d711ae@mail.gmail.com>
Message-ID: <1165950734.28709.17589.camel@hal.voltaire.com>

Hi Adit,

On Tue, 2006-12-12 at 13:45, Adit Ranadive wrote:
> Hi,
> 
> I'm trying to establish some QoS parameters for allowing apps to
> communicate using different service levels.
> 
> Currently my opensm.opts looks like this:
> 
> # QoS default options
> qos_max_vls 15
> qos_high_limit 0
> qos_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:255,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> qos_vlarb_low 0:4,1:100,2:100,3:100,4:100,5:100,6:100,7:100,8:100,9:100,10:100,11:100,12:100,13:4,14:4
> qos_sl2vl 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7
> 
> # QoS CA options
> qos_ca_max_vls 15
> qos_ca_high_limit 0
> qos_ca_vlarb_high
> 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
> qos_ca_vlarb_low
> 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
> qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
> 
> I'm not sure which options to modify: QoS default or QoS CA?

Depends what you want to do.

The defaults are used by all unless overridden by the specific
configuration by target port type (ca, rtr, sw0, ext).

> Should both tables have same values?

They can but if they do, you don't need one of them (likely the ca_
one).

> My setup includes no switch, just two machines connected back to back
> using an IB cable.
> 
> Since I'm mapping service level 7 to all VLs

No, it's the other way around: You are mapping all SLs to VL 7.
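
To make the direction of the mapping concrete, here is a rough sketch of how
a qos_sl2vl line can be read: position in the list is the SL, the value at
that position is the VL. parse_sl2vl() is a made-up helper for illustration,
not opensm code, and the 0..15 range check is an assumption based on VL
numbers being 4 bits wide.

```c
#include <stdlib.h>

#define NUM_SLS 16

/*
 * Parse "v0,v1,...,v15" into vl[], so that vl[sl] is the VL used by that SL.
 * Returns 0 on success, -1 on malformed input.
 */
static int parse_sl2vl(const char *s, int vl[NUM_SLS])
{
	int sl;

	for (sl = 0; sl < NUM_SLS; sl++) {
		char *end;
		long v = strtol(s, &end, 10);

		if (end == s || v < 0 || v > 15)
			return -1;	/* no digits, or not a 4-bit VL */
		vl[sl] = (int)v;

		if (sl < NUM_SLS - 1) {
			if (*end != ',')
				return -1;	/* too few entries */
			s = end + 1;
		} else if (*end != '\0') {
			return -1;		/* trailing garbage */
		}
	}
	return 0;
}
```

With the qos_sl2vl line quoted above, every vl[sl] comes out as 7; with the
qos_ca_sl2vl line, SLs 0 through 14 map to the VL of the same number and SL 15
maps to VL 7.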

> all apps using sl=7 should receive equal bandwidth?

These tables only deal with arbitration amongst the VLs (and the mapping
of the SLs to VLs). They do not deal with fairness amongst applications
sharing the same SL.

> Also since in VLarb_high table weight of SL=7 is 255?

That setting means that the high priority limit can be unbounded and low
priority will only be scheduled if there is no high priority work to do.
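
For reading the qos_vlarb_high/qos_vlarb_low lines themselves, each entry is
a VL:weight pair. A minimal sketch of splitting such a line is below.
parse_vlarb() and struct vlarb_entry are hypothetical names, not opensm code,
and the limits (VL 0..14, weight 0..255) are assumptions taken from the VL
arbitration table layout rather than from the opensm source.

```c
#include <stdio.h>

struct vlarb_entry {
	int vl;		/* data VL this entry applies to */
	int weight;	/* arbitration weight for that VL */
};

/*
 * Parse "vl:weight,vl:weight,..." into tab[].
 * Returns the number of entries parsed, or -1 on malformed input
 * or if the line holds more than max entries.
 */
static int parse_vlarb(const char *s, struct vlarb_entry *tab, int max)
{
	int n = 0;

	while (*s && n < max) {
		int vl, weight, used;

		if (sscanf(s, "%d:%d%n", &vl, &weight, &used) != 2)
			return -1;
		if (vl < 0 || vl > 14 || weight < 0 || weight > 255)
			return -1;

		tab[n].vl = vl;
		tab[n].weight = weight;
		n++;

		s += used;
		if (*s == ',')
			s++;
		else if (*s != '\0')
			return -1;
	}
	return *s ? -1 : n;
}
```

Run over the qos_vlarb_high line quoted above, this yields 15 entries, with
VL 7 carrying weight 255 and VL 0 carrying weight 4.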

-- Hal

> Thanks,
> Regards,
> Adit
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From rdreier at cisco.com  Tue Dec 12 11:21:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:21:18 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061212191805.GK382@mellanox.co.il> (Michael S. Tsirkin's
	message of "Tue, 12 Dec 2006 21:18:05 +0200")
References: <adamz5tudxb.fsf@cisco.com> <20061212191805.GK382@mellanox.co.il>
Message-ID: <adairggvs29.fsf@cisco.com>

 > OK, thanks for the suggestion. I'll send something to Andrew after -rc1's out.

Actually I can get stuff into -mm with no work at all -- Andrew just
grabs my for-mm branch.  So I'll drop CM into there and we'll be all set.

 > The memory registration speedup patches can be 2.6.20 material though, can't they?

Yes, they look OK.  I am merging stuff up right now and they are on my
list.

 - R.


From mst at mellanox.co.il  Tue Dec 12 11:18:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 21:18:05 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <adamz5tudxb.fsf@cisco.com>
References: <adamz5tudxb.fsf@cisco.com>
Message-ID: <20061212191805.GK382@mellanox.co.il>

>  > > Now I'm confused. Bottom line, should I try fixing up the enabling bit ASAP,
>  > > or you don't want it in 2.6.20 anyway?
>  > 
>  > Roland, could you clarify your opinion pls?
> 
> Sorry, I thought about this a fair amount.  I think I finally ended up
> feeling that the code is just too new.  I don't think anyone other
> than you has had a chance to really look at it (I certainly haven't)
> so I think we're better off not merging it.
> 
> I know that you said -mm has limited value but I actually think just
> the build coverage is worth it.  And it is surprising how many people
> are auditing the new code that shows up in -mm so I think it will help
> a fair bit.

OK, thanks for the suggestion. I'll send something to Andrew after -rc1's out.
The memory registration speedup patches can be 2.6.20 material though, can't they?


-- 
MST


From vuhuong at mellanox.com  Tue Dec 12 11:25:01 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 11:25:01 -0800
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <Pine.LNX.4.64.0612121346510.20796@jlentini-linux.nane.netapp.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com>
	<457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com>
	<Pine.LNX.4.64.0612121346510.20796@jlentini-linux.nane.netapp.com>
Message-ID: <457F020D.7010500@mellanox.com>

James Lentini wrote:
> It appears that one or more of the receive work requests is completing 
> in error. The crash occurs when the server attempts to cleanup the 
> buffer associated with the work request.
> 
> I'd like to know why receives are failing. What is the error? Do your 
> logs contain the printk on net/sunrpc/svc_rdma_recvfrom.c:522 
> "svcrdma: bad WR completion..."? If they do not, you can turn on 
> SVCRDMA_DEBUG (echo 4096 > /proc/sys/sunrpc/rpc_debug).
> 

Yes, this error message was already in my original logs, messages.202 
and messages.202.1

see below

-vu

>>> ------------------------------------------------------------------------
>>>
>>> <snip>
>>>
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596b800, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81012596bc00, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17000, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17400, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17800, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff810144c17c00, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de000, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de400, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7de800, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e7dec00, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39000, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39400, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39800, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023dd39c00, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf000, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
>>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
>>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
>>> Dec 12 01:09:30 ibd202 kernel: ----------- [cut here ] --------- [please
>>> bite here ] ---------
>>> Dec 12 01:09:30 ibd202 kernel: Kernel BUG at include/linux/mm.h:300
>>> Dec 12 01:09:30 ibd202 kernel: invalid opcode: 0000 [1] SMP Dec 12 01:09:30
>>> ibd202 kernel: CPU 1 Dec 12 01:09:30 ibd202 kernel: Modules linked in: nfsd
>>> exportfs lockd nfs_acl ipv6 autofs4 sunrpc rdma_cm ib_addr dm_mirror dm_mod
>>> button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core ib_mthca
>>> shpchp ib_ipoib ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core e1000
>>> floppy ext3 jbd megaraid_sas sd_mod scsi_mod
>>> Dec 12 01:09:30 ibd202 kernel: Pid: 4343, comm: nfsd Not tainted 2.6.18.5 #1
>>> Dec 12 01:09:30 ibd202 kernel: RIP: 0010:[<ffffffff8025892b>]
>>> [<ffffffff8025892b>] put_page+0x13/0x2e
>>> Dec 12 01:09:30 ibd202 kernel: RSP: 0018:ffff81023fd11b08  EFLAGS: 00010246
>>> Dec 12 01:09:30 ibd202 kernel: RAX: 0000000000000000 RBX: 0000000000000001
>>> RCX: 0000000000006a53
>>> Dec 12 01:09:30 ibd202 kernel: RDX: 00000000ffffff01 RSI: 0000000000000001
>>> RDI: ffff81024fc3dec0
>>> Dec 12 01:09:30 ibd202 kernel: RBP: ffff81023e4cf400 R08: 0000000000000001
>>> R09: 0000000000000000
>>> Dec 12 01:09:30 ibd202 kernel: R10: 0000000000000000 R11: ffffffff88185ac8
>>> R12: ffff810240fb3800
>>> Dec 12 01:09:30 ibd202 kernel: R13: ffff810240fb3800 R14: ffff81023d045400
>>> R15: 00000000000dbba0
>>> Dec 12 01:09:30 ibd202 kernel: FS:  00002ad030296b00(0000)
>>> GS:ffff81024688eac0(0000) knlGS:0000000000000000
>>> Dec 12 01:09:30 ibd202 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>> Dec 12 01:09:30 ibd202 kernel: CR2: 00002b70add7aad8 CR3: 000000023ebd3000
>>> CR4: 00000000000006e0
>>> Dec 12 01:09:30 ibd202 kernel: Process nfsd (pid: 4343, threadinfo
>>> ffff81023fd10000, task ffff810246562840)
>>> Dec 12 01:09:30 ibd202 kernel: Stack:  ffffffff8817b2fb ffff810240fb39b8
>>> 0000000000000000 ffff81024172c5b0
>>> Dec 12 01:09:30 ibd202 kernel:  ffffffff8817ec67 ffff81023cda7000
>>> ffffffff8817b2a8 0000000000000000
>>> Dec 12 01:09:30 ibd202 kernel:  ffff81023fd11ca0 ffff81023fd11b80
>>> 0000000000000001 ffff81023cda7000
>>> Dec 12 01:09:30 ibd202 kernel: Call Trace:
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2fb>]
>>> :sunrpc:svc_rdma_put_context+0x37/0xb5
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817ec67>]
>>> :sunrpc:svc_rdma_recvfrom+0x58f/0x1150
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817b2a8>]
>>> :sunrpc:svc_rdma_get_context+0x10c/0x128
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8817d5b8>]
>>> :sunrpc:send_write+0x200/0x22c
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80254954>]
>>> generic_file_readv+0x8e/0xa7
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8025ba92>]
>>> zone_statistics+0x40/0x70
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80224401>]
>>> find_busiest_group+0x21f/0x66f
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a2e9>]
>>> _spin_unlock_irq+0x6/0xa
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff804285a3>] thread_return+0x64/0xec
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8042a259>]
>>> _spin_lock_irqsave+0x9/0xe
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233574>]
>>> lock_timer_base+0x1b/0x3c
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233776>]
>>> try_to_del_timer_sync+0x4a/0x51
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80233789>] del_timer_sync+0xc/0x16
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80428f6a>]
>>> schedule_timeout+0x92/0xad
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88174070>]
>>> :sunrpc:svc_recv+0x3c5/0x4be
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>]
>>> default_wake_function+0x0/0xe
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff80225264>]
>>> default_wake_function+0x0/0xe
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff88204407>] :nfsd:nfsd+0x10d/0x359
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4ac>] child_rip+0xa/0x12
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff882042fa>] :nfsd:nfsd+0x0/0x359
>>> Dec 12 01:09:30 ibd202 kernel:  [<ffffffff8020a4a2>] child_rip+0x0/0x12
>>> Dec 12 01:09:30 ibd202 kernel: Dec 12 01:09:30 ibd202 kernel: Dec 12
>>> 01:09:30 ibd202 kernel: Code: 0f 0b 68 16 4d 45 80 c2 2c 01 f0 ff 4f 08 0f
>>> 94 c0 84 c0 74 Dec 12 01:09:30 ibd202 kernel: RIP  [<ffffffff8025892b>]
>>> put_page+0x13/0x2e
>>> Dec 12 01:09:30 ibd202 kernel:  RSP <ffff81023fd11b08>
>>> Dec 12 01:09:30 ibd202 kernel:  <4>nfsd: terminating on error 22
>>>
>>>
>>> ------------------------------------------------------------------------
>>>
>>


From rdreier at cisco.com  Tue Dec 12 11:30:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:30:38 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com> (Sean
	Hefty's message of "Thu, 30 Nov 2006 16:53:41 -0800")
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
Message-ID: <adaejr4vrmp.fsf@cisco.com>

OK, I merged 1..5 up for 2.6.20.  I had to fix a few conflicts with
cma.c cleanups already upstream, and also with the miscdevice
conversion from class device to real device (basically what Steve Wise
posted a few days ago).

I just pushed the result out in my for-2.6.20 branch if anyone wants
to check, and I'll ask Linus to pull soon.

I did have one question, but we can clean it up later:

 > +		if (signal_pending(current)) {
 > +			ret = -ERESTARTSYS;
 > +			break;
 > +		}
 > +
 > +		prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE);
 > +		mutex_unlock(&file->mut);
 > +		schedule();
 > +		mutex_lock(&file->mut);
 > +		finish_wait(&file->poll_wait, &wait);

is there any reason why this can't just be written with
wait_event_interruptible() instead of this more-complex way?

 - R.


From rdreier at cisco.com  Tue Dec 12 11:50:30 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 11:50:30 -0800
Subject: [openib-general] [ANNOUNCE] - Chelsio T3 Git Repositories
In-Reply-To: <1165864386.6867.2.camel@stevo-desktop> (Steve Wise's
	message of "Mon, 11 Dec 2006 13:13:06 -0600")
References: <BAE9DCEF64577A439B3A37F36F9B691C014EB221@orsmsx418.amr.corp.intel.com>
	<1165864386.6867.2.camel@stevo-desktop>
Message-ID: <aday7pcuc55.fsf@cisco.com>

 > Hey Roland, is there a preferred way to handle this?  I.e., what's the best
 > way of keeping a 2.6.19 based patch set while also trying to merge your
 > patches into the latest from Linus's tree? 
 > 
 > I guess I can create a branch with a HEAD at 2.6.19 and back-port my
 > latest patch set.  Is that the best way?  Maybe a for-ofed branch?

I don't know if there's a great way to handle it.  Basically you need
a branch based at the v2.6.19 tag.  There's probably a smart way to
keep the merge between your latest stuff and the backports, but I'm
not sure what it would be exactly.

 - R.


From vishal at endace.com  Tue Dec 12 12:21:00 2006
From: vishal at endace.com (vishal)
Date: Wed, 13 Dec 2006 09:21:00 +1300
Subject: [openib-general] srp initiator device discovery
In-Reply-To: <457E707A.4040802@mellanox.com>
References: <mailman.374.1165886944.18259.openib-general@openib.org>
	<1165899109.14308.9.camel@julia.et.endace.com>
	<457E707A.4040802@mellanox.com>
Message-ID: <1165954860.14308.12.camel@julia.et.endace.com>

Hi,

    I have only a single cable connecting the initiator and the target
machine...

Thanks!

Vishal


On Tue, 2006-12-12 at 01:03 -0800, Vu Pham wrote:
> How many cables did you connect from your host to the fabric?
> 
> If you have two cables (2 ports of same hca or each port of 
> 2 hcas) connected then you have two paths to same srp 
> target. Each path will see the same number of luns of srp 
> target. You can work with dm-multipath/multipath and access 
> the luns/devices thru /dev/mapper - this will give you 
> fail-over/fail-back capability
> 
> IBGD's srp target only works with scsi devices. It does not 
> work with block devices (hdX, md, lvm volumes ...)
> 
> -vu
> 
> > Hi,
> > 
> >    I have srp initiator installed with OFED-1.1, and another machine
> > with SRP target (IBGOLD). I started the srp daemon to discover the
> > target devices, and then ran fdisk -l to see the list. The list (below)
> > shows duplicate devices :-
> > 
> > Disk /dev/sdb: 2199.0 GB, 2199023255552 bytes
> > 255 heads, 63 sectors/track, 267349 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> > Disk /dev/sdb doesn't contain a valid partition table
> > 
> > Disk /dev/sdc: 2199.0 GB, 2199023255552 bytes
> > 255 heads, 63 sectors/track, 267349 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > 
> > Disk /dev/sdd: 500.1 GB, 500107862016 bytes
> > 255 heads, 63 sectors/track, 60801 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sdd1   *           1          13      104391   83  Linux
> > /dev/sdd2              14       60801   488279610   8e  Linux LVM
> > 
> > Disk /dev/sde: 2199.0 GB, 2199023255552 bytes
> > 255 heads, 63 sectors/track, 267349 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> > Disk /dev/sde doesn't contain a valid partition table
> > 
> > Disk /dev/sdf: 2199.0 GB, 2199023255552 bytes
> > 255 heads, 63 sectors/track, 267349 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > 
> > Disk /dev/sdg: 500.1 GB, 500107862016 bytes
> > 255 heads, 63 sectors/track, 60801 cylinders
> > Units = cylinders of 16065 * 512 = 8225280 bytes
> > 
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sdg1   *           1          13      104391   83  Linux
> > /dev/sdg2              14       60801   488279610   8e  Linux LVM
> > 
> > 
> > 
> > Doing some tests I found that sdb=sde, sdc=sdf, and sdd=sdg (obvious).
> > 
> > I also tested the device discovery after creating an md device on the
> > target side, and found that the initiator doesn't take into account the
> > presence of an md device. Is this the expected behaviour ?
> > 
> > Thanks for your time!
> > 
> > Vishal
> > 
> > 
> 


From tom at opengridcomputing.com  Tue Dec 12 12:23:34 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 12 Dec 2006 14:23:34 -0600
Subject: [openib-general] nfsrdma server stop responding,
In-Reply-To: <457F020D.7010500@mellanox.com>
References: <4579C6C3.5090207@mellanox.com>
	<Pine.LNX.4.64.0612111004000.20796@jlentini-linux.nane.netapp.com>
	<457E0516.2050009@mellanox.com> <457E069A.4020807@mellanox.com>
	<457E7414.6040802@mellanox.com> <457EEB07.8040904@mellanox.com>
	<Pine.LNX.4.64.0612121346510.20796@jlentini-linux.nane.netapp.com>
	<457F020D.7010500@mellanox.com>
Message-ID: <1165955014.8722.82.camel@trinity.ogc.int>

This is just the normal shutdown path. The WR completions are flushes (see status==5).

On Tue, 2006-12-12 at 11:25 -0800, Vu Pham wrote:
[...snip...]
> >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion
> >>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
...........................................................^

The bug is that the rdma ctxt is the same for these two WRs, and
that will cause the same pages to be freed twice.

> >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
> >>> Dec 12 01:09:30 ibd202 kernel: svcrdma: bad WR completion

............................................................v
> >>> Dec 12 01:09:30 ibd202 kernel: 	ctxt=ffff81023e4cf400, count=1 on
> >>> xprt=ffff810240fb3800, rqstp=ffff81023d045400, status=5
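
One common guard against this kind of double release is to clear the
caller's pointer at the moment the context is put, so that a later cleanup
path sees NULL instead of a stale ctxt. The following is only a userspace
sketch of that idiom. struct rdma_ctxt, put_context() and put_context_once()
are made-up stand-ins, not the svcrdma structures or functions, and malloc/free
stands in for the kernel page handling.

```c
#include <stdlib.h>

/* Stand-in for a receive context that owns some pages. */
struct rdma_ctxt {
	int count;		/* number of pages held */
	void *pages[4];
};

static int pages_freed;		/* instrumentation for this sketch only */

/* Release every page held by the context, then the context itself. */
static void put_context(struct rdma_ctxt *ctxt)
{
	int i;

	for (i = 0; i < ctxt->count; i++) {
		free(ctxt->pages[i]);
		pages_freed++;
	}
	free(ctxt);
}

/*
 * Release the context and clear the caller's pointer, so a second call
 * through the same pointer is a no-op instead of a double free.
 */
static void put_context_once(struct rdma_ctxt **pctxt)
{
	if (*pctxt) {
		put_context(*pctxt);
		*pctxt = NULL;
	}
}
```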


From mshefty at ichips.intel.com  Tue Dec 12 11:50:36 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Tue, 12 Dec 2006 11:50:36 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <adaejr4vrmp.fsf@cisco.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<adaejr4vrmp.fsf@cisco.com>
Message-ID: <457F080C.2090202@ichips.intel.com>

>  > +		if (signal_pending(current)) {
>  > +			ret = -ERESTARTSYS;
>  > +			break;
>  > +		}
>  > +
>  > +		prepare_to_wait(&file->poll_wait, &wait, TASK_INTERRUPTIBLE);
>  > +		mutex_unlock(&file->mut);
>  > +		schedule();
>  > +		mutex_lock(&file->mut);
>  > +		finish_wait(&file->poll_wait, &wait);
> 
> is there any reason why this can't just be written with
> wait_event_interruptible() instead of this more-complex way?

I don't think so.  The code followed the ucm, which is likely whatever Libor had 
done.  Did umad or uverbs follow this same format at some point?  In any case, 
this and the ucm could probably both be cleaned up.

- Sean


From tom at opengridcomputing.com  Tue Dec 12 12:36:05 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 12 Dec 2006 14:36:05 -0600
Subject: [openib-general] RNFS double page free fix
Message-ID: <1165955765.8722.88.camel@trinity.ogc.int>

Vu:

Thanks for finding this bug. I think I have a fix. 
Can you please apply it to your server and see if it 
fixes the problem for you too?

Thanks,
Tom

Double page free on session shutdown

From: Tom Tucker <tom at opengridcomputing.com>

---

 net/sunrpc/svc_rdma_recvfrom.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c
index ec62000..059f5ff 100644
--- a/net/sunrpc/svc_rdma_recvfrom.c
+++ b/net/sunrpc/svc_rdma_recvfrom.c
@@ -527,6 +527,7 @@ int svc_rdma_recvfrom(struct svc_rqst *r
 		/* Close the transport */
 		set_bit(SK_CLOSE, &xprt->sk_flags);
 		svc_rdma_put_context(ctxt, 1);
+		ctxt = NULL;
 		goto poll_dto_q;
 	}
 
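For readers following the thread, here is a minimal userspace sketch of why clearing the pointer after the put avoids a second release. The control flow is invented for illustration (the rest of svc_rdma_recvfrom is not shown in this thread), and sketch_ctxt, sketch_put_context and sketch_recv are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the svcrdma receive context. */
struct sketch_ctxt {
	int put_count;	/* how many times this context has been released */
};

/* Hypothetical analogue of svc_rdma_put_context(ctxt, 1). */
static void sketch_put_context(struct sketch_ctxt *ctxt)
{
	assert(ctxt != NULL);
	ctxt->put_count++;
}

/*
 * Simulates one pass through an invented receive path and returns how
 * many times the context ended up being released.
 */
static int sketch_recv(int transport_closed, int clear_after_put)
{
	struct sketch_ctxt storage = { 0 };
	struct sketch_ctxt *ctxt = &storage;

	if (transport_closed) {
		/* Error branch: release the context, as in the patch. */
		sketch_put_context(ctxt);
		if (clear_after_put)
			ctxt = NULL;	/* the one-line fix */
	}

	/* Shared cleanup path reached on every pass (invented here). */
	if (ctxt)
		sketch_put_context(ctxt);

	return storage.put_count;
}
```

Calling sketch_recv(1, 0) releases the context twice, while sketch_recv(1, 1) releases it once, which is the behaviour the added assignment is meant to restore.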

From rdreier at cisco.com  Tue Dec 12 12:37:01 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 12:37:01 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061210134137.GL29174@mellanox.co.il> (Michael S.
	Tsirkin's message of "Sun, 10 Dec 2006 15:41:37 +0200")
References: <20061129140016.GO5061@mellanox.co.il>
	<20061205161944.GD30209@mellanox.co.il>
	<20061210134137.GL29174@mellanox.co.il>
Message-ID: <adau000u9zm.fsf@cisco.com>

OK, I merged this up into an ipoib-cm branch and merged it into for-mm
as well.  I had to fix some work-struct related stuff and a few other
conflicts, so please look at what I did.  Testing wouldn't hurt either
(I didn't have a chance to do more than build it yet).

 - R.


From rdreier at cisco.com  Tue Dec 12 12:42:17 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 12:42:17 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <1165879256.19459.379.camel@localhost> (Matt Leininger's
	message of "Mon, 11 Dec 2006 15:20:56 -0800")
References: <200612051222.kB5CMDJQ017085@robert.bartonsoftware.com>
	<1165879256.19459.379.camel@localhost>
Message-ID: <adapsaou9qu.fsf@cisco.com>

 >   Roland may be able to comment on whether there are performance differences
 > for interrupt-driven CQs between the old VAPI stacks and OFED.

I think OFED is probably faster than any other stack I know of...

I think MST's idea of PCI tuning issues is probably right.  Can you
send the output of

    lspci -vxxx -d15b3:

with the two stacks?

 - R.


From mst at mellanox.co.il  Tue Dec 12 12:49:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 12 Dec 2006 22:49:04 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <adau000u9zm.fsf@cisco.com>
References: <adau000u9zm.fsf@cisco.com>
Message-ID: <20061212204904.GM382@mellanox.co.il>

> OK, I merged this up into an ipoib-cm branch and merged it into for-mm
> as well.  I had to fix some work-struct related stuff and a few other
> conflicts, so please look at what I did.  Testing wouldn't hurt either
> (I didn't have a chance to do more than build it yet).

OK, thanks. I'll look at it tomorrow.

-- 
MST


From vu at mellanox.com  Tue Dec 12 14:46:03 2006
From: vu at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 14:46:03 -0800
Subject: [openib-general] RNFS double page free fix
In-Reply-To: <1165955765.8722.88.camel@trinity.ogc.int>
References: <1165955765.8722.88.camel@trinity.ogc.int>
Message-ID: <457F312B.8060706@mellanox.com>

Tom,
  Thanks a lot. This patch seems to fix the double page free problem.

-vu

> Vu:
>
> Thanks for finding this bug. I think I have a fix. 
> Can you please apply it to your server and see if it 
> fixes the problem for you too?
>
> Thanks,
> Tom
>
> Double page free on session shutdown
>
> From: Tom Tucker <tom at opengridcomputing.com>
>
> ---
>
>  net/sunrpc/svc_rdma_recvfrom.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/net/sunrpc/svc_rdma_recvfrom.c b/net/sunrpc/svc_rdma_recvfrom.c
> index ec62000..059f5ff 100644
> --- a/net/sunrpc/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/svc_rdma_recvfrom.c
> @@ -527,6 +527,7 @@ int svc_rdma_recvfrom(struct svc_rqst *r
>  		/* Close the transport */
>  		set_bit(SK_CLOSE, &xprt->sk_flags);
>  		svc_rdma_put_context(ctxt, 1);
> +		ctxt = NULL;
>  		goto poll_dto_q;
>  	}
>  
>
>   


From vu at mellanox.com  Tue Dec 12 15:01:07 2006
From: vu at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 15:01:07 -0800
Subject: [openib-general] nfsrdma release 7 issues,
Message-ID: <457F34B3.9060402@mellanox.com>

James,
  Besides the double page free issue that Tom already fixed, I see the 
following issues:
1. Simultaneous nfsrdma mounts from multiple hosts. I see the 
following error messages:
...
Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for 
QP=ffff810240f5fa00
Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
QP=ffff810240f5f000
Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
QP=ffff810242cfa400

2.  While some clients run I/O, one idle client tries to access the mount 
point (e.g. with *ls*) and gets an input/output error. I see these error 
messages in the server log:

Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22
Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion
Dec 12 13:58:29 ibd202 kernel:  ctxt=ffff810242130800, count=1 on 
xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5
...
Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for 
unknown QP 2e0408

Then the mount point is inaccessible from all clients

3. Performance issue: I get at most 450 MB/s reading from the server cache 
(compared to 800 MB/s with release 6, using the same hardware configuration 
for both client and server).

thanks,
-vu


From tom at opengridcomputing.com  Tue Dec 12 15:36:14 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Tue, 12 Dec 2006 17:36:14 -0600
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <457F34B3.9060402@mellanox.com>
References: <457F34B3.9060402@mellanox.com>
Message-ID: <1165966574.8722.110.camel@trinity.ogc.int>

Vu:

See below...

On Tue, 2006-12-12 at 15:01 -0800, Vu Pham wrote:
> James,
>   Beside the double page free issue that Tom already fixed, I see the 
> following issues:
> 1. simultaneous nfsrdmamount from multiple host issue. I see the 
> following error messages
> ...
> Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for 
> QP=ffff810240f5fa00
> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
> QP=ffff810240f5f000
> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
> QP=ffff810242cfa400

This is the known race in the ib cm that resulted in the addition of the
rdma_establish interface. For RNFS it is a benign message, but I do need
to add the call ...I'm not fond of the rdma_establish solution so I've
dragged my feet...Thanks for reminding me ;-)

> 
> 2.  While some clients run I/Os, one idle client try to access the mount 
> point ie. *ls* and get I/O input error. I see these error messages on 
> server log
> 
> Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22
> Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion
> Dec 12 13:58:29 ibd202 kernel:  ctxt=ffff810242130800, count=1 on 
> xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5
> ...
> Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for 
> unknown QP 2e0408
> 
> Then the mount point is inaccessible from all clients

Ooh. This looks bad. This isn't concurrent with issue 1. above is it?
Was the "idle" client idle for more than 6 minutes? 

> 
> 3. performance issue - I got max 450 MB/s  read from server cache 
> (comparing to 800 MB/s with release 6, using the same hw configuration 
> for both client/server)
> 

Oof... 

1. I get much better than this on my MTD1000 hardware with SDR. Can you
send me your .config?

2. Can you please send me the iozone test parameters you're using?

Thanks,
Tom
> thanks,
> -vu


From vuhuong at mellanox.com  Tue Dec 12 15:59:39 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Tue, 12 Dec 2006 15:59:39 -0800
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <1165966574.8722.110.camel@trinity.ogc.int>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
Message-ID: <457F426B.7020104@mellanox.com>

Tom,

> Vu:
> 
> See below...
> 
> On Tue, 2006-12-12 at 15:01 -0800, Vu Pham wrote:
>> James,
>>   Beside the double page free issue that Tom already fixed, I see the 
>> following issues:
>> 1. simultaneous nfsrdmamount from multiple host issue. I see the 
>> following error messages
>> ...
>> Dec 12 13:31:40 ibd202 kernel: svcrdma: QP event 4 received for 
>> QP=ffff810240f5fa00
>> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
>> QP=ffff810240f5f000
>> Dec 12 13:34:17 ibd202 kernel: svcrdma: QP event 4 received for 
>> QP=ffff810242cfa400
> 
> This is the known race in the ib cm that resulted in the addition of the
> rdma_establish interface. For RNFS it is a benign message, but I do need
> to add the call ...I'm not fond of the rdma_establish solution so I've
> dragged my feet...Thanks for reminding me ;-)
> 

You will hear it from me from release to release ;-)


>> 2.  While some clients run I/Os, one idle client try to access the mount 
>> point ie. *ls* and get I/O input error. I see these error messages on 
>> server log
>>
>> Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22
>> Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion
>> Dec 12 13:58:29 ibd202 kernel:  ctxt=ffff810242130800, count=1 on 
>> xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5
>> ...
>> Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for 
>> unknown QP 2e0408
>>
>> Then the mount point is inaccessible from all clients
> 
> Ooh. This looks bad. This isn't concurrent with issue 1. above is it?

No, I don't think 1,2 are related

> Was the "idle" client idle for more than 6 minutes? 

Yes

> 
>> 3. performance issue - I got max 450 MB/s  read from server cache 
>> (comparing to 800 MB/s with release 6, using the same hw configuration 
>> for both client/server)
>>
> 
> Oof... 
> 
> 1. I get much better than this on my MTD1000 hardware with SDR. Can you
> send me your .config?

Please find it in the attachment

> 
> 2. Can you please send me the iozone test parameters your using?
> 

server has 8GB of mem, client has 2GB of mem

iozone -r 64KB -s 5g -i 0 -i 1
and
iozone -r 64KB -s 2g -i 0 -i 1 -t 3

thanks,
-vu

> Thanks,
> Tom
>> thanks,
>> -vu
> 

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: .config
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061212/75b50ae5/attachment.ksh>

From rdreier at cisco.com  Tue Dec 12 16:16:29 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Tue, 12 Dec 2006 16:16:29 -0800
Subject: [openib-general] [GIT PULL] please pull infiniband.git
Message-ID: <ada8xhctztu.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

Finishing up the major 2.6.20 merges, plus some fixes:

Krishna Kumar (1):
      RDMA/amso1100: Fix memory leak in c2_qp_modify()

Ralph Campbell (6):
      IB: Add DMA mapping functions to allow device drivers to interpose
      IB/ipath: Implement new verbs DMA mapping functions
      IB/core: Use the new verbs DMA mapping functions
      IPoIB: Use the new verbs DMA mapping functions
      IB/srp: Use new verbs IB DMA mapping functions
      IB/iser: Use the new verbs DMA mapping functions

Roland Dreier (5):
      IB/fmr: ib_flush_fmr_pool() may wait too long
      IB/ipath: Remove unused "write-only" variables
      IB/iser: Remove unused "write-only" variables
      IB/ipath: Fix IRQ for PCI Express HCAs
      IPoIB: Make sure struct ipoib_neigh.queue is always initialized

Sean Hefty (5):
      RDMA/cma: Remove unneeded qp_type parameter from rdma_cm
      RDMA/cma: Report connect info with connect events
      RDMA/cma: Allow early transition to RTS to handle lost CM messages
      RDMA/cma: Add support for RDMA_PS_UDP
      RDMA/cma: Export rdma cm interface to userspace

 drivers/infiniband/core/Makefile              |    6 +-
 drivers/infiniband/core/cm.c                  |    4 +
 drivers/infiniband/core/cma.c                 |  416 +++++++++---
 drivers/infiniband/core/fmr_pool.c            |   12 +-
 drivers/infiniband/core/mad.c                 |   90 ++--
 drivers/infiniband/core/mad_priv.h            |    6 +-
 drivers/infiniband/core/ucma.c                |  874 +++++++++++++++++++++++++
 drivers/infiniband/core/uverbs_marshall.c     |    5 +-
 drivers/infiniband/core/uverbs_mem.c          |   12 +-
 drivers/infiniband/hw/amso1100/c2_qp.c        |   13 +-
 drivers/infiniband/hw/ipath/Makefile          |    1 +
 drivers/infiniband/hw/ipath/ipath_dma.c       |  189 ++++++
 drivers/infiniband/hw/ipath/ipath_driver.c    |    4 +-
 drivers/infiniband/hw/ipath/ipath_file_ops.c  |    5 +-
 drivers/infiniband/hw/ipath/ipath_iba6110.c   |    3 +-
 drivers/infiniband/hw/ipath/ipath_iba6120.c   |    8 +-
 drivers/infiniband/hw/ipath/ipath_init_chip.c |    3 +-
 drivers/infiniband/hw/ipath/ipath_intr.c      |    3 +-
 drivers/infiniband/hw/ipath/ipath_keys.c      |    8 +-
 drivers/infiniband/hw/ipath/ipath_mr.c        |    7 +-
 drivers/infiniband/hw/ipath/ipath_sysfs.c     |    3 -
 drivers/infiniband/hw/ipath/ipath_verbs.c     |    1 +
 drivers/infiniband/hw/ipath/ipath_verbs.h     |    2 +
 drivers/infiniband/ulp/ipoib/ipoib.h          |    4 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c       |   75 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c     |    3 +-
 drivers/infiniband/ulp/iser/iscsi_iser.h      |    2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c  |    4 -
 drivers/infiniband/ulp/iser/iser_memory.c     |  125 ++--
 drivers/infiniband/ulp/srp/ib_srp.c           |   81 ++-
 drivers/infiniband/ulp/srp/ib_srp.h           |    2 +-
 include/rdma/ib_marshall.h                    |    5 +-
 include/rdma/ib_verbs.h                       |  253 +++++++
 include/rdma/rdma_cm.h                        |   62 ++-
 include/rdma/rdma_cm_ib.h                     |    3 +
 include/rdma/rdma_user_cm.h                   |  206 ++++++
 36 files changed, 2146 insertions(+), 354 deletions(-)
 create mode 100644 drivers/infiniband/core/ucma.c
 create mode 100644 drivers/infiniband/hw/ipath/ipath_dma.c
 create mode 100644 include/rdma/rdma_user_cm.h


From eitan at sw053.yok.mtl.com  Tue Dec 12 21:13:01 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Wed, 13 Dec 2006 07:13:01 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-13:normal completion
Message-ID: <200612130513.kBD5D1se025785@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = ____  
ibutils rev = ____  
Total=572 Pass=571 Fail=1

Pass:
78 Stability IS1-16.topo
78 Pkey IS1-16.topo
78 Multicast IS1-16.topo
78 LidMgr IS1-16.topo
77 OsmStress IS1-16.topo
26 Stability IS3-loop.topo
26 Stability IS3-128.topo
26 Pkey IS3-128.topo
26 OsmStress IS3-128.topo
26 Multicast IS3-loop.topo
26 Multicast IS3-128.topo
26 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo


From k_mahesh85 at yahoo.co.in  Tue Dec 12 22:55:13 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Wed, 13 Dec 2006 06:55:13 +0000 (GMT)
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
	driver
Message-ID: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>

Hello all,

I would like to know whether it is necessary to implement process_mad for an HCA.

After looking at the implementations of process_mad in the ipath and mthca drivers, I found that they are used to reply to MADs with port_info, gid_info, sm_info, etc.

But isn't that handled by the SMA in the host? I am a little confused now.
Please tell me whether it is required to implement process_mad for a (hypothetical) new HCA driver and, if it is required, why.

Please CC your replies to me.

regards,
K.Mahesh.



From ogerlitz at voltaire.com  Tue Dec 12 23:54:28 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 13 Dec 2006 09:54:28 +0200
Subject: [openib-general] [PATCH v4 2/7] IB/ipath - Implement new verbs
 DMA mapping functions
In-Reply-To: <1165947778.14800.315.camel@brick.pathscale.com>
References: <1165517253.14800.283.camel@brick.pathscale.com>
	<457BD18D.7000403@voltaire.com>
	<50951.71.131.43.73.1165860146.squirrel@rocky.pathscale.com>
	<adar6v6ynmy.fsf@cisco.com> <457E6DAE.3040206@voltaire.com>
	<1165947778.14800.315.camel@brick.pathscale.com>
Message-ID: <457FB1B4.2010809@voltaire.com>

Ralph Campbell wrote:
> On Tue, 2006-12-12 at 10:51 +0200, Or Gerlitz wrote:
>> Roland Dreier wrote:
>>>  > I would like to see this last set of patches integrated as is.
>>>  > I would like to get more experience with the current implementation
>>>  > before extending it to support other configurations.
>>>
>>> Yeah, let's go with that.  Since ipath depends on 64BIT in Kconfig
>>> anyway I think this is OK for now.
>> This design of ib_dma_map_single, ib_sg_dma_address etc returning u64 
>> instead of dma_addr_t causes the resulted patch to the IB ULPs to be 
>> quite big.
> 
> I think it was you who pointed out that dma_addr_t is
> 32 bits on sparc64.  Did you have a different solution
> in mind?

To be precise, I pointed out a problem and you came up with the 
solution of having ib_dma_map_xxx work with u64 instead of dma_addr_t.

As Roland suggested, you could have implemented a SW IOTLB that works 
with dma_addr_t, and you chose not to.

>> Have you tested any dma_map single (eg IPoIB) and sg (eg SRP or iSER) 
>> consumer with this code?

> Yes.

The new API (e.g. ib_dma_map_xxx, ib_sg_dma_address and ib_sg_dma_len) 
adds some branching on each call. I wonder if you have seen any 
performance difference before and after the change, specifically with 
IPoIB running a test with many PPS (i.e. iperf UDP) or an SRP IOPS test?

Or.
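
To make the branching concern concrete, here is a minimal userspace sketch of a mapping wrapper that checks for an interposed ops table on every call and otherwise falls through to a default path. All names (sketch_dev, sketch_dma_ops, sketch_map_single and so on) are hypothetical, and the default path only casts a pointer instead of performing any real DMA mapping; this is an assumption-laden illustration, not the code from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_dev;

/* Hypothetical per-device table a driver may install to interpose. */
struct sketch_dma_ops {
	uint64_t (*map_single)(struct sketch_dev *dev, void *cpu_addr,
			       size_t size);
};

struct sketch_dev {
	const struct sketch_dma_ops *dma_ops;	/* NULL: use the default */
};

/* Stand-in for the ordinary mapping path; it only casts the pointer. */
static uint64_t sketch_default_map(struct sketch_dev *dev, void *cpu_addr,
				   size_t size)
{
	(void)dev;
	(void)size;
	return (uint64_t)(uintptr_t)cpu_addr;
}

/* Wrapper in the style discussed above: one extra branch per call. */
static uint64_t sketch_map_single(struct sketch_dev *dev, void *cpu_addr,
				  size_t size)
{
	if (dev->dma_ops)
		return dev->dma_ops->map_single(dev, cpu_addr, size);
	return sketch_default_map(dev, cpu_addr, size);
}

/* Example interposer returning a fixed, recognisable cookie. */
static uint64_t sketch_cookie_map(struct sketch_dev *dev, void *cpu_addr,
				  size_t size)
{
	(void)dev;
	(void)cpu_addr;
	(void)size;
	return 0x1234;
}

static const struct sketch_dma_ops sketch_cookie_ops = {
	sketch_cookie_map
};
```

The wrapper returns a 64-bit value on every path, which is why callers that previously stored a dma_addr_t need touching; the per-call cost in question is the single test of dev->dma_ops.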


From ogerlitz at voltaire.com  Wed Dec 13 00:22:03 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 13 Dec 2006 10:22:03 +0200
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <ada8xhctztu.fsf@cisco.com>
References: <ada8xhctztu.fsf@cisco.com>
Message-ID: <457FB82B.4090902@voltaire.com>

Roland Dreier wrote:
> Linus, please pull from
> 
>     master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> 
> This tree is also available from kernel.org mirrors at:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus
> 
> Finishing up the major 2.6.20 merges, plus some fixes:

Roland,

you have CC-ed lkml at cisco.com on this email. Is there a chance you 
wanted to CC linux-kernel at vger.kernel.org instead?

May I ask what prevented v3 of the mthca profile patch (see 
http://article.gmane.org/gmane.linux.drivers.openib/34005) from getting in?

Or.

> 
> Krishna Kumar (1):
>       RDMA/amso1100: Fix memory leak in c2_qp_modify()
> 
> Ralph Campbell (6):
>       IB: Add DMA mapping functions to allow device drivers to interpose
>       IB/ipath: Implement new verbs DMA mapping functions
>       IB/core: Use the new verbs DMA mapping functions
>       IPoIB: Use the new verbs DMA mapping functions
>       IB/srp: Use new verbs IB DMA mapping functions
>       IB/iser: Use the new verbs DMA mapping functions
> 
> Roland Dreier (5):
>       IB/fmr: ib_flush_fmr_pool() may wait too long
>       IB/ipath: Remove unused "write-only" variables
>       IB/iser: Remove unused "write-only" variables
>       IB/ipath: Fix IRQ for PCI Express HCAs
>       IPoIB: Make sure struct ipoib_neigh.queue is always initialized
> 
> Sean Hefty (5):
>       RDMA/cma: Remove unneeded qp_type parameter from rdma_cm
>       RDMA/cma: Report connect info with connect events
>       RDMA/cma: Allow early transition to RTS to handle lost CM messages
>       RDMA/cma: Add support for RDMA_PS_UDP
>       RDMA/cma: Export rdma cm interface to userspace
> 
>  drivers/infiniband/core/Makefile              |    6 +-
>  drivers/infiniband/core/cm.c                  |    4 +
>  drivers/infiniband/core/cma.c                 |  416 +++++++++---
>  drivers/infiniband/core/fmr_pool.c            |   12 +-
>  drivers/infiniband/core/mad.c                 |   90 ++--
>  drivers/infiniband/core/mad_priv.h            |    6 +-
>  drivers/infiniband/core/ucma.c                |  874 +++++++++++++++++++++++++
>  drivers/infiniband/core/uverbs_marshall.c     |    5 +-
>  drivers/infiniband/core/uverbs_mem.c          |   12 +-
>  drivers/infiniband/hw/amso1100/c2_qp.c        |   13 +-
>  drivers/infiniband/hw/ipath/Makefile          |    1 +
>  drivers/infiniband/hw/ipath/ipath_dma.c       |  189 ++++++
>  drivers/infiniband/hw/ipath/ipath_driver.c    |    4 +-
>  drivers/infiniband/hw/ipath/ipath_file_ops.c  |    5 +-
>  drivers/infiniband/hw/ipath/ipath_iba6110.c   |    3 +-
>  drivers/infiniband/hw/ipath/ipath_iba6120.c   |    8 +-
>  drivers/infiniband/hw/ipath/ipath_init_chip.c |    3 +-
>  drivers/infiniband/hw/ipath/ipath_intr.c      |    3 +-
>  drivers/infiniband/hw/ipath/ipath_keys.c      |    8 +-
>  drivers/infiniband/hw/ipath/ipath_mr.c        |    7 +-
>  drivers/infiniband/hw/ipath/ipath_sysfs.c     |    3 -
>  drivers/infiniband/hw/ipath/ipath_verbs.c     |    1 +
>  drivers/infiniband/hw/ipath/ipath_verbs.h     |    2 +
>  drivers/infiniband/ulp/ipoib/ipoib.h          |    4 +-
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c       |   75 +--
>  drivers/infiniband/ulp/ipoib/ipoib_main.c     |    3 +-
>  drivers/infiniband/ulp/iser/iscsi_iser.h      |    2 +-
>  drivers/infiniband/ulp/iser/iser_initiator.c  |    4 -
>  drivers/infiniband/ulp/iser/iser_memory.c     |  125 ++--
>  drivers/infiniband/ulp/srp/ib_srp.c           |   81 ++-
>  drivers/infiniband/ulp/srp/ib_srp.h           |    2 +-
>  include/rdma/ib_marshall.h                    |    5 +-
>  include/rdma/ib_verbs.h                       |  253 +++++++
>  include/rdma/rdma_cm.h                        |   62 ++-
>  include/rdma/rdma_cm_ib.h                     |    3 +
>  include/rdma/rdma_user_cm.h                   |  206 ++++++
>  36 files changed, 2146 insertions(+), 354 deletions(-)
>  create mode 100644 drivers/infiniband/core/ucma.c
>  create mode 100644 drivers/infiniband/hw/ipath/ipath_dma.c
>  create mode 100644 include/rdma/rdma_user_cm.h


From halr at voltaire.com  Wed Dec 13 03:43:39 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Dec 2006 06:43:39 -0500
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>
References: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>
Message-ID: <1166010208.28709.59772.camel@hal.voltaire.com>

On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
> Hello all,
> 
> I want to know from u people that isi it necessary to implement the
> process_mad for a HCA.
> 
> After looking into the implementations of process_mad in ipath and
> mthca drivers i have fount that they are used to reply the MADs with
> port_info,gid_info,sm_info etc..
> 
> But isn't it handled by SMA in the host......

The SMA can either be in the host or in firmware (as is typical with the
Mellanox silicon).

> i am little bit confused now .
> please just whether  it is required to implement process_mad (suppose)
> for new HCA driver....

It is. For an example of a host-based (software) SMA, see
drivers/infiniband/hw/ipath/ipath_mad.c

> if it is required  why?

The driver is needed to obtain the information about the IB node that
goes into the MADs sent in response to SMA queries. It may also issue
some traps. The same applies to the PMA.

-- Hal

> Please CC your replies to me.
> 
> regards,
> K.Mahesh.


From mst at mellanox.co.il  Wed Dec 13 03:49:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 13 Dec 2006 13:49:16 +0200
Subject: [openib-general] [PATCH] mthca: move code from post send to post
	receive
Message-ID: <20061213114916.GA23726@mellanox.co.il>

Place SQ wrid's first in wrid buffer, to eliminate an add operation
in the send datapath.

This keeps binary size constant, moving code from post send to post receive:
post send is a latency-sensitive operation, while post receive is done
beforehand, so it's not.  Additionally, a generic ULP mixing send and RDMA does
more post sends than post receives (RDMA does not have a matching post receive).

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>
Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

While unlikely to give a large gain, this makes sense to me.
Please consider for 2.6.20.

diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 149b369..433f9a8 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -537,8 +537,7 @@ static inline int mthca_poll_one(struct 
 		wq = &(*cur_qp)->sq;
 		wqe_index = ((be32_to_cpu(cqe->wqe) - (*cur_qp)->send_wqe_offset)
 			     >> wq->wqe_shift);
-		entry->wr_id = (*cur_qp)->wrid[wqe_index +
-					       (*cur_qp)->rq.max];
+		entry->wr_id = (*cur_qp)->wrid[wqe_index];
 	} else if ((*cur_qp)->ibqp.srq) {
 		struct mthca_srq *srq = to_msrq((*cur_qp)->ibqp.srq);
 		u32 wqe = be32_to_cpu(cqe->wqe);
@@ -558,7 +557,7 @@ static inline int mthca_poll_one(struct 
 		 */
 		if (unlikely(wqe_index < 0))
 			wqe_index = wq->max - 1;
-		entry->wr_id = (*cur_qp)->wrid[wqe_index];
+		entry->wr_id = (*cur_qp)->wrid[wqe_index + (*cur_qp)->sq.max];
 	}
 
 	if (wq) {
diff --git a/drivers/infiniband/hw/mthca/mthca_qp.c b/drivers/infiniband/hw/mthca/mthca_qp.c
index 6a7822e..9e6f715 100644
--- a/drivers/infiniband/hw/mthca/mthca_qp.c
+++ b/drivers/infiniband/hw/mthca/mthca_qp.c
@@ -1690,7 +1690,7 @@ int mthca_tavor_post_send(struct ib_qp *
 			size += sizeof (struct mthca_data_seg) / 16;
 		}
 
-		qp->wrid[ind + qp->rq.max] = wr->wr_id;
+		qp->wrid[ind] = wr->wr_id;
 
 		if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) {
 			mthca_err(dev, "opcode invalid\n");
@@ -1810,7 +1810,7 @@ int mthca_tavor_post_receive(struct ib_q
 			size += sizeof (struct mthca_data_seg) / 16;
 		}
 
-		qp->wrid[ind] = wr->wr_id;
+		qp->wrid[ind + qp->sq.max] = wr->wr_id;
 
 		((struct mthca_next_seg *) prev_wqe)->nda_op =
 			cpu_to_be32((ind << qp->rq.wqe_shift) | 1);
@@ -2068,7 +2068,7 @@ int mthca_arbel_post_send(struct ib_qp *
 			size += sizeof (struct mthca_data_seg) / 16;
 		}
 
-		qp->wrid[ind + qp->rq.max] = wr->wr_id;
+		qp->wrid[ind] = wr->wr_id;
 
 		if (wr->opcode >= ARRAY_SIZE(mthca_opcode)) {
 			mthca_err(dev, "opcode invalid\n");
@@ -2192,7 +2192,7 @@ int mthca_arbel_post_receive(struct ib_q
 			((struct mthca_data_seg *) wqe)->addr = 0;
 		}
 
-		qp->wrid[ind] = wr->wr_id;
+		qp->wrid[ind + qp->sq.max] = wr->wr_id;
 
 		++ind;
 		if (unlikely(ind >= qp->rq.max))


-- 
MST
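
For anyone reading the patch without the driver source at hand, the index arithmetic can be sketched in plain C as below. The struct and function names are hypothetical stand-ins (only the sq.max and rq.max fields mirror what the diff uses); the point is just that the send path loses its addition and the receive path gains one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-ins for the driver structures. */
struct sketch_wq {
	int max;		/* number of work queue entries */
};

struct sketch_qp {
	struct sketch_wq sq;	/* send queue */
	struct sketch_wq rq;	/* receive queue */
};

/* Old layout: [ RQ wrids | SQ wrids ], so the send path adds rq.max. */
static int sketch_old_sq_slot(const struct sketch_qp *qp, int ind)
{
	return ind + qp->rq.max;
}

static int sketch_old_rq_slot(const struct sketch_qp *qp, int ind)
{
	(void)qp;
	return ind;
}

/* New layout: [ SQ wrids | RQ wrids ], so the receive path adds sq.max. */
static int sketch_new_sq_slot(const struct sketch_qp *qp, int ind)
{
	(void)qp;
	return ind;
}

static int sketch_new_rq_slot(const struct sketch_qp *qp, int ind)
{
	return ind + qp->sq.max;
}
```

With sq.max = 4 and rq.max = 8, send index 2 moves from slot 10 to slot 2 and receive index 2 moves from slot 2 to slot 6; both layouts use the same sq.max + rq.max slots in total.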


From halr at voltaire.com  Wed Dec 13 03:52:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 13 Dec 2006 06:52:43 -0500
Subject: [openib-general] mad_agents
In-Reply-To: <000e01c71e1a$46939ad0$21606d86@one7>
References: <000e01c71e1a$46939ad0$21606d86@one7>
Message-ID: <1166010756.28709.60158.camel@hal.voltaire.com>

Hi Michael,

On Tue, 2006-12-12 at 13:21, Michael Arndt wrote:
> Hi,
> 
> the following statements about functions and modules refer to the mad.c, 
> agent.c and user_mad.c file.
> 
> During the initialisation of the mad module a function ib_agent_port_open is 
> called (ib_mad_init_device -> ib_mad_port_open). At this point an agent is 
> registered (ib_register_mad_agent), without a MAD registration request 
> applied. So my question is, what is this agent for?

When there is no registration, that means those agents are "send only"
agents. "Send only" means the agent will only receive solicited
responses and will not receive any unsolicited MADs.

Those agents that are started are for SMI (QP0) and GSI (QP1). The SMA
sits on QP0 (shared with SM). Many GS agents (including the PMA, also
SA) sit on top of QP1.

> And is it right that the agent registered by the umad module 
> (ib_umad_ioctl -> ib_umad_reg_agent -> ib_register_mad_agent) gets all the 
> SMP packets from the device and passes them to the SM (via read on the 
> file descriptor)?

user_mad registrations occur via the ioctl. It only gets those packets
it registers for. These can include SMPs as well as GMPs depending on
user agents registered. The diagnostics use these (DR SMPs, LR SMPs, and
GMPs).

The agent only gets those MADs that the SMA does not handle. This is
done via the status passed back to process_mad (IB_MAD_RESULT_XXXXX).

The SM registers for request/response matching on both SM and SA classes
with different method masks (as different methods apply). There are also
some unsolicited receives (e.g. traps) to be handled.

When request/response matching is used, the agent is determined by the
high 32 bits of the transaction ID which is overwritten in the (send of
the) request. Those 32 bits are the agent ID and used for demux to the
proper agent when the response (or timeout) occurs.
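
As a rough illustration of the transaction ID handling just described, the sketch below packs an agent ID into the high 32 bits of a 64-bit TID and recovers it again. The helper names are hypothetical, and the TID is treated as a host-order integer for simplicity (on the wire it is carried big-endian), so this is only a sketch of the bit layout, not the MAD layer's actual code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Overwrite the high 32 bits of the client's TID with the agent ID,
 * keeping the client's low 32 bits. Hypothetical helper.
 */
static uint64_t sketch_tid_set_agent(uint32_t agent_id, uint64_t client_tid)
{
	return ((uint64_t)agent_id << 32) | (client_tid & 0xffffffffULL);
}

/* Recover the agent ID from a response TID, for demux to the agent. */
static uint32_t sketch_tid_agent(uint64_t tid)
{
	return (uint32_t)(tid >> 32);
}

/* The part of the TID the client still controls. */
static uint32_t sketch_tid_client(uint64_t tid)
{
	return (uint32_t)(tid & 0xffffffffULL);
}
```

A consequence of this layout is that whatever the client placed in the high 32 bits of its own TID is lost on send, so only the low 32 bits are available for matching requests within one agent.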

> What is about the SMA? Where are the SMPs filtered between SMA and SM?

process_mad in the MAD layer passes them to the driver (mthca_mad.c for
one example) for filtering. This filtering is based on the status
returned (IB_MAD_RESULT_XXXX in ib_mad.h).
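
A simplified sketch of that status-based filtering follows. The SKETCH_* flags are local mirrors of the IB_MAD_RESULT_* values as I recall them from ib_mad.h (please check the header rather than trusting these numbers), and sketch_dispatch is a hypothetical function that only models the decision, not the MAD layer itself.

```c
#include <assert.h>

/* Local mirrors of the IB_MAD_RESULT_* flags; values are assumptions. */
enum {
	SKETCH_MAD_RESULT_FAILURE  = 0,
	SKETCH_MAD_RESULT_SUCCESS  = 1 << 0,
	SKETCH_MAD_RESULT_REPLY    = 1 << 1,
	SKETCH_MAD_RESULT_CONSUMED = 1 << 2
};

/* What happens to an incoming MAD after the driver has looked at it. */
enum sketch_action {
	SKETCH_DONE,		/* driver consumed it, nothing more to do */
	SKETCH_SEND_REPLY,	/* driver built a response to send back */
	SKETCH_TO_AGENT		/* hand it to a registered agent, e.g. the SM */
};

/* Decide the next step from the status the driver returned. */
static enum sketch_action sketch_dispatch(int result)
{
	if (result & SKETCH_MAD_RESULT_SUCCESS) {
		if (result & SKETCH_MAD_RESULT_CONSUMED)
			return SKETCH_DONE;
		if (result & SKETCH_MAD_RESULT_REPLY)
			return SKETCH_SEND_REPLY;
	}
	/* Simplification: everything else goes on to agent lookup. */
	return SKETCH_TO_AGENT;
}
```

In this model an SMP the SMA answers comes back as SUCCESS | REPLY and never reaches the SM's agent, while one the SMA does not handle comes back as plain SUCCESS and is passed on.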

> I also would like to say that it would be really nice if there were some 
> papers, diagrams, graphics or anything else which explain how the whole 
> openib system works. The source code as the only reference isn't really 
> helpful for new developers.

Yes, that would be nice. Perhaps you can help here.

-- Hal

> Thanks Michael 


From yangdong at ncic.ac.cn  Wed Dec 13 04:29:29 2006
From: yangdong at ncic.ac.cn (yangdong)
Date: Wed, 13 Dec 2006 20:29:29 +0800
Subject: [openib-general] non-transparent integration with SDP in ofed
Message-ID: <457FF229.9020306@ncic.ac.cn>

Hello all:

I have run into a problem. For non-transparent integration with SDP, no
special environment variable is necessary. It is only required to
recompile the application, replacing AF_INET with AF_INET_SDP. The
constant AF_INET_SDP is defined in sdp_inet.h.

When I want to use non-transparent integration with SDP in the
kernel, I replace PF_INET with PF_INET_SDP and AF_INET with
AF_INET_SDP, then recompile the kernel module.

e.g. sock_create_kern (PF_INET_SDP, SOCK_STREAM, IPPROTO_TCP, &new_sock);
sin.sin_family = AF_INET;

That worked well when I used IBGD, but it does not work with OFED.
What should I do?


Some info :
./ibstat
CA 'mthca0'
CA type: MT23108
Number of ports: 2
Firmware version: 3.3.2
Hardware version: a1
Node GUID: 0x0002c90200004c68
System image GUID: 0x0002c90200004c6b
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 7
LMC: 0
SM lid: 33
Capability mask: 0x00510a68
Port GUID: 0x0002c90200004c69
Port 2:
State: Down
Physical state: Polling
Rate: 2
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00510a68
Port GUID: 0x0002c90200004c6a


From tziporet at mellanox.co.il  Wed Dec 13 08:06:45 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Wed, 13 Dec 2006 18:06:45 +0200
Subject: [openib-general] OFED 1.2 howto update
Message-ID: <45802515.20605@mellanox.co.il>

Hi All,
I added documentation to the Wiki on how to add or change a component in
the OFED 1.2 development package:
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+HowTo

These instructions should be used to add the new components that we
agreed on in the previous meeting (VNIC, iWARP).

Please take a look and send me any questions you have, so that I can
improve this page.

Tziporet


From bos at pathscale.com  Wed Dec 13 09:56:25 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 13 Dec 2006 09:56:25 -0800
Subject: [openib-general] version #defines for the kernel
In-Reply-To: <0a3901c71e1e$c431f910$0281a8c0@ebpc>
References: <0a3901c71e1e$c431f910$0281a8c0@ebpc>
Message-ID: <45803EC9.7020004@pathscale.com>

Eric Barton wrote:

> Blood on the floor somewhere I'd hope :)
> 
> Or maybe just no #define for the version, since the person doing the
> backport clearly isn't worried about compatibility with out-of-tree
> code.

You're better off planning for the backport mess than hoping for API 
version definitions that will not be reliably present.  Getting driver 
code to compile will be the least of your worries.

	<b


From bos at pathscale.com  Wed Dec 13 08:57:19 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 13 Dec 2006 09:57:19 -0700
Subject: [openib-general] [PATCH 0 of 2] Add memcpy_uncached_read,
 a memcpy that doesn't cache reads
Message-ID: <patchbomb.1166032639@eng-12.pathscale.com>

Hi, Andrew -

Here's a suitably renamed uncached-read memcpy.  I hope the name is now
self-explanatory.

	<b


From bos at pathscale.com  Wed Dec 13 08:57:21 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 13 Dec 2006 09:57:21 -0700
Subject: [openib-general] [PATCH 2 of 2] IB/ipath - use memcpy_uncached_read
 in RDMA interrupt handler to reduce packet loss
In-Reply-To: <patchbomb.1166032639@eng-12.pathscale.com>
Message-ID: <f25d77f7699889775581.1166032641@eng-12.pathscale.com>

In cases where a large incoming RDMA is being received, we have to
copy data inside the interrupt handler before we can ACK each packet.
The source is DMAed to by the hardware, which means that the CPU won't
have it cached.  We only read the source this one time; using normal load
instructions pollutes the dcache with useless data, reducing performance
to the point where we can lose a significant number of packets.

We use memcpy_uncached_read to try to not fill the dcache with useless data.
Avoiding the cache refill penalty lets us keep up better with the sender,
resulting in many fewer dropped packets.

Signed-off-by: Bryan O'Sullivan <bryan.osullivan at qlogic.com>

diff -r e7c3b265254b -r f25d77f76998 drivers/infiniband/hw/ipath/ipath_verbs.c
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 13 09:51:09 2006 -0800
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c	Wed Dec 13 09:51:09 2006 -0800
@@ -167,7 +167,7 @@ void ipath_copy_sge(struct ipath_sge_sta
 		BUG_ON(len == 0);
 		if (len > length)
 			len = length;
-		memcpy(sge->vaddr, data, len);
+		memcpy_uncached_read(sge->vaddr, data, len);
 		sge->vaddr += len;
 		sge->length -= len;
 		sge->sge_length -= len;


From bos at pathscale.com  Wed Dec 13 08:57:20 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Wed, 13 Dec 2006 09:57:20 -0700
Subject: [openib-general] [PATCH 1 of 2] Add memcpy_uncached_read,
 a memcpy that tries to reduce cache pressure
In-Reply-To: <patchbomb.1166032639@eng-12.pathscale.com>
Message-ID: <e7c3b265254b705286f1.1166032640@eng-12.pathscale.com>

This copy routine is memcpy-compatible, but on some architectures will use
cache-bypassing loads to avoid bringing the source data into the cache.

One case where this is useful is when a device issues a DMA to a memory
region, and the CPU must copy the DMAed data elsewhere before doing any
work with it.  Since the source data is read-once, write-never from the
CPU's perspective, caching the data at those addresses can only evict
potentially useful data.

We provide an x86_64 implementation that uses SSE non-temporal loads,
and a generic version that falls back to plain memcpy.

Implementors for other arches should not use cache-bypassing stores to
the destination, as in most cases, the destination is accessed almost
immediately after a copy finishes.

Signed-off-by: Bryan O'Sullivan <bryan.osullivan at qlogic.com>

diff -r 4a0c3ede5076 -r e7c3b265254b arch/x86_64/lib/Makefile
--- a/arch/x86_64/lib/Makefile	Tue Dec 12 10:43:21 2006 -0800
+++ b/arch/x86_64/lib/Makefile	Wed Dec 13 09:51:09 2006 -0800
@@ -9,4 +9,5 @@ lib-y := csum-partial.o csum-copy.o csum
 lib-y := csum-partial.o csum-copy.o csum-wrappers.o delay.o \
 	usercopy.o getuser.o putuser.o  \
 	thunk.o clear_page.o copy_page.o bitstr.o bitops.o
-lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o
+lib-y += memcpy.o memmove.o memset.o copy_user.o rwlock.o \
+	memcpy_uncached_read.o
diff -r 4a0c3ede5076 -r e7c3b265254b arch/x86_64/lib/memcpy_uncached_read.S
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/arch/x86_64/lib/memcpy_uncached_read.S	Wed Dec 13 09:51:09 2006 -0800
@@ -0,0 +1,142 @@
+/*
+ * Copyright (c) 2006 QLogic Corporation.  All Rights Reserved.
+ *
+ * This file is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
+ */
+
+/*
+ * memcpy_uncached_read - memcpy-compatible copy routine, using streaming loads
+ * @dest: destination address
+ * @src: source address (will not be cached)
+ * @count: number of bytes to copy
+ *
+ * Use streaming loads and normal stores for a special-case copy where
+ * we know we won't be reading the source again, but will be reading the
+ * destination again soon.
+ */
+	.text
+	.p2align 4,,15
+	/* rdi  destination, rsi source, rdx count */
+	.globl	memcpy_uncached_read
+	.type	memcpy_uncached_read, @function
+memcpy_uncached_read:
+	movq	%rdi, %rax
+.L5:
+	cmpq	$15, %rdx
+	ja	.L34
+.L3:
+	cmpl	$8, %edx	/* rdx is 0..15 */
+	jbe	.L9
+.L6:
+	testb	$8, %dxl	/* rdx is 3,5,6,7,9..15 */
+	je	.L13
+	movq	(%rsi), %rcx
+	addq	$8, %rsi
+	movq	%rcx, (%rdi)
+	addq	$8, %rdi
+.L13:
+	testb	$4, %dxl
+	je	.L15
+	movl	(%rsi), %ecx
+	addq	$4, %rsi
+	movl	%ecx, (%rdi)
+	addq	$4, %rdi
+.L15:
+	testb	$2, %dxl
+	je	.L17
+	movzwl	(%rsi), %ecx
+	addq	$2, %rsi
+	movw	%cx, (%rdi)
+	addq	$2, %rdi
+.L17:
+	testb	$1, %dxl
+	je	.L33
+.L1:
+	movzbl	(%rsi), %ecx
+	movb	%cl, (%rdi)
+.L33:
+	ret
+.L34:
+	cmpq	$63, %rdx	/* rdx is > 15 */
+	ja	.L64
+	movl	$16, %ecx	/* rdx is 16..63 */
+.L25:
+	movq	8(%rsi), %r8
+	movq	(%rsi), %r9
+	addq	%rcx, %rsi
+	movq	%r8, 8(%rdi)
+	movq	%r9, (%rdi)
+	addq	%rcx, %rdi
+	subq	%rcx, %rdx
+	cmpl	%edx, %ecx	/* is rdx >= 16? */
+	jbe	.L25
+	jmp	.L3		/* rdx is 0..15 */
+	.p2align 4,,7
+.L64:
+	movl	$64, %ecx
+.L42:
+	prefetchnta	128(%rsi)
+	movq	(%rsi), %r8
+	movq	8(%rsi), %r9
+	movq	16(%rsi), %r10
+	movq	24(%rsi), %r11
+	subq	%rcx, %rdx
+	movq	%r8, (%rdi)
+	movq	32(%rsi), %r8
+	movq	%r9, 8(%rdi)
+	movq	40(%rsi), %r9
+	movq	%r10, 16(%rdi)
+	movq	48(%rsi), %r10
+	movq	%r11, 24(%rdi)
+	movq	56(%rsi), %r11
+	addq	%rcx, %rsi
+	movq	%r8, 32(%rdi)
+	movq	%r9, 40(%rdi)
+	movq	%r10, 48(%rdi)
+	movq	%r11, 56(%rdi)
+	addq	%rcx, %rdi
+	cmpq	%rdx, %rcx	/* is rdx >= 64? */
+	jbe	.L42
+	sfence
+	orl	%edx, %edx
+	je	.L33
+	jmp	.L5
+.L9:
+	jmp	*.L12(,%rdx,8)	/* rdx is 0..8 */
+	.section	.rodata
+	.align 8
+	.align 4
+.L12:
+	.quad	.L33
+	.quad	.L1
+	.quad	.L2
+	.quad	.L6
+	.quad	.L4
+	.quad	.L6
+	.quad	.L6
+	.quad	.L6
+	.quad	.L8
+	.text
+.L2:
+	movzwl	(%rsi), %ecx
+	movw	%cx, (%rdi)
+	ret
+.L4:
+	movl	(%rsi), %ecx
+	movl	%ecx, (%rdi)
+	ret
+.L8:
+	movq	(%rsi), %rcx
+	movq	%rcx, (%rdi)
+	ret
diff -r 4a0c3ede5076 -r e7c3b265254b include/asm-x86_64/string.h
--- a/include/asm-x86_64/string.h	Tue Dec 12 10:43:21 2006 -0800
+++ b/include/asm-x86_64/string.h	Wed Dec 13 09:51:09 2006 -0800
@@ -39,6 +39,8 @@ extern void *__memcpy(void *to, const vo
 		 __ret = __builtin_memcpy((dst),(src),__len);	\
 	   __ret; }) 
 
+#define __HAVE_ARCH_MEMCPY_UNCACHED_READ
+extern void *memcpy_uncached_read(void *to, const void *from, size_t len); 
 
 #define __HAVE_ARCH_MEMSET
 void *memset(void *s, int c, size_t n);
diff -r 4a0c3ede5076 -r e7c3b265254b include/linux/string.h
--- a/include/linux/string.h	Tue Dec 12 10:43:21 2006 -0800
+++ b/include/linux/string.h	Wed Dec 13 09:51:09 2006 -0800
@@ -85,6 +85,9 @@ extern void * memset(void *,int,__kernel
 #ifndef __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *,const void *,__kernel_size_t);
 #endif
+#ifndef __HAVE_ARCH_MEMCPY_UNCACHED_READ
+#define memcpy_uncached_read(dest, src, count) memcpy((dest), (src), (count))
+#endif
 #ifndef __HAVE_ARCH_MEMMOVE
 extern void * memmove(void *,const void *,__kernel_size_t);
 #endif
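
For readers who want to try the idea outside the kernel, here is a rough
userspace sketch of a "read the source once" copy. It is not the assembly
from the patch: it assumes GCC or Clang, uses __builtin_prefetch with
locality 0 as a stand-in for prefetchnta, and degrades to plain memcpy
elsewhere, much like the generic fallback macro. The name copy_read_once
and the constant READ_ONCE_CHUNK are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Assumed step size, mirroring the 64-byte main loop in the patch. */
#define READ_ONCE_CHUNK 64

/*
 * memcpy-compatible copy for a source that is read exactly once.
 * Hypothetical helper: hints that upcoming source data has no temporal
 * locality, then copies with ordinary stores to the destination.
 */
void *copy_read_once(void *dest, const void *src, size_t count)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	while (count >= READ_ONCE_CHUNK) {
#if defined(__GNUC__)
		/* Only hint at addresses that are still inside the buffer. */
		if (count >= 3 * READ_ONCE_CHUNK)
			__builtin_prefetch(s + 2 * READ_ONCE_CHUNK, 0, 0);
#endif
		memcpy(d, s, READ_ONCE_CHUNK);
		d += READ_ONCE_CHUNK;
		s += READ_ONCE_CHUNK;
		count -= READ_ONCE_CHUNK;
	}
	if (count)
		memcpy(d, s, count);	/* tail of 1..63 bytes */
	return dest;
}
```

A caller uses it exactly like memcpy, e.g. copy_read_once(dst, dma_buf, len).
Whether the hint helps at all depends on the CPU and should be measured.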


From philippe.bernadat at hp.com  Wed Dec 13 10:02:09 2006
From: philippe.bernadat at hp.com (Philippe Bernadat)
Date: Wed, 13 Dec 2006 19:02:09 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc>
References: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc>
Message-ID: <45804021.9050209@hp.com>

Roland,

Attached are the two lspci outputs.

The only differences I see are:

[philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
1d0
< pcilib: Resource 5 in /sys/bus/pci/devices/0000:00:1f.1/resource has a 64-bit address, ignoring
40c39
< 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
---
 > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
[philippe at hamish o2ib]$


 >  >   Roland may be able to comment on whether there are performance differences
 >  > for interrupt-driven CQ between the old VAPI stacks and OFED.
 >
 > I think OFED is probably faster than any other stack I know of...
 >
 > I think MST's idea of PCI tuning issues is probably right.  Can you
 > send the output of
 >
 >    lspci -vxxx -d15b3:
 >
 > with the two stacks?
 >
 >  - R.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lspci.ofed
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061213/0eed2e4c/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lspci.vib
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061213/0eed2e4c/attachment-0001.ksh>

From mst at mellanox.co.il  Wed Dec 13 10:09:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 13 Dec 2006 20:09:16 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration by
 filling MTTs directly
In-Reply-To: <20061212151039.GJ26613@mellanox.co.il>
References: <20061212151039.GJ26613@mellanox.co.il>
Message-ID: <20061213180916.GA1689@mellanox.co.il>

Speed up memory registration by filling in MTTs directly.  This reduces the
number of FW commands needed to register an MR by at least a factor of 2.  This
applies to all memfree cards, and to tavor mode on 64 bit systems with the patch
I posted earlier.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

Roland, the previous version of this patch had a bug on memfree.
I noticed you didn't push these patches out to Linus yet so I did a
re-spin. Let me know if you prefer an incremental patch.

This applies on top of "make all MRs accessible for FMR mapping".

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_dev.h
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -464,6 +464,8 @@ void mthca_uar_free(struct mthca_dev *de
 int mthca_pd_alloc(struct mthca_dev *dev, int privileged, struct mthca_pd *pd);
 void mthca_pd_free(struct mthca_dev *dev, struct mthca_pd *pd);
 
+int mthca_write_mtt_size(struct mthca_dev *dev);
+
 struct mthca_mtt *mthca_alloc_mtt(struct mthca_dev *dev, int size);
 void mthca_free_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt);
 int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_mr.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_mr.c
@@ -244,8 +244,8 @@ void mthca_free_mtt(struct mthca_dev *de
 	kfree(mtt);
 }
 
-int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
-		    int start_index, u64 *buffer_list, int list_len)
+static int __mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			     int start_index, u64 *buffer_list, int list_len)
 {
 	struct mthca_mailbox *mailbox;
 	__be64 *mtt_entry;
@@ -296,6 +296,84 @@ out:
 	return err;
 }
 
+void mthca_tavor_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	u64 __iomem *mtts;
+	u32 mtt_seg;
+	int i;
+
+	mtt_seg = mtt->first_seg * MTHCA_MTT_SEG_SIZE;
+	mtts = dev->mr_table.tavor_fmr.mtt_base + mtt_seg + start_index * sizeof (u64);
+	for (i = 0; i < list_len; ++i) {
+		__be64 mtt_entry = cpu_to_be64(buffer_list[i] |
+					       MTHCA_MTT_FLAG_PRESENT);
+		mthca_write64_raw(mtt_entry, mtts + i);
+	}
+}
+
+void mthca_arbel_write_mtt_seg(struct mthca_dev *dev, struct mthca_mtt *mtt,
+			      int start_index, u64 *buffer_list, int list_len)
+{
+	__be64 *mtts;
+	int i;
+	int s = start_index * sizeof (u64);
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	BUG_ON(s / PAGE_SIZE != (s + list_len * sizeof(u64) - 1) / PAGE_SIZE);
+	/* Require full segments */
+	BUG_ON(s % MTHCA_MTT_SEG_SIZE);
+
+	mtts = mthca_table_find(dev->mr_table.mtt_table, mtt->first_seg +
+				s / MTHCA_MTT_SEG_SIZE);
+
+	BUG_ON(!mtts);
+
+	for (i = 0; i < list_len; ++i)
+		mtts[i] = cpu_to_be64(buffer_list[i] | MTHCA_MTT_FLAG_PRESENT);
+}
+
+int mthca_write_mtt_size(struct mthca_dev *dev)
+{
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		/*
+		 * Be friendly to WRITE_MTT command
+		 * and leave two empty slots for the
+		 * index and reserved fields of the
+		 * mailbox.
+		 */
+		return PAGE_SIZE / sizeof (u64) - 2;
+
+	/* For Arbel, all MTTs must fit in the same page. */
+	return mthca_is_memfree(dev) ? (PAGE_SIZE / sizeof (u64)) : 0x7ffffff;
+}
+
+int mthca_write_mtt(struct mthca_dev *dev, struct mthca_mtt *mtt,
+		    int start_index, u64 *buffer_list, int list_len)
+{
+	int size = mthca_write_mtt_size(dev);
+	int chunk;
+
+	if (dev->mr_table.fmr_mtt_buddy != &dev->mr_table.mtt_buddy)
+		return __mthca_write_mtt(dev, mtt, start_index, buffer_list, list_len);
+
+	while (list_len > 0) {
+		chunk = min(size, list_len);
+		if (mthca_is_memfree(dev))
+			mthca_arbel_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+		else
+			mthca_tavor_write_mtt_seg(dev, mtt, start_index,
+						  buffer_list, chunk);
+
+		list_len    -= chunk;
+		start_index += chunk;
+		buffer_list += chunk;
+	}
+
+	return 0;
+}
+
 static inline u32 tavor_hw_index_to_key(u32 ind)
 {
 	return ind;
Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_provider.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1015,6 +1015,7 @@ static struct ib_mr *mthca_reg_user_mr(s
 	int shift, n, len;
 	int i, j, k;
 	int err = 0;
+	int write_mtt_size;
 
 	shift = ffs(region->page_size) - 1;
 
@@ -1040,6 +1041,8 @@ static struct ib_mr *mthca_reg_user_mr(s
 
 	i = n = 0;
 
+	write_mtt_size = min(mthca_write_mtt_size(dev), (int)(PAGE_SIZE / sizeof *pages));
+
 	list_for_each_entry(chunk, &region->chunk_list, list)
 		for (j = 0; j < chunk->nmap; ++j) {
 			len = sg_dma_len(&chunk->page_list[j]) >> shift;
@@ -1047,14 +1050,11 @@ static struct ib_mr *mthca_reg_user_mr(s
 				pages[i++] = sg_dma_address(&chunk->page_list[j]) +
 					region->page_size * k;
 				/*
-				 * Be friendly to WRITE_MTT command
-				 * and leave two empty slots for the
-				 * index and reserved fields of the
-				 * mailbox.
+				 * Be friendly to write_mtt and pass it chunks
+				 * of appropriate size.
 				 */
-				if (i == PAGE_SIZE / sizeof (u64) - 2) {
-					err = mthca_write_mtt(dev, mr->mtt,
-							      n, pages, i);
+				if (i == write_mtt_size) {
+					err = mthca_write_mtt(dev, mr->mtt, n, pages, i);
 					if (err)
 						goto mtt_done;
 					n += i;
-- 
MST
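
The chunking loop in the new mthca_write_mtt() can be sketched in plain C
as below. This is only an illustration of the control flow, not driver
code: write_in_chunks, write_seg_fn and count_entries are hypothetical
names, and the callback stands in for the Tavor/Arbel segment writers.

```c
#include <stddef.h>

/* Hypothetical stand-in for mthca_{tavor,arbel}_write_mtt_seg(). */
typedef void (*write_seg_fn)(int start_index,
			     const unsigned long long *list,
			     int len, void *ctx);

/*
 * Split list_len entries into pieces of at most max_chunk entries and
 * hand each piece to write_seg. Returns the number of pieces written,
 * or -1 if max_chunk is not positive.
 */
int write_in_chunks(int start_index, const unsigned long long *list,
		    int list_len, int max_chunk,
		    write_seg_fn write_seg, void *ctx)
{
	int calls = 0;

	if (max_chunk <= 0)
		return -1;

	while (list_len > 0) {
		int chunk = list_len < max_chunk ? list_len : max_chunk;

		write_seg(start_index, list, chunk, ctx);
		list_len    -= chunk;
		start_index += chunk;
		list        += chunk;
		++calls;
	}
	return calls;
}

/* Example callback: add up how many entries were written into *ctx. */
void count_entries(int start_index, const unsigned long long *list,
		   int len, void *ctx)
{
	(void)start_index;
	(void)list;
	*(int *)ctx += len;
}
```

With 1000 entries and a limit of 512 per call, the loop issues two writes
(512 entries, then 488).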


From eitan at mellanox.co.il  Wed Dec 13 07:52:50 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 13 Dec 2006 17:52:50 +0200
Subject: [openib-general] non-transparent integration with SDP in ofed
In-Reply-To: <457FF229.9020306@ncic.ac.cn>
References: <457FF229.9020306@ncic.ac.cn>
Message-ID: <458021D2.1070205@mellanox.co.il>

Hi

SDP changed between IBGD and OFED:
You only need to replace the address family with AF_INET_SDP when opening
the socket.
You should not use AF_INET_SDP when filling in a struct sockaddr_in.
Its sin_family should stay AF_INET.
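
A userspace sketch of that split, for illustration only. It assumes
AF_INET_SDP has the value 27, as I believe sdp_inet.h defines it; use the
real header when it is available. The helper names sdp_socket and
sdp_fill_addr are made up for this example. In a kernel module the same
idea applies to sock_create_kern() and the address you pass to it.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: value taken from OFED's sdp_inet.h; prefer that header. */
#ifndef AF_INET_SDP
#define AF_INET_SDP 27
#endif

/* Only the socket itself is opened in the SDP family. */
int sdp_socket(void)
{
	/* Returns -1 if the running kernel has no SDP support loaded. */
	return socket(AF_INET_SDP, SOCK_STREAM, 0);
}

/* The address passed to connect()/bind() keeps the plain IPv4 family. */
int sdp_fill_addr(struct sockaddr_in *sin, const char *ip,
		  unsigned short port)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;	/* not AF_INET_SDP */
	sin->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &sin->sin_addr) == 1 ? 0 : -1;
}
```

Typical use would be fd = sdp_socket(), then sdp_fill_addr(&sin, ...), then
connect(fd, (struct sockaddr *)&sin, sizeof(sin)).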

Hope this helps.

EZ

yangdong wrote:
> Hello all:
>
> I have run into a problem. For non-transparent integration with SDP, no
> special environment variable is necessary. It is only required to
> recompile the application, replacing AF_INET with AF_INET_SDP. The
> constant AF_INET_SDP is defined in sdp_inet.h.
>
> When I want to use non-transparent integration with SDP in the
> kernel, I replace PF_INET with PF_INET_SDP and AF_INET with
> AF_INET_SDP, then recompile the kernel module.
>
> e.g. sock_create_kern (PF_INET_SDP, SOCK_STREAM, IPPROTO_TCP, &new_sock);
> sin.sin_family = AF_INET;
>
> That worked with IBGD, but it does not work with OFED. What should I
> do?
>
>
>
> Some info :
> ./ibstat
> CA 'mthca0'
> CA type: MT23108
> Number of ports: 2
> Firmware version: 3.3.2
> Hardware version: a1
> Node GUID: 0x0002c90200004c68
> System image GUID: 0x0002c90200004c6b
> Port 1:
> State: Active
> Physical state: LinkUp
> Rate: 10
> Base lid: 7
> LMC: 0
> SM lid: 33
> Capability mask: 0x00510a68
> Port GUID: 0x0002c90200004c69
> Port 2:
> State: Down
> Physical state: Polling
> Rate: 2
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x00510a68
> Port GUID: 0x0002c90200004c6a


From Brian.Cain at ge.com  Wed Dec 13 14:09:27 2006
From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare))
Date: Wed, 13 Dec 2006 17:09:27 -0500
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP users
 who didn't RTFM
Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com>

There's gotta be a good way to let people know they're going down the
wrong path on this one.

Signed-off-by: Brian Cain <Brian.Cain at ge.com>

--- ofed/openib/scripts/install.sh      2006-12-13 14:48:51.747995000 -0700
+++ ofed_fix/openib/scripts/install.sh  2006-12-13 14:59:00.586574000 -0700
@@ -1070,6 +1070,14 @@
                         echo "# Load SDP module" >> ${IB_CONF_DIR}/openib.conf
                         echo "# SDP_LOAD=no" >> ${IB_CONF_DIR}/openib.conf
                 fi
+
+
+                if [[ "$srp" == "y" || "$srp_target" == "y" ]] &&
+                   [[ $(egrep 'flags.*lm' /proc/cpuinfo | wc -l) > 0 ]] &&
+                   [[ $(uname -p | egrep 'i[3-9]86' | wc -l) > 0 ]]; then
+                   echo '!!WARNING!! SRP is not supported for 32-bit OS running on 64-bit capable hardware'
+                fi
+

                 if [ "$srp" == "y" ]; then
                         echo >> ${IB_CONF_DIR}/openib.conf

--
-Brian 


From jlentini at netapp.com  Wed Dec 13 14:09:45 2006
From: jlentini at netapp.com (James Lentini)
Date: Wed, 13 Dec 2006 17:09:45 -0500 (EST)
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <457F426B.7020104@mellanox.com>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
Message-ID: <Pine.LNX.4.64.0612131552340.20796@jlentini-linux.nane.netapp.com>


On Tue, 12 Dec 2006, Vu Pham wrote:

> > > 2.  While some clients run I/Os, one idle client try to access the mount
> > > point ie. *ls* and get I/O input error. I see these error messages on
> > > server log

Was there anything in the log before this point? I'd expect to see a 
message starting with "svcrdma: failed to post SQ..."

> > > Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22
> > > Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion
> > > Dec 12 13:58:29 ibd202 kernel:  ctxt=ffff810242130800, count=1 on
> > > xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5
> > > ...
> > > Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for unknown
> > > QP 2e0408


From rdreier at cisco.com  Wed Dec 13 14:21:29 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:21:29 -0800
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com>
	(Brian Cain's message of "Wed, 13 Dec 2006 17:09:27 -0500")
References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEF19@CINMLVEM11.e2k.ad.ge.com>
Message-ID: <ada8xhbsahi.fsf@cisco.com>

 > +                   echo '!!WARNING!! SRP is not supported for 32-bit OS running on 64-bit capable hardware'

Did I miss something?  Why doesn't SRP work with 32-bit userspace on
64-bit capable hardware?  In fact why doesn't it work with 32-bit
userspace on a 64-bit kernel?

 - R.


From rdreier at cisco.com  Wed Dec 13 14:27:52 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:27:52 -0800
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <20061213180916.GA1689@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 13 Dec 2006 20:09:16 +0200")
References: <20061212151039.GJ26613@mellanox.co.il>
	<20061213180916.GA1689@mellanox.co.il>
Message-ID: <ada4przsa6v.fsf@cisco.com>

I was going to apply this, but then I realized that mthca is screwed
up on non-cache-coherent CPUs with memfree HCAs, and this patch makes
things much worse.  The problem is that we allocate the MTT table with
alloc_pages() and then do pci_map_sg().  But there's no
pci_dma_sync_sg calls when the CPU tries to write directly to the MTT
table, and in fact not even that would work: since a
non-cache-coherent CPU can only work on cacheline-sized chunks there's
no safe way to touch the MTT table.

What all that means is that FMRs are currently broken for memfree on
non-coherent CPUs.  And this patch would break all memory
registration.  I think the fix has to be to use dma_alloc_coherent()
to allocate the pages for the MTT table (and any other table allocated
in lowmem -- but I don't think there are any others).

Unfortunately my PowerPC 440 system is being reworked right now so I
can't test this for a few days.

I think this still can go into 2.6.20 after -rc1 if we can get this
fixed up.

 - R.


From rdreier at cisco.com  Wed Dec 13 14:29:45 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:29:45 -0800
Subject: [openib-general] [PATCH] mthca: move code from post send to
	post receive
In-Reply-To: <20061213114916.GA23726@mellanox.co.il> (Michael S.
	Tsirkin's message of "Wed, 13 Dec 2006 13:49:16 +0200")
References: <20061213114916.GA23726@mellanox.co.il>
Message-ID: <adazm9rqvja.fsf@cisco.com>

 > While unlikely to give a large gain, this makes sense to me.

Out of curiosity -- can you measure any difference at all with this?
I would have guessed that the addition can be scheduled so that it
costs nothing at all on any common CPU.

I guess it doesn't hurt though.  Want to make a similar patch for libmthca?

 - R.


From rdreier at cisco.com  Wed Dec 13 14:30:54 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:30:54 -0800
Subject: [openib-general] [GIT PULL] please pull infiniband.git
In-Reply-To: <457FB82B.4090902@voltaire.com> (Or Gerlitz's message of
	"Wed, 13 Dec 2006 10:22:03 +0200")
References: <ada8xhctztu.fsf@cisco.com> <457FB82B.4090902@voltaire.com>
Message-ID: <adavekfqvhd.fsf@cisco.com>

 > you have CC-ed lkml at cisco.com on this email, is there a chance you
 > wanted to CC linux-kernel at vger.kernel.org instead ...

Yep, a typo caused by my auto-expand not triggering.  No big deal though...

 > May i ask what prevented the v3 of the mthca profile patch (see
 > http://article.gmane.org/gmane.linux.drivers.openib/34005) to get in?

The patch as posted is both ugly and wrong.  I still plan to fix it up
and merge it for 2.6.20, but I didn't get a chance yet.

 - R.


From rdreier at cisco.com  Wed Dec 13 14:32:28 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:32:28 -0800
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com> (
	keshetti mahesh's message of "Wed, 13 Dec 2006 06:55:13 +0000 (GMT)")
References: <20061213065514.30377.qmail@web8322.mail.in.yahoo.com>
Message-ID: <adar6v3qver.fsf@cisco.com>

 > But isn't it handled by SMA in the host...... i am little bit confused now .
 > please just whether  it is required to implement process_mad (suppose) for new HCA driver....if it is required  why?

You can think of the process_mad() method as the interface from the
SMA to the hardware.  For example when a set of PortInfo occurs, then
the hardware has to know what the local LID is, etc.


From rdreier at cisco.com  Wed Dec 13 14:37:06 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 14:37:06 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <457F080C.2090202@ichips.intel.com> (Sean Hefty's message
	of "Tue, 12 Dec 2006 11:50:36 -0800")
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<adaejr4vrmp.fsf@cisco.com> <457F080C.2090202@ichips.intel.com>
Message-ID: <adairgfqv71.fsf@cisco.com>

 > I don't think so.  The code followed the ucm, which is likely whatever
 > Libor had done.  Did umad or uverbs follow this same format at some
 > point?  In any case, this and the ucm could probably both be cleaned
 > up.

I don't think umad/uverbs ever looked like that.  You picked the wrong
code to copy ;)

Anyway I'll cook up a patch to clean it up at some point...


From tom at opengridcomputing.com  Wed Dec 13 14:38:18 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Wed, 13 Dec 2006 16:38:18 -0600
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <Pine.LNX.4.64.0612131552340.20796@jlentini-linux.nane.netapp.com>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
	<Pine.LNX.4.64.0612131552340.20796@jlentini-linux.nane.netapp.com>
Message-ID: <1166049498.10873.7.camel@trinity.ogc.int>


22 is EINVAL. I believe the only way to get this on an RDMA connection
is when there is an error in the RPCRDMA header. The completing WR is
just a flush that resulted from shutting the connection down.
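
If you want to double-check a numeric error like the 22 in that log line,
a tiny userspace helper is enough. This is just a convenience sketch;
errno_text is a made-up name, and the numbering shown is the Linux one.

```c
#include <errno.h>
#include <string.h>

/* Return the C library's description for an errno value, e.g. 22. */
const char *errno_text(int err)
{
	return strerror(err);
}
```

On Linux, EINVAL is 22, so errno_text(22) and errno_text(EINVAL) give the
same text.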


On Wed, 2006-12-13 at 17:09 -0500, James Lentini wrote:
> 
> On Tue, 12 Dec 2006, Vu Pham wrote:
> 
> > > > 2.  While some clients run I/Os, one idle client try to access the mount
> > > > point ie. *ls* and get I/O input error. I see these error messages on
> > > > server log
> 
> Was there anything in the log before this point? I'd expect to see a 
> message started with "svcrdma: failed to post SQ..."
> 
> > > > Dec 12 13:58:29 ibd202 kernel: nfsd: terminating on error 22
> > > > Dec 12 13:58:29 ibd202 kernel: svcrdma: bad WR completion
> > > > Dec 12 13:58:29 ibd202 kernel:  ctxt=ffff810242130800, count=1 on
> > > > xprt=ffff8102431c0400, rqstp=ffff8102414cdc00, status=5
> > > > ...
> > > > Dec 12 14:04:29 ibd202 kernel: ib_mthca 0000:08:00.0: CQ entry for unknown
> > > > QP 2e0408


From tom at opengridcomputing.com  Wed Dec 13 14:40:50 2006
From: tom at opengridcomputing.com (Tom Tucker)
Date: Wed, 13 Dec 2006 16:40:50 -0600
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <457F426B.7020104@mellanox.com>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
Message-ID: <1166049650.10873.9.camel@trinity.ogc.int>

Vu:

[...snip...] 
> > 
> > 2. Can you please send me the iozone test parameters you're using?
> > 
> 
> server has 8GB of mem, client has 2GB of mem
> 
> iozone -r 64KB -s 5g -i 0 -i 1
> and
> iozone -r 64KB -s 2g -i 0 -i 1 -t 3
> 

Can you please send me the iozone output you get from these commands?
Thanks,

> thanks,
> -vu
> 
> > Thanks,
> > Tom
> >> thanks,
> >> -vu
> > 
> 
> plain text document attachment (.config)
> #
> # Automatically generated make config: don't edit
> # Linux kernel version: 2.6.18.5
> # Tue Dec 12 10:08:55 2006
> #
> CONFIG_X86_64=y
> CONFIG_64BIT=y
> CONFIG_X86=y
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_SEMAPHORE_SLEEPERS=y
> CONFIG_MMU=y
> CONFIG_RWSEM_GENERIC_SPINLOCK=y
> CONFIG_GENERIC_HWEIGHT=y
> CONFIG_GENERIC_CALIBRATE_DELAY=y
> CONFIG_X86_CMPXCHG=y
> CONFIG_EARLY_PRINTK=y
> CONFIG_GENERIC_ISA_DMA=y
> CONFIG_GENERIC_IOMAP=y
> CONFIG_ARCH_MAY_HAVE_PC_FDC=y
> CONFIG_DMI=y
> CONFIG_AUDIT_ARCH=y
> CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
> 
> #
> # Code maturity level options
> #
> CONFIG_EXPERIMENTAL=y
> CONFIG_LOCK_KERNEL=y
> CONFIG_INIT_ENV_ARG_LIMIT=32
> 
> #
> # General setup
> #
> CONFIG_LOCALVERSION=""
> CONFIG_LOCALVERSION_AUTO=y
> CONFIG_SWAP=y
> CONFIG_SYSVIPC=y
> CONFIG_POSIX_MQUEUE=y
> CONFIG_BSD_PROCESS_ACCT=y
> # CONFIG_BSD_PROCESS_ACCT_V3 is not set
> # CONFIG_TASKSTATS is not set
> CONFIG_AUDIT=y
> CONFIG_AUDITSYSCALL=y
> # CONFIG_IKCONFIG is not set
> # CONFIG_CPUSETS is not set
> # CONFIG_RELAY is not set
> CONFIG_INITRAMFS_SOURCE=""
> CONFIG_CC_OPTIMIZE_FOR_SIZE=y
> # CONFIG_EMBEDDED is not set
> CONFIG_UID16=y
> CONFIG_SYSCTL=y
> CONFIG_KALLSYMS=y
> # CONFIG_KALLSYMS_ALL is not set
> CONFIG_KALLSYMS_EXTRA_PASS=y
> CONFIG_HOTPLUG=y
> CONFIG_PRINTK=y
> CONFIG_BUG=y
> CONFIG_ELF_CORE=y
> CONFIG_BASE_FULL=y
> CONFIG_FUTEX=y
> CONFIG_EPOLL=y
> CONFIG_SHMEM=y
> CONFIG_SLAB=y
> CONFIG_VM_EVENT_COUNTERS=y
> CONFIG_RT_MUTEXES=y
> # CONFIG_TINY_SHMEM is not set
> CONFIG_BASE_SMALL=0
> # CONFIG_SLOB is not set
> 
> #
> # Loadable module support
> #
> CONFIG_MODULES=y
> CONFIG_MODULE_UNLOAD=y
> # CONFIG_MODULE_FORCE_UNLOAD is not set
> CONFIG_MODVERSIONS=y
> # CONFIG_MODULE_SRCVERSION_ALL is not set
> CONFIG_KMOD=y
> CONFIG_STOP_MACHINE=y
> 
> #
> # Block layer
> #
> CONFIG_LBD=y
> # CONFIG_BLK_DEV_IO_TRACE is not set
> CONFIG_LSF=y
> 
> #
> # IO Schedulers
> #
> CONFIG_IOSCHED_NOOP=y
> CONFIG_IOSCHED_AS=y
> CONFIG_IOSCHED_DEADLINE=y
> CONFIG_IOSCHED_CFQ=y
> # CONFIG_DEFAULT_AS is not set
> CONFIG_DEFAULT_DEADLINE=y
> # CONFIG_DEFAULT_CFQ is not set
> # CONFIG_DEFAULT_NOOP is not set
> CONFIG_DEFAULT_IOSCHED="deadline"
> 
> #
> # Processor type and features
> #
> CONFIG_X86_PC=y
> # CONFIG_X86_VSMP is not set
> # CONFIG_MK8 is not set
> # CONFIG_MPSC is not set
> CONFIG_GENERIC_CPU=y
> CONFIG_X86_L1_CACHE_BYTES=128
> CONFIG_X86_L1_CACHE_SHIFT=7
> CONFIG_X86_INTERNODE_CACHE_BYTES=128
> CONFIG_X86_TSC=y
> CONFIG_X86_GOOD_APIC=y
> CONFIG_MICROCODE=m
> CONFIG_X86_MSR=y
> CONFIG_X86_CPUID=y
> CONFIG_X86_HT=y
> CONFIG_X86_IO_APIC=y
> CONFIG_X86_LOCAL_APIC=y
> CONFIG_MTRR=y
> CONFIG_SMP=y
> CONFIG_SCHED_SMT=y
> CONFIG_SCHED_MC=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> CONFIG_PREEMPT_BKL=y
> CONFIG_NUMA=y
> CONFIG_K8_NUMA=y
> CONFIG_NODES_SHIFT=6
> CONFIG_X86_64_ACPI_NUMA=y
> # CONFIG_NUMA_EMU is not set
> CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
> CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
> CONFIG_ARCH_SPARSEMEM_ENABLE=y
> CONFIG_SELECT_MEMORY_MODEL=y
> # CONFIG_FLATMEM_MANUAL is not set
> CONFIG_DISCONTIGMEM_MANUAL=y
> # CONFIG_SPARSEMEM_MANUAL is not set
> CONFIG_DISCONTIGMEM=y
> CONFIG_FLAT_NODE_MEM_MAP=y
> CONFIG_NEED_MULTIPLE_NODES=y
> # CONFIG_SPARSEMEM_STATIC is not set
> CONFIG_SPLIT_PTLOCK_CPUS=4
> CONFIG_MIGRATION=y
> CONFIG_RESOURCES_64BIT=y
> CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
> CONFIG_OUT_OF_LINE_PFN_TO_PAGE=y
> CONFIG_NR_CPUS=8
> # CONFIG_HOTPLUG_CPU is not set
> CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
> CONFIG_HPET_TIMER=y
> CONFIG_HPET_EMULATE_RTC=y
> CONFIG_IOMMU=y
> CONFIG_CALGARY_IOMMU=y
> CONFIG_SWIOTLB=y
> CONFIG_X86_MCE=y
> CONFIG_X86_MCE_INTEL=y
> CONFIG_X86_MCE_AMD=y
> # CONFIG_KEXEC is not set
> # CONFIG_CRASH_DUMP is not set
> CONFIG_PHYSICAL_START=0x200000
> CONFIG_SECCOMP=y
> # CONFIG_HZ_100 is not set
> CONFIG_HZ_250=y
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=250
> # CONFIG_REORDER is not set
> CONFIG_K8_NB=y
> CONFIG_GENERIC_HARDIRQS=y
> CONFIG_GENERIC_IRQ_PROBE=y
> CONFIG_ISA_DMA_API=y
> CONFIG_GENERIC_PENDING_IRQ=y
> 
> #
> # Power management options
> #
> CONFIG_PM=y
> CONFIG_PM_LEGACY=y
> # CONFIG_PM_DEBUG is not set
> 
> #
> # ACPI (Advanced Configuration and Power Interface) Support
> #
> CONFIG_ACPI=y
> CONFIG_ACPI_AC=m
> CONFIG_ACPI_BATTERY=m
> CONFIG_ACPI_BUTTON=m
> CONFIG_ACPI_VIDEO=y
> # CONFIG_ACPI_HOTKEY is not set
> CONFIG_ACPI_FAN=y
> # CONFIG_ACPI_DOCK is not set
> CONFIG_ACPI_PROCESSOR=y
> CONFIG_ACPI_THERMAL=y
> CONFIG_ACPI_NUMA=y
> CONFIG_ACPI_ASUS=m
> # CONFIG_ACPI_IBM is not set
> CONFIG_ACPI_TOSHIBA=m
> CONFIG_ACPI_BLACKLIST_YEAR=0
> # CONFIG_ACPI_DEBUG is not set
> CONFIG_ACPI_EC=y
> CONFIG_ACPI_POWER=y
> CONFIG_ACPI_SYSTEM=y
> CONFIG_X86_PM_TIMER=y
> # CONFIG_ACPI_CONTAINER is not set
> # CONFIG_ACPI_SBS is not set
> 
> #
> # CPU Frequency scaling
> #
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_TABLE=y
> # CONFIG_CPU_FREQ_DEBUG is not set
> CONFIG_CPU_FREQ_STAT=y
> # CONFIG_CPU_FREQ_STAT_DETAILS is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
> CONFIG_CPU_FREQ_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_ONDEMAND=m
> # CONFIG_CPU_FREQ_GOV_CONSERVATIVE is not set
> 
> #
> # CPUFreq processor drivers
> #
> CONFIG_X86_POWERNOW_K8=y
> CONFIG_X86_POWERNOW_K8_ACPI=y
> CONFIG_X86_SPEEDSTEP_CENTRINO=y
> CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
> CONFIG_X86_ACPI_CPUFREQ=y
> 
> #
> # shared options
> #
> # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
> # CONFIG_X86_SPEEDSTEP_LIB is not set
> 
> #
> # Bus options (PCI etc.)
> #
> CONFIG_PCI=y
> CONFIG_PCI_DIRECT=y
> CONFIG_PCI_MMCONFIG=y
> # CONFIG_PCIEPORTBUS is not set
> CONFIG_PCI_MSI=y
> # CONFIG_PCI_DEBUG is not set
> 
> #
> # PCCARD (PCMCIA/CardBus) support
> #
> # CONFIG_PCCARD is not set
> 
> #
> # PCI Hotplug Support
> #
> CONFIG_HOTPLUG_PCI=y
> # CONFIG_HOTPLUG_PCI_FAKE is not set
> CONFIG_HOTPLUG_PCI_ACPI=m
> CONFIG_HOTPLUG_PCI_ACPI_IBM=m
> # CONFIG_HOTPLUG_PCI_CPCI is not set
> CONFIG_HOTPLUG_PCI_SHPC=m
> # CONFIG_HOTPLUG_PCI_SHPC_POLL_EVENT_MODE is not set
> 
> #
> # Executable file formats / Emulations
> #
> CONFIG_BINFMT_ELF=y
> CONFIG_BINFMT_MISC=y
> CONFIG_IA32_EMULATION=y
> # CONFIG_IA32_AOUT is not set
> CONFIG_COMPAT=y
> CONFIG_SYSVIPC_COMPAT=y
> 
> #
> # Networking
> #
> CONFIG_NET=y
> 
> #
> # Networking options
> #
> # CONFIG_NETDEBUG is not set
> CONFIG_PACKET=y
> CONFIG_PACKET_MMAP=y
> CONFIG_UNIX=y
> CONFIG_XFRM=y
> CONFIG_XFRM_USER=y
> CONFIG_NET_KEY=m
> CONFIG_INET=y
> CONFIG_IP_MULTICAST=y
> CONFIG_IP_ADVANCED_ROUTER=y
> CONFIG_ASK_IP_FIB_HASH=y
> # CONFIG_IP_FIB_TRIE is not set
> CONFIG_IP_FIB_HASH=y
> CONFIG_IP_MULTIPLE_TABLES=y
> CONFIG_IP_ROUTE_FWMARK=y
> CONFIG_IP_ROUTE_MULTIPATH=y
> # CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
> CONFIG_IP_ROUTE_VERBOSE=y
> # CONFIG_IP_PNP is not set
> CONFIG_NET_IPIP=m
> CONFIG_NET_IPGRE=m
> CONFIG_NET_IPGRE_BROADCAST=y
> CONFIG_IP_MROUTE=y
> CONFIG_IP_PIMSM_V1=y
> CONFIG_IP_PIMSM_V2=y
> # CONFIG_ARPD is not set
> CONFIG_SYN_COOKIES=y
> CONFIG_INET_AH=m
> CONFIG_INET_ESP=m
> CONFIG_INET_IPCOMP=m
> CONFIG_INET_XFRM_TUNNEL=m
> CONFIG_INET_TUNNEL=m
> CONFIG_INET_XFRM_MODE_TRANSPORT=y
> CONFIG_INET_XFRM_MODE_TUNNEL=y
> CONFIG_INET_DIAG=y
> CONFIG_INET_TCP_DIAG=y
> # CONFIG_TCP_CONG_ADVANCED is not set
> CONFIG_TCP_CONG_BIC=y
> 
> #
> # IP: Virtual Server Configuration
> #
> CONFIG_IP_VS=m
> # CONFIG_IP_VS_DEBUG is not set
> CONFIG_IP_VS_TAB_BITS=12
> 
> #
> # IPVS transport protocol load balancing support
> #
> CONFIG_IP_VS_PROTO_TCP=y
> CONFIG_IP_VS_PROTO_UDP=y
> CONFIG_IP_VS_PROTO_ESP=y
> CONFIG_IP_VS_PROTO_AH=y
> 
> #
> # IPVS scheduler
> #
> CONFIG_IP_VS_RR=m
> CONFIG_IP_VS_WRR=m
> CONFIG_IP_VS_LC=m
> CONFIG_IP_VS_WLC=m
> CONFIG_IP_VS_LBLC=m
> CONFIG_IP_VS_LBLCR=m
> CONFIG_IP_VS_DH=m
> CONFIG_IP_VS_SH=m
> CONFIG_IP_VS_SED=m
> CONFIG_IP_VS_NQ=m
> 
> #
> # IPVS application helper
> #
> CONFIG_IP_VS_FTP=m
> CONFIG_IPV6=m
> CONFIG_IPV6_PRIVACY=y
> # CONFIG_IPV6_ROUTER_PREF is not set
> CONFIG_INET6_AH=m
> CONFIG_INET6_ESP=m
> CONFIG_INET6_IPCOMP=m
> CONFIG_INET6_XFRM_TUNNEL=m
> CONFIG_INET6_TUNNEL=m
> CONFIG_INET6_XFRM_MODE_TRANSPORT=m
> CONFIG_INET6_XFRM_MODE_TUNNEL=m
> CONFIG_IPV6_TUNNEL=m
> CONFIG_NETWORK_SECMARK=y
> CONFIG_NETFILTER=y
> # CONFIG_NETFILTER_DEBUG is not set
> CONFIG_BRIDGE_NETFILTER=y
> 
> #
> # Core Netfilter Configuration
> #
> # CONFIG_NETFILTER_NETLINK is not set
> # CONFIG_NETFILTER_XTABLES is not set
> 
> #
> # IP: Netfilter Configuration
> #
> CONFIG_IP_NF_CONNTRACK=m
> CONFIG_IP_NF_CT_ACCT=y
> # CONFIG_IP_NF_CONNTRACK_MARK is not set
> # CONFIG_IP_NF_CONNTRACK_SECMARK is not set
> # CONFIG_IP_NF_CONNTRACK_EVENTS is not set
> CONFIG_IP_NF_CT_PROTO_SCTP=m
> CONFIG_IP_NF_FTP=m
> CONFIG_IP_NF_IRC=m
> # CONFIG_IP_NF_NETBIOS_NS is not set
> CONFIG_IP_NF_TFTP=m
> CONFIG_IP_NF_AMANDA=m
> # CONFIG_IP_NF_PPTP is not set
> # CONFIG_IP_NF_H323 is not set
> # CONFIG_IP_NF_SIP is not set
> CONFIG_IP_NF_QUEUE=m
> 
> #
> # IPv6: Netfilter Configuration (EXPERIMENTAL)
> #
> # CONFIG_IP6_NF_QUEUE is not set
> 
> #
> # Bridge: Netfilter Configuration
> #
> CONFIG_BRIDGE_NF_EBTABLES=m
> CONFIG_BRIDGE_EBT_BROUTE=m
> CONFIG_BRIDGE_EBT_T_FILTER=m
> CONFIG_BRIDGE_EBT_T_NAT=m
> CONFIG_BRIDGE_EBT_802_3=m
> CONFIG_BRIDGE_EBT_AMONG=m
> CONFIG_BRIDGE_EBT_ARP=m
> CONFIG_BRIDGE_EBT_IP=m
> CONFIG_BRIDGE_EBT_LIMIT=m
> CONFIG_BRIDGE_EBT_MARK=m
> CONFIG_BRIDGE_EBT_PKTTYPE=m
> CONFIG_BRIDGE_EBT_STP=m
> CONFIG_BRIDGE_EBT_VLAN=m
> CONFIG_BRIDGE_EBT_ARPREPLY=m
> CONFIG_BRIDGE_EBT_DNAT=m
> CONFIG_BRIDGE_EBT_MARK_T=m
> CONFIG_BRIDGE_EBT_REDIRECT=m
> CONFIG_BRIDGE_EBT_SNAT=m
> CONFIG_BRIDGE_EBT_LOG=m
> # CONFIG_BRIDGE_EBT_ULOG is not set
> 
> #
> # DCCP Configuration (EXPERIMENTAL)
> #
> # CONFIG_IP_DCCP is not set
> 
> #
> # SCTP Configuration (EXPERIMENTAL)
> #
> CONFIG_IP_SCTP=m
> # CONFIG_SCTP_DBG_MSG is not set
> # CONFIG_SCTP_DBG_OBJCNT is not set
> # CONFIG_SCTP_HMAC_NONE is not set
> # CONFIG_SCTP_HMAC_SHA1 is not set
> CONFIG_SCTP_HMAC_MD5=y
> 
> #
> # TIPC Configuration (EXPERIMENTAL)
> #
> # CONFIG_TIPC is not set
> CONFIG_ATM=m
> CONFIG_ATM_CLIP=m
> # CONFIG_ATM_CLIP_NO_ICMP is not set
> CONFIG_ATM_LANE=m
> # CONFIG_ATM_MPOA is not set
> CONFIG_ATM_BR2684=m
> # CONFIG_ATM_BR2684_IPFILTER is not set
> CONFIG_BRIDGE=m
> CONFIG_VLAN_8021Q=m
> # CONFIG_DECNET is not set
> CONFIG_LLC=y
> # CONFIG_LLC2 is not set
> # CONFIG_IPX is not set
> # CONFIG_ATALK is not set
> # CONFIG_X25 is not set
> # CONFIG_LAPB is not set
> # CONFIG_ECONET is not set
> # CONFIG_WAN_ROUTER is not set
> 
> #
> # QoS and/or fair queueing
> #
> CONFIG_NET_SCHED=y
> CONFIG_NET_SCH_CLK_JIFFIES=y
> # CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set
> # CONFIG_NET_SCH_CLK_CPU is not set
> 
> #
> # Queueing/Scheduling
> #
> CONFIG_NET_SCH_CBQ=m
> CONFIG_NET_SCH_HTB=m
> CONFIG_NET_SCH_HFSC=m
> CONFIG_NET_SCH_ATM=m
> CONFIG_NET_SCH_PRIO=m
> CONFIG_NET_SCH_RED=m
> CONFIG_NET_SCH_SFQ=m
> CONFIG_NET_SCH_TEQL=m
> CONFIG_NET_SCH_TBF=m
> CONFIG_NET_SCH_GRED=m
> CONFIG_NET_SCH_DSMARK=m
> CONFIG_NET_SCH_NETEM=m
> CONFIG_NET_SCH_INGRESS=m
> 
> #
> # Classification
> #
> CONFIG_NET_CLS=y
> # CONFIG_NET_CLS_BASIC is not set
> CONFIG_NET_CLS_TCINDEX=m
> CONFIG_NET_CLS_ROUTE4=m
> CONFIG_NET_CLS_ROUTE=y
> CONFIG_NET_CLS_FW=m
> CONFIG_NET_CLS_U32=m
> CONFIG_CLS_U32_PERF=y
> # CONFIG_CLS_U32_MARK is not set
> CONFIG_NET_CLS_RSVP=m
> CONFIG_NET_CLS_RSVP6=m
> # CONFIG_NET_EMATCH is not set
> # CONFIG_NET_CLS_ACT is not set
> CONFIG_NET_CLS_POLICE=y
> CONFIG_NET_CLS_IND=y
> CONFIG_NET_ESTIMATOR=y
> 
> #
> # Network testing
> #
> # CONFIG_NET_PKTGEN is not set
> # CONFIG_NET_TCPPROBE is not set
> # CONFIG_HAMRADIO is not set
> # CONFIG_IRDA is not set
> CONFIG_BT=m
> CONFIG_BT_L2CAP=m
> CONFIG_BT_SCO=m
> CONFIG_BT_RFCOMM=m
> CONFIG_BT_RFCOMM_TTY=y
> CONFIG_BT_BNEP=m
> CONFIG_BT_BNEP_MC_FILTER=y
> CONFIG_BT_BNEP_PROTO_FILTER=y
> CONFIG_BT_CMTP=m
> CONFIG_BT_HIDP=m
> 
> #
> # Bluetooth device drivers
> #
> CONFIG_BT_HCIUSB=m
> CONFIG_BT_HCIUSB_SCO=y
> CONFIG_BT_HCIUART=m
> CONFIG_BT_HCIUART_H4=y
> CONFIG_BT_HCIUART_BCSP=y
> CONFIG_BT_HCIBCM203X=m
> # CONFIG_BT_HCIBPA10X is not set
> CONFIG_BT_HCIBFUSB=m
> CONFIG_BT_HCIVHCI=m
> CONFIG_IEEE80211=m
> # CONFIG_IEEE80211_DEBUG is not set
> # CONFIG_IEEE80211_CRYPT_WEP is not set
> # CONFIG_IEEE80211_CRYPT_CCMP is not set
> CONFIG_IEEE80211_CRYPT_TKIP=m
> # CONFIG_IEEE80211_SOFTMAC is not set
> CONFIG_WIRELESS_EXT=y
> 
> #
> # Device Drivers
> #
> 
> #
> # Generic Driver Options
> #
> CONFIG_STANDALONE=y
> CONFIG_PREVENT_FIRMWARE_BUILD=y
> CONFIG_FW_LOADER=y
> # CONFIG_DEBUG_DRIVER is not set
> # CONFIG_SYS_HYPERVISOR is not set
> 
> #
> # Connector - unified userspace <-> kernelspace linker
> #
> # CONFIG_CONNECTOR is not set
> 
> #
> # Memory Technology Devices (MTD)
> #
> CONFIG_MTD=m
> # CONFIG_MTD_DEBUG is not set
> CONFIG_MTD_CONCAT=m
> CONFIG_MTD_PARTITIONS=y
> CONFIG_MTD_REDBOOT_PARTS=m
> CONFIG_MTD_REDBOOT_DIRECTORY_BLOCK=-1
> # CONFIG_MTD_REDBOOT_PARTS_UNALLOCATED is not set
> # CONFIG_MTD_REDBOOT_PARTS_READONLY is not set
> CONFIG_MTD_CMDLINE_PARTS=y
> 
> #
> # User Modules And Translation Layers
> #
> CONFIG_MTD_CHAR=m
> CONFIG_MTD_BLOCK=m
> CONFIG_MTD_BLOCK_RO=m
> CONFIG_FTL=m
> CONFIG_NFTL=m
> CONFIG_NFTL_RW=y
> # CONFIG_INFTL is not set
> # CONFIG_RFD_FTL is not set
> 
> #
> # RAM/ROM/Flash chip drivers
> #
> CONFIG_MTD_CFI=m
> CONFIG_MTD_JEDECPROBE=m
> CONFIG_MTD_GEN_PROBE=m
> # CONFIG_MTD_CFI_ADV_OPTIONS is not set
> CONFIG_MTD_MAP_BANK_WIDTH_1=y
> CONFIG_MTD_MAP_BANK_WIDTH_2=y
> CONFIG_MTD_MAP_BANK_WIDTH_4=y
> # CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
> # CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
> # CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
> CONFIG_MTD_CFI_I1=y
> CONFIG_MTD_CFI_I2=y
> # CONFIG_MTD_CFI_I4 is not set
> # CONFIG_MTD_CFI_I8 is not set
> CONFIG_MTD_CFI_INTELEXT=m
> CONFIG_MTD_CFI_AMDSTD=m
> CONFIG_MTD_CFI_STAA=m
> CONFIG_MTD_CFI_UTIL=m
> CONFIG_MTD_RAM=m
> CONFIG_MTD_ROM=m
> CONFIG_MTD_ABSENT=m
> # CONFIG_MTD_OBSOLETE_CHIPS is not set
> 
> #
> # Mapping drivers for chip access
> #
> CONFIG_MTD_COMPLEX_MAPPINGS=y
> # CONFIG_MTD_PHYSMAP is not set
> # CONFIG_MTD_PNC2000 is not set
> CONFIG_MTD_SC520CDP=m
> CONFIG_MTD_NETSC520=m
> # CONFIG_MTD_TS5500 is not set
> # CONFIG_MTD_SBC_GXX is not set
> # CONFIG_MTD_AMD76XROM is not set
> CONFIG_MTD_ICHXROM=m
> # CONFIG_MTD_SCB2_FLASH is not set
> # CONFIG_MTD_NETtel is not set
> # CONFIG_MTD_DILNETPC is not set
> # CONFIG_MTD_L440GX is not set
> # CONFIG_MTD_PCI is not set
> # CONFIG_MTD_PLATRAM is not set
> 
> #
> # Self-contained MTD device drivers
> #
> # CONFIG_MTD_PMC551 is not set
> # CONFIG_MTD_SLRAM is not set
> # CONFIG_MTD_PHRAM is not set
> CONFIG_MTD_MTDRAM=m
> CONFIG_MTDRAM_TOTAL_SIZE=4096
> CONFIG_MTDRAM_ERASE_SIZE=128
> # CONFIG_MTD_BLOCK2MTD is not set
> 
> #
> # Disk-On-Chip Device Drivers
> #
> # CONFIG_MTD_DOC2000 is not set
> # CONFIG_MTD_DOC2001 is not set
> # CONFIG_MTD_DOC2001PLUS is not set
> 
> #
> # NAND Flash Device Drivers
> #
> CONFIG_MTD_NAND=m
> # CONFIG_MTD_NAND_VERIFY_WRITE is not set
> # CONFIG_MTD_NAND_ECC_SMC is not set
> CONFIG_MTD_NAND_IDS=m
> # CONFIG_MTD_NAND_DISKONCHIP is not set
> # CONFIG_MTD_NAND_NANDSIM is not set
> 
> #
> # OneNAND Flash Device Drivers
> #
> # CONFIG_MTD_ONENAND is not set
> 
> #
> # Parallel port support
> #
> CONFIG_PARPORT=m
> CONFIG_PARPORT_PC=m
> CONFIG_PARPORT_SERIAL=m
> # CONFIG_PARPORT_PC_FIFO is not set
> # CONFIG_PARPORT_PC_SUPERIO is not set
> CONFIG_PARPORT_NOT_PC=y
> # CONFIG_PARPORT_GSC is not set
> # CONFIG_PARPORT_AX88796 is not set
> CONFIG_PARPORT_1284=y
> 
> #
> # Plug and Play support
> #
> # CONFIG_PNP is not set
> 
> #
> # Block devices
> #
> CONFIG_BLK_DEV_FD=m
> # CONFIG_PARIDE is not set
> CONFIG_BLK_CPQ_DA=m
> CONFIG_BLK_CPQ_CISS_DA=m
> CONFIG_CISS_SCSI_TAPE=y
> CONFIG_BLK_DEV_DAC960=m
> # CONFIG_BLK_DEV_UMEM is not set
> # CONFIG_BLK_DEV_COW_COMMON is not set
> CONFIG_BLK_DEV_LOOP=m
> CONFIG_BLK_DEV_CRYPTOLOOP=m
> CONFIG_BLK_DEV_NBD=m
> CONFIG_BLK_DEV_SX8=m
> # CONFIG_BLK_DEV_UB is not set
> CONFIG_BLK_DEV_RAM=y
> CONFIG_BLK_DEV_RAM_COUNT=16
> CONFIG_BLK_DEV_RAM_SIZE=16384
> CONFIG_BLK_DEV_RAM_BLOCKSIZE=1024
> CONFIG_BLK_DEV_INITRD=y
> # CONFIG_CDROM_PKTCDVD is not set
> # CONFIG_ATA_OVER_ETH is not set
> 
> #
> # ATA/ATAPI/MFM/RLL support
> #
> CONFIG_IDE=y
> CONFIG_BLK_DEV_IDE=y
> 
> #
> # Please see Documentation/ide.txt for help/info on IDE drives
> #
> # CONFIG_BLK_DEV_IDE_SATA is not set
> # CONFIG_BLK_DEV_HD_IDE is not set
> CONFIG_BLK_DEV_IDEDISK=y
> CONFIG_IDEDISK_MULTI_MODE=y
> CONFIG_BLK_DEV_IDECD=y
> # CONFIG_BLK_DEV_IDETAPE is not set
> CONFIG_BLK_DEV_IDEFLOPPY=y
> CONFIG_BLK_DEV_IDESCSI=m
> # CONFIG_IDE_TASK_IOCTL is not set
> 
> #
> # IDE chipset support/bugfixes
> #
> CONFIG_IDE_GENERIC=y
> # CONFIG_BLK_DEV_CMD640 is not set
> CONFIG_BLK_DEV_IDEPCI=y
> CONFIG_IDEPCI_SHARE_IRQ=y
> # CONFIG_BLK_DEV_OFFBOARD is not set
> CONFIG_BLK_DEV_GENERIC=y
> # CONFIG_BLK_DEV_OPTI621 is not set
> CONFIG_BLK_DEV_RZ1000=y
> CONFIG_BLK_DEV_IDEDMA_PCI=y
> # CONFIG_BLK_DEV_IDEDMA_FORCED is not set
> CONFIG_IDEDMA_PCI_AUTO=y
> # CONFIG_IDEDMA_ONLYDISK is not set
> CONFIG_BLK_DEV_AEC62XX=y
> CONFIG_BLK_DEV_ALI15X3=y
> # CONFIG_WDC_ALI15X3 is not set
> CONFIG_BLK_DEV_AMD74XX=y
> CONFIG_BLK_DEV_ATIIXP=y
> CONFIG_BLK_DEV_CMD64X=y
> CONFIG_BLK_DEV_TRIFLEX=y
> CONFIG_BLK_DEV_CY82C693=y
> CONFIG_BLK_DEV_CS5520=y
> CONFIG_BLK_DEV_CS5530=y
> CONFIG_BLK_DEV_HPT34X=y
> # CONFIG_HPT34X_AUTODMA is not set
> CONFIG_BLK_DEV_HPT366=y
> # CONFIG_BLK_DEV_SC1200 is not set
> CONFIG_BLK_DEV_PIIX=y
> # CONFIG_BLK_DEV_IT821X is not set
> # CONFIG_BLK_DEV_NS87415 is not set
> CONFIG_BLK_DEV_PDC202XX_OLD=y
> # CONFIG_PDC202XX_BURST is not set
> CONFIG_BLK_DEV_PDC202XX_NEW=y
> CONFIG_BLK_DEV_SVWKS=y
> CONFIG_BLK_DEV_SIIMAGE=y
> CONFIG_BLK_DEV_SIS5513=y
> CONFIG_BLK_DEV_SLC90E66=y
> # CONFIG_BLK_DEV_TRM290 is not set
> CONFIG_BLK_DEV_VIA82CXXX=y
> # CONFIG_IDE_ARM is not set
> CONFIG_BLK_DEV_IDEDMA=y
> # CONFIG_IDEDMA_IVB is not set
> CONFIG_IDEDMA_AUTO=y
> # CONFIG_BLK_DEV_HD is not set
> 
> #
> # SCSI device support
> #
> # CONFIG_RAID_ATTRS is not set
> CONFIG_SCSI=m
> CONFIG_SCSI_PROC_FS=y
> 
> #
> # SCSI support type (disk, tape, CD-ROM)
> #
> CONFIG_BLK_DEV_SD=m
> CONFIG_CHR_DEV_ST=m
> CONFIG_CHR_DEV_OSST=m
> CONFIG_BLK_DEV_SR=m
> CONFIG_BLK_DEV_SR_VENDOR=y
> CONFIG_CHR_DEV_SG=m
> # CONFIG_CHR_DEV_SCH is not set
> 
> #
> # Some SCSI devices (e.g. CD jukebox) support multiple LUNs
> #
> # CONFIG_SCSI_MULTI_LUN is not set
> CONFIG_SCSI_CONSTANTS=y
> CONFIG_SCSI_LOGGING=y
> 
> #
> # SCSI Transport Attributes
> #
> CONFIG_SCSI_SPI_ATTRS=m
> CONFIG_SCSI_FC_ATTRS=m
> CONFIG_SCSI_ISCSI_ATTRS=m
> CONFIG_SCSI_SAS_ATTRS=m
> 
> #
> # SCSI low-level drivers
> #
> # CONFIG_ISCSI_TCP is not set
> CONFIG_BLK_DEV_3W_XXXX_RAID=m
> CONFIG_SCSI_3W_9XXX=m
> CONFIG_SCSI_ACARD=m
> CONFIG_SCSI_AACRAID=m
> CONFIG_SCSI_AIC7XXX=m
> CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
> CONFIG_AIC7XXX_RESET_DELAY_MS=15000
> # CONFIG_AIC7XXX_DEBUG_ENABLE is not set
> CONFIG_AIC7XXX_DEBUG_MASK=0
> # CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
> CONFIG_SCSI_AIC7XXX_OLD=m
> CONFIG_SCSI_AIC79XX=m
> CONFIG_AIC79XX_CMDS_PER_DEVICE=4
> CONFIG_AIC79XX_RESET_DELAY_MS=15000
> # CONFIG_AIC79XX_ENABLE_RD_STRM is not set
> # CONFIG_AIC79XX_DEBUG_ENABLE is not set
> CONFIG_AIC79XX_DEBUG_MASK=0
> # CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
> CONFIG_MEGARAID_NEWGEN=y
> CONFIG_MEGARAID_MM=m
> CONFIG_MEGARAID_MAILBOX=m
> # CONFIG_MEGARAID_LEGACY is not set
> CONFIG_MEGARAID_SAS=m
> CONFIG_SCSI_SATA=m
> CONFIG_SCSI_SATA_AHCI=m
> CONFIG_SCSI_SATA_SVW=m
> CONFIG_SCSI_ATA_PIIX=m
> # CONFIG_SCSI_SATA_MV is not set
> CONFIG_SCSI_SATA_NV=m
> # CONFIG_SCSI_PDC_ADMA is not set
> # CONFIG_SCSI_HPTIOP is not set
> # CONFIG_SCSI_SATA_QSTOR is not set
> CONFIG_SCSI_SATA_PROMISE=m
> CONFIG_SCSI_SATA_SX4=m
> CONFIG_SCSI_SATA_SIL=m
> # CONFIG_SCSI_SATA_SIL24 is not set
> CONFIG_SCSI_SATA_SIS=m
> # CONFIG_SCSI_SATA_ULI is not set
> CONFIG_SCSI_SATA_VIA=m
> CONFIG_SCSI_SATA_VITESSE=m
> CONFIG_SCSI_SATA_INTEL_COMBINED=y
> # CONFIG_SCSI_BUSLOGIC is not set
> # CONFIG_SCSI_DMX3191D is not set
> # CONFIG_SCSI_EATA is not set
> # CONFIG_SCSI_FUTURE_DOMAIN is not set
> CONFIG_SCSI_GDTH=m
> CONFIG_SCSI_IPS=m
> CONFIG_SCSI_INITIO=m
> # CONFIG_SCSI_INIA100 is not set
> CONFIG_SCSI_PPA=m
> CONFIG_SCSI_IMM=m
> # CONFIG_SCSI_IZIP_EPP16 is not set
> # CONFIG_SCSI_IZIP_SLOW_CTR is not set
> CONFIG_SCSI_SYM53C8XX_2=m
> CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
> CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
> CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
> CONFIG_SCSI_SYM53C8XX_MMIO=y
> # CONFIG_SCSI_IPR is not set
> CONFIG_SCSI_QLOGIC_1280=m
> # CONFIG_SCSI_QLA_FC is not set
> CONFIG_SCSI_LPFC=m
> # CONFIG_SCSI_DC395x is not set
> # CONFIG_SCSI_DC390T is not set
> # CONFIG_SCSI_DEBUG is not set
> 
> #
> # Multi-device support (RAID and LVM)
> #
> CONFIG_MD=y
> CONFIG_BLK_DEV_MD=y
> CONFIG_MD_LINEAR=m
> CONFIG_MD_RAID0=m
> CONFIG_MD_RAID1=m
> CONFIG_MD_RAID10=m
> # CONFIG_MD_RAID456 is not set
> CONFIG_MD_MULTIPATH=m
> # CONFIG_MD_FAULTY is not set
> CONFIG_BLK_DEV_DM=m
> CONFIG_DM_CRYPT=m
> CONFIG_DM_SNAPSHOT=m
> CONFIG_DM_MIRROR=m
> CONFIG_DM_ZERO=m
> CONFIG_DM_MULTIPATH=m
> CONFIG_DM_MULTIPATH_EMC=m
> 
> #
> # Fusion MPT device support
> #
> CONFIG_FUSION=y
> CONFIG_FUSION_SPI=m
> CONFIG_FUSION_FC=m
> CONFIG_FUSION_SAS=m
> CONFIG_FUSION_MAX_SGE=40
> CONFIG_FUSION_CTL=m
> CONFIG_FUSION_LAN=m
> 
> #
> # IEEE 1394 (FireWire) support
> #
> # CONFIG_IEEE1394 is not set
> 
> #
> # I2O device support
> #
> CONFIG_I2O=m
> CONFIG_I2O_LCT_NOTIFY_ON_CHANGES=y
> CONFIG_I2O_EXT_ADAPTEC=y
> CONFIG_I2O_EXT_ADAPTEC_DMA64=y
> CONFIG_I2O_CONFIG=m
> CONFIG_I2O_CONFIG_OLD_IOCTL=y
> # CONFIG_I2O_BUS is not set
> CONFIG_I2O_BLOCK=m
> CONFIG_I2O_SCSI=m
> CONFIG_I2O_PROC=m
> 
> #
> # Network device support
> #
> CONFIG_NETDEVICES=y
> CONFIG_DUMMY=m
> CONFIG_BONDING=m
> # CONFIG_EQUALIZER is not set
> CONFIG_TUN=m
> 
> #
> # ARCnet devices
> #
> # CONFIG_ARCNET is not set
> 
> #
> # PHY device support
> #
> # CONFIG_PHYLIB is not set
> 
> #
> # Ethernet (10 or 100Mbit)
> #
> CONFIG_NET_ETHERNET=y
> CONFIG_MII=m
> CONFIG_HAPPYMEAL=m
> CONFIG_SUNGEM=m
> # CONFIG_CASSINI is not set
> CONFIG_NET_VENDOR_3COM=y
> CONFIG_VORTEX=m
> CONFIG_TYPHOON=m
> 
> #
> # Tulip family network device support
> #
> CONFIG_NET_TULIP=y
> CONFIG_DE2104X=m
> CONFIG_TULIP=m
> # CONFIG_TULIP_MWI is not set
> CONFIG_TULIP_MMIO=y
> # CONFIG_TULIP_NAPI is not set
> CONFIG_DE4X5=m
> CONFIG_WINBOND_840=m
> CONFIG_DM9102=m
> # CONFIG_ULI526X is not set
> # CONFIG_HP100 is not set
> CONFIG_NET_PCI=y
> CONFIG_PCNET32=m
> CONFIG_AMD8111_ETH=m
> CONFIG_AMD8111E_NAPI=y
> CONFIG_ADAPTEC_STARFIRE=m
> CONFIG_ADAPTEC_STARFIRE_NAPI=y
> CONFIG_B44=m
> CONFIG_FORCEDETH=m
> # CONFIG_DGRS is not set
> CONFIG_EEPRO100=m
> CONFIG_E100=m
> CONFIG_FEALNX=m
> CONFIG_NATSEMI=m
> CONFIG_NE2K_PCI=m
> CONFIG_8139CP=m
> CONFIG_8139TOO=m
> CONFIG_8139TOO_PIO=y
> # CONFIG_8139TOO_TUNE_TWISTER is not set
> CONFIG_8139TOO_8129=y
> # CONFIG_8139_OLD_RX_RESET is not set
> CONFIG_SIS900=m
> CONFIG_EPIC100=m
> # CONFIG_SUNDANCE is not set
> CONFIG_VIA_RHINE=m
> CONFIG_VIA_RHINE_MMIO=y
> # CONFIG_VIA_RHINE_NAPI is not set
> # CONFIG_NET_POCKET is not set
> 
> #
> # Ethernet (1000 Mbit)
> #
> CONFIG_ACENIC=m
> # CONFIG_ACENIC_OMIT_TIGON_I is not set
> CONFIG_DL2K=m
> CONFIG_E1000=m
> CONFIG_E1000_NAPI=y
> # CONFIG_E1000_DISABLE_PACKET_SPLIT is not set
> CONFIG_NS83820=m
> # CONFIG_HAMACHI is not set
> # CONFIG_YELLOWFIN is not set
> CONFIG_R8169=m
> CONFIG_R8169_NAPI=y
> # CONFIG_R8169_VLAN is not set
> # CONFIG_SIS190 is not set
> # CONFIG_SKGE is not set
> CONFIG_SKY2=m
> CONFIG_SK98LIN=m
> CONFIG_VIA_VELOCITY=m
> CONFIG_TIGON3=m
> CONFIG_BNX2=m
> 
> #
> # Ethernet (10000 Mbit)
> #
> # CONFIG_CHELSIO_T1 is not set
> CONFIG_IXGB=m
> CONFIG_IXGB_NAPI=y
> CONFIG_S2IO=m
> CONFIG_S2IO_NAPI=y
> # CONFIG_MYRI10GE is not set
> 
> #
> # Token Ring devices
> #
> CONFIG_TR=y
> CONFIG_IBMOL=m
> CONFIG_3C359=m
> CONFIG_TMS380TR=m
> CONFIG_TMSPCI=m
> CONFIG_ABYSS=m
> 
> #
> # Wireless LAN (non-hamradio)
> #
> CONFIG_NET_RADIO=y
> # CONFIG_NET_WIRELESS_RTNETLINK is not set
> 
> #
> # Obsolete Wireless cards support (pre-802.11)
> #
> # CONFIG_STRIP is not set
> 
> #
> # Wireless 802.11b ISA/PCI cards support
> #
> CONFIG_IPW2100=m
> # CONFIG_IPW2100_MONITOR is not set
> # CONFIG_IPW2100_DEBUG is not set
> CONFIG_IPW2200=m
> # CONFIG_IPW2200_MONITOR is not set
> # CONFIG_IPW2200_QOS is not set
> # CONFIG_IPW2200_DEBUG is not set
> # CONFIG_AIRO is not set
> CONFIG_HERMES=m
> CONFIG_PLX_HERMES=m
> CONFIG_TMD_HERMES=m
> # CONFIG_NORTEL_HERMES is not set
> CONFIG_PCI_HERMES=m
> CONFIG_ATMEL=m
> CONFIG_PCI_ATMEL=m
> 
> #
> # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
> #
> CONFIG_PRISM54=m
> # CONFIG_USB_ZD1201 is not set
> # CONFIG_HOSTAP is not set
> CONFIG_NET_WIRELESS=y
> 
> #
> # Wan interfaces
> #
> # CONFIG_WAN is not set
> 
> #
> # ATM drivers
> #
> # CONFIG_ATM_DUMMY is not set
> CONFIG_ATM_TCP=m
> CONFIG_ATM_LANAI=m
> CONFIG_ATM_ENI=m
> # CONFIG_ATM_ENI_DEBUG is not set
> # CONFIG_ATM_ENI_TUNE_BURST is not set
> CONFIG_ATM_FIRESTREAM=m
> # CONFIG_ATM_ZATM is not set
> CONFIG_ATM_IDT77252=m
> # CONFIG_ATM_IDT77252_DEBUG is not set
> # CONFIG_ATM_IDT77252_RCV_ALL is not set
> CONFIG_ATM_IDT77252_USE_SUNI=y
> CONFIG_ATM_AMBASSADOR=m
> # CONFIG_ATM_AMBASSADOR_DEBUG is not set
> CONFIG_ATM_HORIZON=m
> # CONFIG_ATM_HORIZON_DEBUG is not set
> CONFIG_ATM_FORE200E_MAYBE=m
> # CONFIG_ATM_FORE200E_PCA is not set
> CONFIG_ATM_HE=m
> # CONFIG_ATM_HE_USE_SUNI is not set
> CONFIG_FDDI=y
> # CONFIG_DEFXX is not set
> # CONFIG_SKFP is not set
> # CONFIG_HIPPI is not set
> # CONFIG_PLIP is not set
> CONFIG_PPP=m
> CONFIG_PPP_MULTILINK=y
> CONFIG_PPP_FILTER=y
> CONFIG_PPP_ASYNC=m
> CONFIG_PPP_SYNC_TTY=m
> CONFIG_PPP_DEFLATE=m
> # CONFIG_PPP_BSDCOMP is not set
> # CONFIG_PPP_MPPE is not set
> CONFIG_PPPOE=m
> CONFIG_PPPOATM=m
> # CONFIG_SLIP is not set
> CONFIG_NET_FC=y
> # CONFIG_SHAPER is not set
> CONFIG_NETCONSOLE=m
> CONFIG_NETPOLL=y
> # CONFIG_NETPOLL_RX is not set
> CONFIG_NETPOLL_TRAP=y
> CONFIG_NET_POLL_CONTROLLER=y
> 
> #
> # ISDN subsystem
> #
> CONFIG_ISDN=m
> 
> #
> # Old ISDN4Linux
> #
> CONFIG_ISDN_I4L=m
> CONFIG_ISDN_PPP=y
> CONFIG_ISDN_PPP_VJ=y
> CONFIG_ISDN_MPP=y
> CONFIG_IPPP_FILTER=y
> # CONFIG_ISDN_PPP_BSDCOMP is not set
> CONFIG_ISDN_AUDIO=y
> CONFIG_ISDN_TTY_FAX=y
> 
> #
> # ISDN feature submodules
> #
> # CONFIG_ISDN_DIVERSION is not set
> 
> #
> # ISDN4Linux hardware drivers
> #
> 
> #
> # Passive cards
> #
> CONFIG_ISDN_DRV_HISAX=m
> 
> #
> # D-channel protocol features
> #
> CONFIG_HISAX_EURO=y
> CONFIG_DE_AOC=y
> CONFIG_HISAX_NO_SENDCOMPLETE=y
> CONFIG_HISAX_NO_LLC=y
> CONFIG_HISAX_NO_KEYPAD=y
> CONFIG_HISAX_1TR6=y
> CONFIG_HISAX_NI1=y
> CONFIG_HISAX_MAX_CARDS=8
> 
> #
> # HiSax supported cards
> #
> CONFIG_HISAX_16_3=y
> CONFIG_HISAX_TELESPCI=y
> CONFIG_HISAX_S0BOX=y
> CONFIG_HISAX_FRITZPCI=y
> CONFIG_HISAX_AVM_A1_PCMCIA=y
> CONFIG_HISAX_ELSA=y
> CONFIG_HISAX_DIEHLDIVA=y
> CONFIG_HISAX_SEDLBAUER=y
> CONFIG_HISAX_NETJET=y
> CONFIG_HISAX_NETJET_U=y
> CONFIG_HISAX_NICCY=y
> CONFIG_HISAX_BKM_A4T=y
> CONFIG_HISAX_SCT_QUADRO=y
> CONFIG_HISAX_GAZEL=y
> CONFIG_HISAX_HFC_PCI=y
> CONFIG_HISAX_W6692=y
> CONFIG_HISAX_HFC_SX=y
> CONFIG_HISAX_ENTERNOW_PCI=y
> # CONFIG_HISAX_DEBUG is not set
> 
> #
> # HiSax PCMCIA card service modules
> #
> 
> #
> # HiSax sub driver modules
> #
> CONFIG_HISAX_ST5481=m
> CONFIG_HISAX_HFCUSB=m
> # CONFIG_HISAX_HFC4S8S is not set
> CONFIG_HISAX_FRITZ_PCIPNP=m
> CONFIG_HISAX_HDLC=y
> 
> #
> # Active cards
> #
> 
> #
> # Siemens Gigaset
> #
> # CONFIG_ISDN_DRV_GIGASET is not set
> 
> #
> # CAPI subsystem
> #
> CONFIG_ISDN_CAPI=m
> CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON=y
> CONFIG_ISDN_CAPI_MIDDLEWARE=y
> CONFIG_ISDN_CAPI_CAPI20=m
> CONFIG_ISDN_CAPI_CAPIFS_BOOL=y
> CONFIG_ISDN_CAPI_CAPIFS=m
> CONFIG_ISDN_CAPI_CAPIDRV=m
> 
> #
> # CAPI hardware drivers
> #
> 
> #
> # Active AVM cards
> #
> CONFIG_CAPI_AVM=y
> CONFIG_ISDN_DRV_AVMB1_B1PCI=m
> CONFIG_ISDN_DRV_AVMB1_B1PCIV4=y
> CONFIG_ISDN_DRV_AVMB1_B1PCMCIA=m
> CONFIG_ISDN_DRV_AVMB1_T1PCI=m
> CONFIG_ISDN_DRV_AVMB1_C4=m
> 
> #
> # Active Eicon DIVA Server cards
> #
> # CONFIG_CAPI_EICON is not set
> 
> #
> # Telephony Support
> #
> # CONFIG_PHONE is not set
> 
> #
> # Input device support
> #
> CONFIG_INPUT=y
> 
> #
> # Userland interfaces
> #
> CONFIG_INPUT_MOUSEDEV=y
> # CONFIG_INPUT_MOUSEDEV_PSAUX is not set
> CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
> CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
> CONFIG_INPUT_JOYDEV=m
> # CONFIG_INPUT_TSDEV is not set
> CONFIG_INPUT_EVDEV=y
> # CONFIG_INPUT_EVBUG is not set
> 
> #
> # Input Device Drivers
> #
> CONFIG_INPUT_KEYBOARD=y
> CONFIG_KEYBOARD_ATKBD=y
> # CONFIG_KEYBOARD_SUNKBD is not set
> # CONFIG_KEYBOARD_LKKBD is not set
> # CONFIG_KEYBOARD_XTKBD is not set
> # CONFIG_KEYBOARD_NEWTON is not set
> CONFIG_INPUT_MOUSE=y
> CONFIG_MOUSE_PS2=y
> CONFIG_MOUSE_SERIAL=m
> CONFIG_MOUSE_VSXXXAA=m
> CONFIG_INPUT_JOYSTICK=y
> # CONFIG_JOYSTICK_ANALOG is not set
> # CONFIG_JOYSTICK_A3D is not set
> # CONFIG_JOYSTICK_ADI is not set
> # CONFIG_JOYSTICK_COBRA is not set
> # CONFIG_JOYSTICK_GF2K is not set
> # CONFIG_JOYSTICK_GRIP is not set
> # CONFIG_JOYSTICK_GRIP_MP is not set
> # CONFIG_JOYSTICK_GUILLEMOT is not set
> # CONFIG_JOYSTICK_INTERACT is not set
> # CONFIG_JOYSTICK_SIDEWINDER is not set
> # CONFIG_JOYSTICK_TMDC is not set
> # CONFIG_JOYSTICK_IFORCE is not set
> # CONFIG_JOYSTICK_WARRIOR is not set
> # CONFIG_JOYSTICK_MAGELLAN is not set
> # CONFIG_JOYSTICK_SPACEORB is not set
> # CONFIG_JOYSTICK_SPACEBALL is not set
> # CONFIG_JOYSTICK_STINGER is not set
> # CONFIG_JOYSTICK_TWIDJOY is not set
> # CONFIG_JOYSTICK_DB9 is not set
> # CONFIG_JOYSTICK_GAMECON is not set
> # CONFIG_JOYSTICK_TURBOGRAFX is not set
> # CONFIG_JOYSTICK_JOYDUMP is not set
> CONFIG_INPUT_TOUCHSCREEN=y
> CONFIG_TOUCHSCREEN_GUNZE=m
> # CONFIG_TOUCHSCREEN_ELO is not set
> # CONFIG_TOUCHSCREEN_MTOUCH is not set
> # CONFIG_TOUCHSCREEN_MK712 is not set
> CONFIG_INPUT_MISC=y
> CONFIG_INPUT_PCSPKR=m
> CONFIG_INPUT_UINPUT=m
> 
> #
> # Hardware I/O ports
> #
> CONFIG_SERIO=y
> CONFIG_SERIO_I8042=y
> CONFIG_SERIO_SERPORT=y
> # CONFIG_SERIO_CT82C710 is not set
> # CONFIG_SERIO_PARKBD is not set
> # CONFIG_SERIO_PCIPS2 is not set
> CONFIG_SERIO_LIBPS2=y
> # CONFIG_SERIO_RAW is not set
> # CONFIG_GAMEPORT is not set
> 
> #
> # Character devices
> #
> CONFIG_VT=y
> CONFIG_VT_CONSOLE=y
> CONFIG_HW_CONSOLE=y
> # CONFIG_VT_HW_CONSOLE_BINDING is not set
> CONFIG_SERIAL_NONSTANDARD=y
> # CONFIG_COMPUTONE is not set
> # CONFIG_ROCKETPORT is not set
> # CONFIG_CYCLADES is not set
> # CONFIG_DIGIEPCA is not set
> # CONFIG_MOXA_INTELLIO is not set
> # CONFIG_MOXA_SMARTIO is not set
> # CONFIG_ISI is not set
> # CONFIG_SYNCLINK is not set
> # CONFIG_SYNCLINKMP is not set
> # CONFIG_SYNCLINK_GT is not set
> CONFIG_N_HDLC=m
> # CONFIG_SPECIALIX is not set
> # CONFIG_SX is not set
> # CONFIG_RIO is not set
> CONFIG_STALDRV=y
> 
> #
> # Serial drivers
> #
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_8250_CONSOLE=y
> CONFIG_SERIAL_8250_PCI=y
> CONFIG_SERIAL_8250_NR_UARTS=4
> CONFIG_SERIAL_8250_RUNTIME_UARTS=4
> CONFIG_SERIAL_8250_EXTENDED=y
> # CONFIG_SERIAL_8250_MANY_PORTS is not set
> CONFIG_SERIAL_8250_SHARE_IRQ=y
> CONFIG_SERIAL_8250_DETECT_IRQ=y
> CONFIG_SERIAL_8250_RSA=y
> 
> #
> # Non-8250 serial port support
> #
> CONFIG_SERIAL_CORE=y
> CONFIG_SERIAL_CORE_CONSOLE=y
> # CONFIG_SERIAL_JSM is not set
> CONFIG_UNIX98_PTYS=y
> # CONFIG_LEGACY_PTYS is not set
> CONFIG_PRINTER=m
> CONFIG_LP_CONSOLE=y
> CONFIG_PPDEV=m
> # CONFIG_TIPAR is not set
> 
> #
> # IPMI
> #
> CONFIG_IPMI_HANDLER=m
> # CONFIG_IPMI_PANIC_EVENT is not set
> CONFIG_IPMI_DEVICE_INTERFACE=m
> CONFIG_IPMI_SI=m
> CONFIG_IPMI_WATCHDOG=m
> CONFIG_IPMI_POWEROFF=m
> 
> #
> # Watchdog Cards
> #
> CONFIG_WATCHDOG=y
> # CONFIG_WATCHDOG_NOWAYOUT is not set
> 
> #
> # Watchdog Device Drivers
> #
> CONFIG_SOFT_WATCHDOG=m
> CONFIG_ACQUIRE_WDT=m
> CONFIG_ADVANTECH_WDT=m
> CONFIG_ALIM1535_WDT=m
> CONFIG_ALIM7101_WDT=m
> CONFIG_SC520_WDT=m
> CONFIG_EUROTECH_WDT=m
> CONFIG_IB700_WDT=m
> # CONFIG_IBMASR is not set
> CONFIG_WAFER_WDT=m
> # CONFIG_I6300ESB_WDT is not set
> CONFIG_I8XX_TCO=m
> CONFIG_SC1200_WDT=m
> # CONFIG_60XX_WDT is not set
> # CONFIG_SBC8360_WDT is not set
> CONFIG_CPU5_WDT=m
> CONFIG_W83627HF_WDT=m
> CONFIG_W83877F_WDT=m
> # CONFIG_W83977F_WDT is not set
> CONFIG_MACHZ_WDT=m
> # CONFIG_SBC_EPX_C3_WATCHDOG is not set
> 
> #
> # PCI-based Watchdog Cards
> #
> CONFIG_PCIPCWATCHDOG=m
> CONFIG_WDTPCI=m
> CONFIG_WDT_501_PCI=y
> 
> #
> # USB-based Watchdog Cards
> #
> CONFIG_USBPCWATCHDOG=m
> CONFIG_HW_RANDOM=y
> CONFIG_HW_RANDOM_INTEL=y
> CONFIG_HW_RANDOM_AMD=y
> CONFIG_HW_RANDOM_GEODE=y
> # CONFIG_NVRAM is not set
> CONFIG_RTC=y
> CONFIG_DTLK=m
> # CONFIG_R3964 is not set
> # CONFIG_APPLICOM is not set
> 
> #
> # Ftape, the floppy tape device driver
> #
> CONFIG_AGP=y
> CONFIG_AGP_AMD64=y
> # CONFIG_AGP_INTEL is not set
> # CONFIG_AGP_SIS is not set
> # CONFIG_AGP_VIA is not set
> CONFIG_DRM=y
> # CONFIG_DRM_TDFX is not set
> CONFIG_DRM_R128=m
> CONFIG_DRM_RADEON=m
> CONFIG_DRM_MGA=m
> # CONFIG_DRM_SIS is not set
> # CONFIG_DRM_VIA is not set
> # CONFIG_DRM_SAVAGE is not set
> # CONFIG_MWAVE is not set
> # CONFIG_PC8736x_GPIO is not set
> CONFIG_RAW_DRIVER=y
> CONFIG_MAX_RAW_DEVS=8192
> # CONFIG_HPET is not set
> CONFIG_HANGCHECK_TIMER=m
> 
> #
> # TPM devices
> #
> # CONFIG_TCG_TPM is not set
> # CONFIG_TELCLOCK is not set
> 
> #
> # I2C support
> #
> CONFIG_I2C=m
> CONFIG_I2C_CHARDEV=m
> 
> #
> # I2C Algorithms
> #
> CONFIG_I2C_ALGOBIT=m
> CONFIG_I2C_ALGOPCF=m
> CONFIG_I2C_ALGOPCA=m
> 
> #
> # I2C Hardware Bus support
> #
> CONFIG_I2C_ALI1535=m
> CONFIG_I2C_ALI1563=m
> CONFIG_I2C_ALI15X3=m
> CONFIG_I2C_AMD756=m
> # CONFIG_I2C_AMD756_S4882 is not set
> CONFIG_I2C_AMD8111=m
> CONFIG_I2C_I801=m
> CONFIG_I2C_I810=m
> # CONFIG_I2C_PIIX4 is not set
> CONFIG_I2C_ISA=m
> CONFIG_I2C_NFORCE2=m
> # CONFIG_I2C_OCORES is not set
> # CONFIG_I2C_PARPORT is not set
> # CONFIG_I2C_PARPORT_LIGHT is not set
> CONFIG_I2C_PROSAVAGE=m
> CONFIG_I2C_SAVAGE4=m
> CONFIG_I2C_SIS5595=m
> CONFIG_I2C_SIS630=m
> CONFIG_I2C_SIS96X=m
> # CONFIG_I2C_STUB is not set
> CONFIG_I2C_VIA=m
> CONFIG_I2C_VIAPRO=m
> CONFIG_I2C_VOODOO3=m
> # CONFIG_I2C_PCA_ISA is not set
> 
> #
> # Miscellaneous I2C Chip support
> #
> # CONFIG_SENSORS_DS1337 is not set
> # CONFIG_SENSORS_DS1374 is not set
> CONFIG_SENSORS_EEPROM=m
> CONFIG_SENSORS_PCF8574=m
> # CONFIG_SENSORS_PCA9539 is not set
> CONFIG_SENSORS_PCF8591=m
> # CONFIG_SENSORS_MAX6875 is not set
> # CONFIG_I2C_DEBUG_CORE is not set
> # CONFIG_I2C_DEBUG_ALGO is not set
> # CONFIG_I2C_DEBUG_BUS is not set
> # CONFIG_I2C_DEBUG_CHIP is not set
> 
> #
> # SPI support
> #
> # CONFIG_SPI is not set
> # CONFIG_SPI_MASTER is not set
> 
> #
> # Dallas's 1-wire bus
> #
> 
> #
> # Hardware Monitoring support
> #
> CONFIG_HWMON=y
> CONFIG_HWMON_VID=m
> # CONFIG_SENSORS_ABITUGURU is not set
> CONFIG_SENSORS_ADM1021=m
> CONFIG_SENSORS_ADM1025=m
> # CONFIG_SENSORS_ADM1026 is not set
> CONFIG_SENSORS_ADM1031=m
> # CONFIG_SENSORS_ADM9240 is not set
> CONFIG_SENSORS_ASB100=m
> # CONFIG_SENSORS_ATXP1 is not set
> CONFIG_SENSORS_DS1621=m
> # CONFIG_SENSORS_F71805F is not set
> CONFIG_SENSORS_FSCHER=m
> # CONFIG_SENSORS_FSCPOS is not set
> CONFIG_SENSORS_GL518SM=m
> # CONFIG_SENSORS_GL520SM is not set
> CONFIG_SENSORS_IT87=m
> # CONFIG_SENSORS_LM63 is not set
> CONFIG_SENSORS_LM75=m
> CONFIG_SENSORS_LM77=m
> CONFIG_SENSORS_LM78=m
> CONFIG_SENSORS_LM80=m
> CONFIG_SENSORS_LM83=m
> CONFIG_SENSORS_LM85=m
> # CONFIG_SENSORS_LM87 is not set
> CONFIG_SENSORS_LM90=m
> # CONFIG_SENSORS_LM92 is not set
> CONFIG_SENSORS_MAX1619=m
> # CONFIG_SENSORS_PC87360 is not set
> # CONFIG_SENSORS_SIS5595 is not set
> CONFIG_SENSORS_SMSC47M1=m
> # CONFIG_SENSORS_SMSC47M192 is not set
> # CONFIG_SENSORS_SMSC47B397 is not set
> CONFIG_SENSORS_VIA686A=m
> # CONFIG_SENSORS_VT8231 is not set
> CONFIG_SENSORS_W83781D=m
> # CONFIG_SENSORS_W83791D is not set
> # CONFIG_SENSORS_W83792D is not set
> CONFIG_SENSORS_W83L785TS=m
> CONFIG_SENSORS_W83627HF=m
> # CONFIG_SENSORS_W83627EHF is not set
> # CONFIG_SENSORS_HDAPS is not set
> # CONFIG_HWMON_DEBUG_CHIP is not set
> 
> #
> # Misc devices
> #
> # CONFIG_IBM_ASM is not set
> 
> #
> # Multimedia devices
> #
> CONFIG_VIDEO_DEV=m
> CONFIG_VIDEO_V4L1=y
> CONFIG_VIDEO_V4L1_COMPAT=y
> CONFIG_VIDEO_V4L2=y
> 
> #
> # Video Capture Adapters
> #
> 
> #
> # Video Capture Adapters
> #
> # CONFIG_VIDEO_ADV_DEBUG is not set
> # CONFIG_VIDEO_VIVI is not set
> # CONFIG_VIDEO_BT848 is not set
> # CONFIG_VIDEO_BWQCAM is not set
> # CONFIG_VIDEO_CQCAM is not set
> # CONFIG_VIDEO_W9966 is not set
> # CONFIG_VIDEO_CPIA is not set
> # CONFIG_VIDEO_CPIA2 is not set
> # CONFIG_VIDEO_SAA5246A is not set
> # CONFIG_VIDEO_SAA5249 is not set
> # CONFIG_TUNER_3036 is not set
> # CONFIG_VIDEO_STRADIS is not set
> # CONFIG_VIDEO_ZORAN is not set
> # CONFIG_VIDEO_SAA7134 is not set
> # CONFIG_VIDEO_MXB is not set
> # CONFIG_VIDEO_DPC is not set
> # CONFIG_VIDEO_HEXIUM_ORION is not set
> # CONFIG_VIDEO_HEXIUM_GEMINI is not set
> # CONFIG_VIDEO_CX88 is not set
> 
> #
> # Encoders and Decoders
> #
> # CONFIG_VIDEO_MSP3400 is not set
> # CONFIG_VIDEO_CS53L32A is not set
> # CONFIG_VIDEO_TLV320AIC23B is not set
> # CONFIG_VIDEO_WM8775 is not set
> # CONFIG_VIDEO_WM8739 is not set
> # CONFIG_VIDEO_CX2341X is not set
> # CONFIG_VIDEO_CX25840 is not set
> # CONFIG_VIDEO_SAA711X is not set
> # CONFIG_VIDEO_SAA7127 is not set
> # CONFIG_VIDEO_UPD64031A is not set
> # CONFIG_VIDEO_UPD64083 is not set
> 
> #
> # V4L USB devices
> #
> # CONFIG_VIDEO_PVRUSB2 is not set
> # CONFIG_VIDEO_EM28XX is not set
> CONFIG_VIDEO_USBVIDEO=m
> CONFIG_USB_VICAM=m
> CONFIG_USB_IBMCAM=m
> CONFIG_USB_KONICAWC=m
> # CONFIG_USB_QUICKCAM_MESSENGER is not set
> # CONFIG_USB_ET61X251 is not set
> CONFIG_VIDEO_OVCAMCHIP=m
> CONFIG_USB_W9968CF=m
> CONFIG_USB_OV511=m
> CONFIG_USB_SE401=m
> CONFIG_USB_SN9C102=m
> CONFIG_USB_STV680=m
> # CONFIG_USB_ZC0301 is not set
> CONFIG_USB_PWC=m
> # CONFIG_USB_PWC_DEBUG is not set
> 
> #
> # Radio Adapters
> #
> # CONFIG_RADIO_GEMTEK_PCI is not set
> # CONFIG_RADIO_MAXIRADIO is not set
> # CONFIG_RADIO_MAESTRO is not set
> CONFIG_USB_DSBR=m
> 
> #
> # Digital Video Broadcasting Devices
> #
> # CONFIG_DVB is not set
> CONFIG_USB_DABUSB=m
> 
> #
> # Graphics support
> #
> CONFIG_FIRMWARE_EDID=y
> CONFIG_FB=y
> CONFIG_FB_CFB_FILLRECT=y
> CONFIG_FB_CFB_COPYAREA=y
> CONFIG_FB_CFB_IMAGEBLIT=y
> # CONFIG_FB_MACMODES is not set
> # CONFIG_FB_BACKLIGHT is not set
> CONFIG_FB_MODE_HELPERS=y
> # CONFIG_FB_TILEBLITTING is not set
> CONFIG_FB_CIRRUS=m
> # CONFIG_FB_PM2 is not set
> # CONFIG_FB_CYBER2000 is not set
> # CONFIG_FB_ARC is not set
> # CONFIG_FB_ASILIANT is not set
> # CONFIG_FB_IMSTT is not set
> CONFIG_FB_VGA16=m
> CONFIG_FB_VESA=y
> # CONFIG_FB_HGA is not set
> # CONFIG_FB_S1D13XXX is not set
> # CONFIG_FB_NVIDIA is not set
> CONFIG_FB_RIVA=m
> # CONFIG_FB_RIVA_I2C is not set
> # CONFIG_FB_RIVA_DEBUG is not set
> # CONFIG_FB_INTEL is not set
> # CONFIG_FB_MATROX is not set
> # CONFIG_FB_RADEON is not set
> # CONFIG_FB_ATY128 is not set
> # CONFIG_FB_ATY is not set
> # CONFIG_FB_SAVAGE is not set
> # CONFIG_FB_SIS is not set
> # CONFIG_FB_NEOMAGIC is not set
> CONFIG_FB_KYRO=m
> # CONFIG_FB_3DFX is not set
> # CONFIG_FB_VOODOO1 is not set
> # CONFIG_FB_TRIDENT is not set
> # CONFIG_FB_GEODE is not set
> # CONFIG_FB_VIRTUAL is not set
> 
> #
> # Console display driver support
> #
> CONFIG_VGA_CONSOLE=y
> # CONFIG_VGACON_SOFT_SCROLLBACK is not set
> CONFIG_VIDEO_SELECT=y
> CONFIG_DUMMY_CONSOLE=y
> CONFIG_FRAMEBUFFER_CONSOLE=y
> # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
> # CONFIG_FONTS is not set
> CONFIG_FONT_8x8=y
> CONFIG_FONT_8x16=y
> 
> #
> # Logo configuration
> #
> CONFIG_LOGO=y
> # CONFIG_LOGO_LINUX_MONO is not set
> # CONFIG_LOGO_LINUX_VGA16 is not set
> CONFIG_LOGO_LINUX_CLUT224=y
> # CONFIG_BACKLIGHT_LCD_SUPPORT is not set
> 
> #
> # Sound
> #
> CONFIG_SOUND=m
> 
> #
> # Advanced Linux Sound Architecture
> #
> CONFIG_SND=m
> CONFIG_SND_TIMER=m
> CONFIG_SND_PCM=m
> CONFIG_SND_HWDEP=m
> CONFIG_SND_RAWMIDI=m
> CONFIG_SND_SEQUENCER=m
> CONFIG_SND_SEQ_DUMMY=m
> CONFIG_SND_OSSEMUL=y
> CONFIG_SND_MIXER_OSS=m
> CONFIG_SND_PCM_OSS=m
> CONFIG_SND_PCM_OSS_PLUGINS=y
> CONFIG_SND_SEQUENCER_OSS=y
> CONFIG_SND_RTCTIMER=m
> CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
> # CONFIG_SND_DYNAMIC_MINORS is not set
> CONFIG_SND_SUPPORT_OLD_API=y
> CONFIG_SND_VERBOSE_PROCFS=y
> # CONFIG_SND_VERBOSE_PRINTK is not set
> # CONFIG_SND_DEBUG is not set
> 
> #
> # Generic devices
> #
> CONFIG_SND_MPU401_UART=m
> CONFIG_SND_OPL3_LIB=m
> CONFIG_SND_VX_LIB=m
> CONFIG_SND_AC97_CODEC=m
> CONFIG_SND_AC97_BUS=m
> CONFIG_SND_DUMMY=m
> CONFIG_SND_VIRMIDI=m
> CONFIG_SND_MTPAV=m
> # CONFIG_SND_SERIAL_U16550 is not set
> CONFIG_SND_MPU401=m
> 
> #
> # PCI devices
> #
> # CONFIG_SND_AD1889 is not set
> # CONFIG_SND_ALS300 is not set
> CONFIG_SND_ALS4000=m
> CONFIG_SND_ALI5451=m
> CONFIG_SND_ATIIXP=m
> CONFIG_SND_ATIIXP_MODEM=m
> CONFIG_SND_AU8810=m
> CONFIG_SND_AU8820=m
> CONFIG_SND_AU8830=m
> CONFIG_SND_AZT3328=m
> CONFIG_SND_BT87X=m
> # CONFIG_SND_BT87X_OVERCLOCK is not set
> # CONFIG_SND_CA0106 is not set
> CONFIG_SND_CMIPCI=m
> CONFIG_SND_CS4281=m
> CONFIG_SND_CS46XX=m
> CONFIG_SND_CS46XX_NEW_DSP=y
> # CONFIG_SND_DARLA20 is not set
> # CONFIG_SND_GINA20 is not set
> # CONFIG_SND_LAYLA20 is not set
> # CONFIG_SND_DARLA24 is not set
> # CONFIG_SND_GINA24 is not set
> # CONFIG_SND_LAYLA24 is not set
> # CONFIG_SND_MONA is not set
> # CONFIG_SND_MIA is not set
> # CONFIG_SND_ECHO3G is not set
> # CONFIG_SND_INDIGO is not set
> # CONFIG_SND_INDIGOIO is not set
> # CONFIG_SND_INDIGODJ is not set
> CONFIG_SND_EMU10K1=m
> # CONFIG_SND_EMU10K1X is not set
> CONFIG_SND_ENS1370=m
> CONFIG_SND_ENS1371=m
> CONFIG_SND_ES1938=m
> CONFIG_SND_ES1968=m
> CONFIG_SND_FM801=m
> # CONFIG_SND_FM801_TEA575X_BOOL is not set
> # CONFIG_SND_HDA_INTEL is not set
> CONFIG_SND_HDSP=m
> # CONFIG_SND_HDSPM is not set
> CONFIG_SND_ICE1712=m
> CONFIG_SND_ICE1724=m
> CONFIG_SND_INTEL8X0=m
> CONFIG_SND_INTEL8X0M=m
> CONFIG_SND_KORG1212=m
> CONFIG_SND_MAESTRO3=m
> CONFIG_SND_MIXART=m
> CONFIG_SND_NM256=m
> # CONFIG_SND_PCXHR is not set
> # CONFIG_SND_RIPTIDE is not set
> CONFIG_SND_RME32=m
> CONFIG_SND_RME96=m
> CONFIG_SND_RME9652=m
> CONFIG_SND_SONICVIBES=m
> CONFIG_SND_TRIDENT=m
> CONFIG_SND_VIA82XX=m
> # CONFIG_SND_VIA82XX_MODEM is not set
> CONFIG_SND_VX222=m
> CONFIG_SND_YMFPCI=m
> 
> #
> # USB devices
> #
> CONFIG_SND_USB_AUDIO=m
> CONFIG_SND_USB_USX2Y=m
> 
> #
> # Open Sound System
> #
> # CONFIG_SOUND_PRIME is not set
> 
> #
> # USB support
> #
> CONFIG_USB_ARCH_HAS_HCD=y
> CONFIG_USB_ARCH_HAS_OHCI=y
> CONFIG_USB_ARCH_HAS_EHCI=y
> CONFIG_USB=y
> # CONFIG_USB_DEBUG is not set
> 
> #
> # Miscellaneous USB options
> #
> CONFIG_USB_DEVICEFS=y
> # CONFIG_USB_BANDWIDTH is not set
> # CONFIG_USB_DYNAMIC_MINORS is not set
> CONFIG_USB_SUSPEND=y
> # CONFIG_USB_OTG is not set
> 
> #
> # USB Host Controller Drivers
> #
> CONFIG_USB_EHCI_HCD=m
> CONFIG_USB_EHCI_SPLIT_ISO=y
> CONFIG_USB_EHCI_ROOT_HUB_TT=y
> # CONFIG_USB_EHCI_TT_NEWSCHED is not set
> # CONFIG_USB_ISP116X_HCD is not set
> CONFIG_USB_OHCI_HCD=m
> # CONFIG_USB_OHCI_BIG_ENDIAN is not set
> CONFIG_USB_OHCI_LITTLE_ENDIAN=y
> CONFIG_USB_UHCI_HCD=m
> # CONFIG_USB_SL811_HCD is not set
> 
> #
> # USB Device Class drivers
> #
> CONFIG_USB_ACM=m
> CONFIG_USB_PRINTER=m
> 
> #
> # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
> #
> 
> #
> # may also be needed; see USB_STORAGE Help for more information
> #
> CONFIG_USB_STORAGE=m
> # CONFIG_USB_STORAGE_DEBUG is not set
> CONFIG_USB_STORAGE_DATAFAB=y
> CONFIG_USB_STORAGE_FREECOM=y
> CONFIG_USB_STORAGE_ISD200=y
> CONFIG_USB_STORAGE_DPCM=y
> # CONFIG_USB_STORAGE_USBAT is not set
> CONFIG_USB_STORAGE_SDDR09=y
> CONFIG_USB_STORAGE_SDDR55=y
> CONFIG_USB_STORAGE_JUMPSHOT=y
> # CONFIG_USB_STORAGE_ALAUDA is not set
> # CONFIG_USB_LIBUSUAL is not set
> 
> #
> # USB Input Devices
> #
> CONFIG_USB_HID=y
> CONFIG_USB_HIDINPUT=y
> # CONFIG_USB_HIDINPUT_POWERBOOK is not set
> CONFIG_HID_FF=y
> CONFIG_HID_PID=y
> CONFIG_LOGITECH_FF=y
> CONFIG_THRUSTMASTER_FF=y
> CONFIG_USB_HIDDEV=y
> CONFIG_USB_AIPTEK=m
> CONFIG_USB_WACOM=m
> # CONFIG_USB_ACECAD is not set
> CONFIG_USB_KBTAB=m
> CONFIG_USB_POWERMATE=m
> # CONFIG_USB_TOUCHSCREEN is not set
> # CONFIG_USB_YEALINK is not set
> CONFIG_USB_XPAD=m
> CONFIG_USB_ATI_REMOTE=m
> # CONFIG_USB_ATI_REMOTE2 is not set
> # CONFIG_USB_KEYSPAN_REMOTE is not set
> # CONFIG_USB_APPLETOUCH is not set
> 
> #
> # USB Imaging devices
> #
> CONFIG_USB_MDC800=m
> CONFIG_USB_MICROTEK=m
> 
> #
> # USB Network Adapters
> #
> CONFIG_USB_CATC=m
> CONFIG_USB_KAWETH=m
> CONFIG_USB_PEGASUS=m
> CONFIG_USB_RTL8150=m
> CONFIG_USB_USBNET=m
> CONFIG_USB_NET_AX8817X=m
> CONFIG_USB_NET_CDCETHER=m
> # CONFIG_USB_NET_GL620A is not set
> CONFIG_USB_NET_NET1080=m
> # CONFIG_USB_NET_PLUSB is not set
> # CONFIG_USB_NET_RNDIS_HOST is not set
> # CONFIG_USB_NET_CDC_SUBSET is not set
> CONFIG_USB_NET_ZAURUS=m
> CONFIG_USB_MON=y
> 
> #
> # USB port drivers
> #
> CONFIG_USB_USS720=m
> 
> #
> # USB Serial Converter support
> #
> CONFIG_USB_SERIAL=m
> CONFIG_USB_SERIAL_GENERIC=y
> # CONFIG_USB_SERIAL_AIRPRIME is not set
> # CONFIG_USB_SERIAL_ARK3116 is not set
> CONFIG_USB_SERIAL_BELKIN=m
> # CONFIG_USB_SERIAL_WHITEHEAT is not set
> CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
> # CONFIG_USB_SERIAL_CP2101 is not set
> # CONFIG_USB_SERIAL_CYPRESS_M8 is not set
> CONFIG_USB_SERIAL_EMPEG=m
> CONFIG_USB_SERIAL_FTDI_SIO=m
> # CONFIG_USB_SERIAL_FUNSOFT is not set
> CONFIG_USB_SERIAL_VISOR=m
> CONFIG_USB_SERIAL_IPAQ=m
> CONFIG_USB_SERIAL_IR=m
> CONFIG_USB_SERIAL_EDGEPORT=m
> CONFIG_USB_SERIAL_EDGEPORT_TI=m
> # CONFIG_USB_SERIAL_GARMIN is not set
> # CONFIG_USB_SERIAL_IPW is not set
> CONFIG_USB_SERIAL_KEYSPAN_PDA=m
> CONFIG_USB_SERIAL_KEYSPAN=m
> CONFIG_USB_SERIAL_KEYSPAN_MPR=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28X=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28XA=y
> CONFIG_USB_SERIAL_KEYSPAN_USA28XB=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19=y
> CONFIG_USB_SERIAL_KEYSPAN_USA18X=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19W=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19QW=y
> CONFIG_USB_SERIAL_KEYSPAN_USA19QI=y
> CONFIG_USB_SERIAL_KEYSPAN_USA49W=y
> CONFIG_USB_SERIAL_KEYSPAN_USA49WLC=y
> CONFIG_USB_SERIAL_KLSI=m
> CONFIG_USB_SERIAL_KOBIL_SCT=m
> CONFIG_USB_SERIAL_MCT_U232=m
> # CONFIG_USB_SERIAL_NAVMAN is not set
> CONFIG_USB_SERIAL_PL2303=m
> # CONFIG_USB_SERIAL_HP4X is not set
> CONFIG_USB_SERIAL_SAFE=m
> CONFIG_USB_SERIAL_SAFE_PADDED=y
> # CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
> # CONFIG_USB_SERIAL_TI is not set
> CONFIG_USB_SERIAL_CYBERJACK=m
> CONFIG_USB_SERIAL_XIRCOM=m
> # CONFIG_USB_SERIAL_OPTION is not set
> CONFIG_USB_SERIAL_OMNINET=m
> CONFIG_USB_EZUSB=y
> 
> #
> # USB Miscellaneous drivers
> #
> CONFIG_USB_EMI62=m
> # CONFIG_USB_EMI26 is not set
> CONFIG_USB_AUERSWALD=m
> CONFIG_USB_RIO500=m
> CONFIG_USB_LEGOTOWER=m
> CONFIG_USB_LCD=m
> CONFIG_USB_LED=m
> # CONFIG_USB_CYPRESS_CY7C63 is not set
> # CONFIG_USB_CYTHERM is not set
> # CONFIG_USB_PHIDGETKIT is not set
> CONFIG_USB_PHIDGETSERVO=m
> # CONFIG_USB_IDMOUSE is not set
> # CONFIG_USB_APPLEDISPLAY is not set
> # CONFIG_USB_SISUSBVGA is not set
> # CONFIG_USB_LD is not set
> CONFIG_USB_TEST=m
> 
> #
> # USB DSL modem support
> #
> CONFIG_USB_ATM=m
> CONFIG_USB_SPEEDTOUCH=m
> # CONFIG_USB_CXACRU is not set
> # CONFIG_USB_UEAGLEATM is not set
> # CONFIG_USB_XUSBATM is not set
> 
> #
> # USB Gadget Support
> #
> # CONFIG_USB_GADGET is not set
> 
> #
> # MMC/SD Card support
> #
> # CONFIG_MMC is not set
> 
> #
> # LED devices
> #
> # CONFIG_NEW_LEDS is not set
> 
> #
> # LED drivers
> #
> 
> #
> # LED Triggers
> #
> 
> #
> # InfiniBand support
> #
> CONFIG_INFINIBAND=m
> CONFIG_INFINIBAND_USER_MAD=m
> CONFIG_INFINIBAND_USER_ACCESS=m
> CONFIG_INFINIBAND_ADDR_TRANS=y
> CONFIG_INFINIBAND_MTHCA=m
> CONFIG_INFINIBAND_MTHCA_DEBUG=y
> # CONFIG_IPATH_CORE is not set
> CONFIG_INFINIBAND_IPOIB=m
> CONFIG_INFINIBAND_IPOIB_DEBUG=y
> # CONFIG_INFINIBAND_IPOIB_DEBUG_DATA is not set
> CONFIG_INFINIBAND_SRP=m
> # CONFIG_INFINIBAND_ISER is not set
> 
> #
> # EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
> #
> CONFIG_EDAC=m
> 
> #
> # Reporting subsystems
> #
> # CONFIG_EDAC_DEBUG is not set
> CONFIG_EDAC_MM_EDAC=m
> CONFIG_EDAC_E752X=m
> CONFIG_EDAC_POLL=y
> 
> #
> # Real Time Clock
> #
> # CONFIG_RTC_CLASS is not set
> 
> #
> # DMA Engine support
> #
> # CONFIG_DMA_ENGINE is not set
> 
> #
> # DMA Clients
> #
> 
> #
> # DMA Devices
> #
> 
> #
> # Firmware Drivers
> #
> CONFIG_EDD=m
> CONFIG_DELL_RBU=m
> # CONFIG_DCDBAS is not set
> 
> #
> # File systems
> #
> CONFIG_EXT2_FS=y
> CONFIG_EXT2_FS_XATTR=y
> CONFIG_EXT2_FS_POSIX_ACL=y
> CONFIG_EXT2_FS_SECURITY=y
> # CONFIG_EXT2_FS_XIP is not set
> CONFIG_EXT3_FS=m
> CONFIG_EXT3_FS_XATTR=y
> CONFIG_EXT3_FS_POSIX_ACL=y
> CONFIG_EXT3_FS_SECURITY=y
> CONFIG_JBD=m
> # CONFIG_JBD_DEBUG is not set
> CONFIG_FS_MBCACHE=y
> CONFIG_REISERFS_FS=m
> # CONFIG_REISERFS_CHECK is not set
> # CONFIG_REISERFS_PROC_INFO is not set
> # CONFIG_REISERFS_FS_XATTR is not set
> CONFIG_JFS_FS=m
> CONFIG_JFS_POSIX_ACL=y
> CONFIG_JFS_SECURITY=y
> # CONFIG_JFS_DEBUG is not set
> # CONFIG_JFS_STATISTICS is not set
> CONFIG_FS_POSIX_ACL=y
> CONFIG_XFS_FS=m
> CONFIG_XFS_QUOTA=y
> CONFIG_XFS_SECURITY=y
> CONFIG_XFS_POSIX_ACL=y
> CONFIG_XFS_RT=y
> # CONFIG_OCFS2_FS is not set
> # CONFIG_MINIX_FS is not set
> # CONFIG_ROMFS_FS is not set
> CONFIG_INOTIFY=y
> CONFIG_INOTIFY_USER=y
> CONFIG_QUOTA=y
> # CONFIG_QFMT_V1 is not set
> CONFIG_QFMT_V2=y
> CONFIG_QUOTACTL=y
> CONFIG_DNOTIFY=y
> # CONFIG_AUTOFS_FS is not set
> CONFIG_AUTOFS4_FS=m
> # CONFIG_FUSE_FS is not set
> 
> #
> # CD-ROM/DVD Filesystems
> #
> CONFIG_ISO9660_FS=y
> CONFIG_JOLIET=y
> CONFIG_ZISOFS=y
> CONFIG_ZISOFS_FS=y
> CONFIG_UDF_FS=m
> CONFIG_UDF_NLS=y
> 
> #
> # DOS/FAT/NT Filesystems
> #
> CONFIG_FAT_FS=m
> CONFIG_MSDOS_FS=m
> CONFIG_VFAT_FS=m
> CONFIG_FAT_DEFAULT_CODEPAGE=437
> CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
> # CONFIG_NTFS_FS is not set
> 
> #
> # Pseudo filesystems
> #
> CONFIG_PROC_FS=y
> CONFIG_PROC_KCORE=y
> CONFIG_SYSFS=y
> CONFIG_TMPFS=y
> CONFIG_HUGETLBFS=y
> CONFIG_HUGETLB_PAGE=y
> CONFIG_RAMFS=y
> # CONFIG_CONFIGFS_FS is not set
> 
> #
> # Miscellaneous filesystems
> #
> # CONFIG_ADFS_FS is not set
> # CONFIG_AFFS_FS is not set
> CONFIG_HFS_FS=m
> CONFIG_HFSPLUS_FS=m
> # CONFIG_BEFS_FS is not set
> # CONFIG_BFS_FS is not set
> # CONFIG_EFS_FS is not set
> # CONFIG_JFFS_FS is not set
> CONFIG_JFFS2_FS=m
> CONFIG_JFFS2_FS_DEBUG=0
> CONFIG_JFFS2_FS_WRITEBUFFER=y
> # CONFIG_JFFS2_SUMMARY is not set
> # CONFIG_JFFS2_FS_XATTR is not set
> # CONFIG_JFFS2_COMPRESSION_OPTIONS is not set
> CONFIG_JFFS2_ZLIB=y
> CONFIG_JFFS2_RTIME=y
> # CONFIG_JFFS2_RUBIN is not set
> CONFIG_CRAMFS=m
> CONFIG_VXFS_FS=m
> # CONFIG_HPFS_FS is not set
> # CONFIG_QNX4FS_FS is not set
> # CONFIG_SYSV_FS is not set
> # CONFIG_UFS_FS is not set
> 
> #
> # Network File Systems
> #
> CONFIG_NFS_FS=m
> CONFIG_NFS_V3=y
> CONFIG_NFS_V3_ACL=y
> # CONFIG_NFS_V4 is not set
> # CONFIG_NFS_DIRECTIO is not set
> CONFIG_SUNRPC_XPRT_RDMA=m
> CONFIG_NFSD=m
> CONFIG_NFSD_V2_ACL=y
> CONFIG_NFSD_V3=y
> CONFIG_NFSD_V3_ACL=y
> # CONFIG_NFSD_V4 is not set
> CONFIG_NFSD_TCP=y
> CONFIG_NFSD_RDMA=y
> CONFIG_LOCKD=m
> CONFIG_LOCKD_V4=y
> CONFIG_EXPORTFS=m
> CONFIG_NFS_ACL_SUPPORT=m
> CONFIG_NFS_COMMON=y
> CONFIG_SUNRPC=m
> # CONFIG_RPCBIND_VERSION3 is not set
> # CONFIG_RPCSEC_GSS_KRB5 is not set
> # CONFIG_RPCSEC_GSS_SPKM3 is not set
> CONFIG_SMB_FS=m
> # CONFIG_SMB_NLS_DEFAULT is not set
> CONFIG_CIFS=m
> # CONFIG_CIFS_STATS is not set
> # CONFIG_CIFS_WEAK_PW_HASH is not set
> CONFIG_CIFS_XATTR=y
> CONFIG_CIFS_POSIX=y
> # CONFIG_CIFS_DEBUG2 is not set
> # CONFIG_CIFS_EXPERIMENTAL is not set
> # CONFIG_NCP_FS is not set
> # CONFIG_CODA_FS is not set
> # CONFIG_AFS_FS is not set
> # CONFIG_9P_FS is not set
> 
> #
> # Partition Types
> #
> CONFIG_PARTITION_ADVANCED=y
> # CONFIG_ACORN_PARTITION is not set
> CONFIG_OSF_PARTITION=y
> # CONFIG_AMIGA_PARTITION is not set
> # CONFIG_ATARI_PARTITION is not set
> CONFIG_MAC_PARTITION=y
> CONFIG_MSDOS_PARTITION=y
> CONFIG_BSD_DISKLABEL=y
> CONFIG_MINIX_SUBPARTITION=y
> CONFIG_SOLARIS_X86_PARTITION=y
> CONFIG_UNIXWARE_DISKLABEL=y
> # CONFIG_LDM_PARTITION is not set
> CONFIG_SGI_PARTITION=y
> # CONFIG_ULTRIX_PARTITION is not set
> CONFIG_SUN_PARTITION=y
> # CONFIG_KARMA_PARTITION is not set
> CONFIG_EFI_PARTITION=y
> 
> #
> # Native Language Support
> #
> CONFIG_NLS=y
> CONFIG_NLS_DEFAULT="utf8"
> CONFIG_NLS_CODEPAGE_437=y
> CONFIG_NLS_CODEPAGE_737=m
> CONFIG_NLS_CODEPAGE_775=m
> CONFIG_NLS_CODEPAGE_850=m
> CONFIG_NLS_CODEPAGE_852=m
> CONFIG_NLS_CODEPAGE_855=m
> CONFIG_NLS_CODEPAGE_857=m
> CONFIG_NLS_CODEPAGE_860=m
> CONFIG_NLS_CODEPAGE_861=m
> CONFIG_NLS_CODEPAGE_862=m
> CONFIG_NLS_CODEPAGE_863=m
> CONFIG_NLS_CODEPAGE_864=m
> CONFIG_NLS_CODEPAGE_865=m
> CONFIG_NLS_CODEPAGE_866=m
> CONFIG_NLS_CODEPAGE_869=m
> CONFIG_NLS_CODEPAGE_936=m
> CONFIG_NLS_CODEPAGE_950=m
> CONFIG_NLS_CODEPAGE_932=m
> CONFIG_NLS_CODEPAGE_949=m
> CONFIG_NLS_CODEPAGE_874=m
> CONFIG_NLS_ISO8859_8=m
> CONFIG_NLS_CODEPAGE_1250=m
> CONFIG_NLS_CODEPAGE_1251=m
> CONFIG_NLS_ASCII=y
> CONFIG_NLS_ISO8859_1=m
> CONFIG_NLS_ISO8859_2=m
> CONFIG_NLS_ISO8859_3=m
> CONFIG_NLS_ISO8859_4=m
> CONFIG_NLS_ISO8859_5=m
> CONFIG_NLS_ISO8859_6=m
> CONFIG_NLS_ISO8859_7=m
> CONFIG_NLS_ISO8859_9=m
> CONFIG_NLS_ISO8859_13=m
> CONFIG_NLS_ISO8859_14=m
> CONFIG_NLS_ISO8859_15=m
> CONFIG_NLS_KOI8_R=m
> CONFIG_NLS_KOI8_U=m
> CONFIG_NLS_UTF8=m
> 
> #
> # Instrumentation Support
> #
> CONFIG_PROFILING=y
> CONFIG_OPROFILE=m
> CONFIG_KPROBES=y
> 
> #
> # Kernel hacking
> #
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> # CONFIG_PRINTK_TIME is not set
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_UNUSED_SYMBOLS=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_LOG_BUF_SHIFT=17
> CONFIG_DETECT_SOFTLOCKUP=y
> # CONFIG_SCHEDSTATS is not set
> # CONFIG_DEBUG_SLAB is not set
> # CONFIG_DEBUG_RT_MUTEXES is not set
> # CONFIG_RT_MUTEX_TESTER is not set
> CONFIG_DEBUG_SPINLOCK=y
> # CONFIG_DEBUG_MUTEXES is not set
> # CONFIG_DEBUG_RWSEMS is not set
> # CONFIG_DEBUG_LOCK_ALLOC is not set
> # CONFIG_PROVE_LOCKING is not set
> CONFIG_DEBUG_SPINLOCK_SLEEP=y
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> # CONFIG_DEBUG_KOBJECT is not set
> CONFIG_DEBUG_INFO=y
> # CONFIG_DEBUG_FS is not set
> # CONFIG_DEBUG_VM is not set
> # CONFIG_FRAME_POINTER is not set
> # CONFIG_UNWIND_INFO is not set
> CONFIG_FORCED_INLINING=y
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_DEBUG_RODATA is not set
> # CONFIG_IOMMU_DEBUG is not set
> # CONFIG_DEBUG_STACKOVERFLOW is not set
> # CONFIG_DEBUG_STACK_USAGE is not set
> 
> #
> # Security options
> #
> CONFIG_KEYS=y
> CONFIG_KEYS_DEBUG_PROC_KEYS=y
> CONFIG_SECURITY=y
> CONFIG_SECURITY_NETWORK=y
> # CONFIG_SECURITY_NETWORK_XFRM is not set
> CONFIG_SECURITY_CAPABILITIES=y
> # CONFIG_SECURITY_ROOTPLUG is not set
> # CONFIG_SECURITY_SECLVL is not set
> CONFIG_SECURITY_SELINUX=y
> CONFIG_SECURITY_SELINUX_BOOTPARAM=y
> CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
> CONFIG_SECURITY_SELINUX_DISABLE=y
> CONFIG_SECURITY_SELINUX_DEVELOP=y
> CONFIG_SECURITY_SELINUX_AVC_STATS=y
> CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
> # CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT is not set
> 
> #
> # Cryptographic options
> #
> CONFIG_CRYPTO=y
> CONFIG_CRYPTO_HMAC=y
> CONFIG_CRYPTO_NULL=m
> CONFIG_CRYPTO_MD4=m
> CONFIG_CRYPTO_MD5=y
> CONFIG_CRYPTO_SHA1=y
> CONFIG_CRYPTO_SHA256=m
> CONFIG_CRYPTO_SHA512=m
> CONFIG_CRYPTO_WP512=m
> # CONFIG_CRYPTO_TGR192 is not set
> CONFIG_CRYPTO_DES=m
> CONFIG_CRYPTO_BLOWFISH=m
> CONFIG_CRYPTO_TWOFISH=m
> CONFIG_CRYPTO_SERPENT=m
> CONFIG_CRYPTO_AES=m
> # CONFIG_CRYPTO_AES_X86_64 is not set
> CONFIG_CRYPTO_CAST5=m
> CONFIG_CRYPTO_CAST6=m
> CONFIG_CRYPTO_TEA=m
> CONFIG_CRYPTO_ARC4=m
> CONFIG_CRYPTO_KHAZAD=m
> # CONFIG_CRYPTO_ANUBIS is not set
> CONFIG_CRYPTO_DEFLATE=m
> CONFIG_CRYPTO_MICHAEL_MIC=m
> CONFIG_CRYPTO_CRC32C=m
> # CONFIG_CRYPTO_TEST is not set
> 
> #
> # Hardware crypto devices
> #
> 
> #
> # Library routines
> #
> CONFIG_CRC_CCITT=m
> # CONFIG_CRC16 is not set
> CONFIG_CRC32=y
> CONFIG_LIBCRC32C=m
> CONFIG_ZLIB_INFLATE=y
> CONFIG_ZLIB_DEFLATE=m
> CONFIG_TEXTSEARCH=y
> CONFIG_TEXTSEARCH_KMP=m
> CONFIG_PLIST=y
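
For anyone scanning a config like the one quoted above for a specific option (for example the InfiniBand or NFS/RDMA entries), here is a minimal sketch of a lookup helper. The function name config_state and the sample file are hypothetical and only for illustration; the helper assumes the usual .config conventions of "OPTION=value" and "# OPTION is not set", and it strips a leading "> " so quoted mail text can be fed in directly.

```sh
#!/bin/sh
# Hypothetical helper: report how OPTION is set in a kernel config FILE.
# Prints the value (y, m, a number, a string), or "n" when the option
# is marked "is not set" or does not appear at all.
config_state() {
    file=$1
    option=$2
    # Drop a leading "> " mail quote marker, then look for "OPTION=".
    sed 's/^> \{0,1\}//' "$file" |
        awk -v opt="$option" '
            index($0, opt "=") == 1 {
                print substr($0, length(opt) + 2)
                found = 1
                exit
            }
            END { if (!found) print "n" }'
}

# Small demonstration using three lines copied from the quoted config.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> CONFIG_INFINIBAND=m
> CONFIG_INFINIBAND_MTHCA_DEBUG=y
> # CONFIG_INFINIBAND_ISER is not set
EOF
for opt in CONFIG_INFINIBAND CONFIG_INFINIBAND_MTHCA_DEBUG CONFIG_INFINIBAND_ISER; do
    printf '%s=%s\n' "$opt" "$(config_state "$sample" "$opt")"
done
rm -f "$sample"
```

Matching on "OPTION=" at the start of the line, rather than on the bare option name, keeps CONFIG_INFINIBAND from also matching CONFIG_INFINIBAND_MTHCA and similar longer names.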


From vuhuong at mellanox.com  Wed Dec 13 14:57:03 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 13 Dec 2006 14:57:03 -0800
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <1166049650.10873.9.camel@trinity.ogc.int>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
	<1166049650.10873.9.camel@trinity.ogc.int>
Message-ID: <4580853F.9070907@mellanox.com>


>>> 2. Can you please send me the iozone test parameters you're using?
>>>
>> server has 8GB of mem, client has 2GB of mem
>>
>> iozone -r 64KB -s 5g -i 0 -i 1
>> and
>> iozone -r 64KB -s 2g -i 0 -i 1 -t 3
>>
> 
> Can you please send me the iozone output you get from these commands?

Here it is

-vu


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: iozone.output
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061213/74b8d0f8/attachment.ksh>

From vuhuong at mellanox.com  Wed Dec 13 15:02:01 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 13 Dec 2006 15:02:01 -0800
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <Pine.LNX.4.64.0612131552340.20796@jlentini-linux.nane.netapp.com>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
	<Pine.LNX.4.64.0612131552340.20796@jlentini-linux.nane.netapp.com>
Message-ID: <45808669.2040602@mellanox.com>

James Lentini wrote:
> 
> On Tue, 12 Dec 2006, Vu Pham wrote:
> 
>>>> 2.  While some clients run I/Os, one idle client tries to access the mount
>>>> point, e.g. with *ls*, and gets an I/O error. I see these error messages
>>>> in the server log
> 
> Was there anything in the log before this point? I'd expect to see a 
> message starting with "svcrdma: failed to post SQ..."

There is no such message in the log before this point


From Brian.Cain at ge.com  Wed Dec 13 15:15:51 2006
From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare))
Date: Wed, 13 Dec 2006 18:15:51 -0500
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <ada8xhbsahi.fsf@cisco.com>
Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com>

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> Sent: Wednesday, December 13, 2006 4:21 PM
> To: Cain, Brian (GE Healthcare)
> Cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCH] install.sh: Cause less 
> pain to SRP users who didn't RTFM
> 
> 
>  > +                   echo '!!WARNING!! SRP is not supported 
> for 32-bit OS running on 64-bit capable hardware'
> 
> Did I miss something?  Why doesn't SRP work with 32-bit userspace on
> 64-bit capable hardware?  In fact why doesn't it work with 32-bit
> userspace on a 64-bit kernel?

AFAICT, it's not a userspace/kernel issue, it's a hardware capability/OS
target issue.

>From srp_release_notes.txt:
~~~~~
==============================================================================
11. Known Issues
==============================================================================

- SRP is not supported on a 32-bit operating system running on a 64-bit
  platform.

~~~~~
Maybe the tests in the patch aren't appropriate for detecting this case,
but it looked right to me.

-Brian
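
To make the condition under discussion concrete, here is a minimal sketch of one way a shell installer could test for "32-bit OS on 64-bit capable hardware" on x86. This is not the code from the actual install.sh patch; the function name is hypothetical, and the approach assumes that `uname -m` reporting i386..i686 means a 32-bit kernel and that the "lm" (long mode) CPU flag in /proc/cpuinfo means the processor can run 64-bit code.

```sh
#!/bin/sh
# Hypothetical helper, not taken from the install.sh patch.
# Usage: is_32bit_os_on_64bit_hw MACHINE CPUINFO_FILE
# Succeeds when MACHINE names a 32-bit x86 kernel and CPUINFO_FILE
# lists the "lm" (long mode) flag, i.e. the CPU is 64-bit capable.
is_32bit_os_on_64bit_hw() {
    machine=$1
    cpuinfo=$2
    case "$machine" in
        i?86) ;;              # i386, i486, i586, i686: 32-bit kernel
        *) return 1 ;;        # anything else is not the case we look for
    esac
    # Match "lm" only as a whole word on a "flags" line.
    grep -Eq '^flags[[:space:]]*:.* lm( |$)' "$cpuinfo" 2>/dev/null
}

if is_32bit_os_on_64bit_hw "$(uname -m)" /proc/cpuinfo; then
    echo '!!WARNING!! SRP is not supported for 32-bit OS running on 64-bit capable hardware'
fi
```

Note that `uname -m` describes the running kernel, so a check of this kind detects a 32-bit kernel on a 64-bit capable CPU; it says nothing about 32-bit userspace on a 64-bit kernel, which is a separate case.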


From sashak at voltaire.com  Wed Dec 13 15:26:38 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 01:26:38 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061212122957.GC14622@mellanox.co.il>
References: <20061210225613.GF21155@sashak.voltaire.com>
	<20061212122957.GC14622@mellanox.co.il>
Message-ID: <20061213232638.GC14186@sashak.voltaire.com>

On 14:29 Tue 12 Dec     , Michael S. Tsirkin wrote:
> > It is still unclear to me how long we may need this - 1.1 is still in
> > SVN, and the 1.1 git branch is updated there.
> 
> By the way, one can't actually build OFED 1.1 userspace from git
> because OFED also applies some patches after checking things out
> from svn. They are here:
> https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes

I guess those patches should be committed to the 1.1 svn branch (and imported
into git's 1.1). Any reason why they are not committed?

Sasha


From rdreier at cisco.com  Wed Dec 13 15:47:31 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 15:47:31 -0800
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com>
	(Brian Cain's message of "Wed, 13 Dec 2006 18:15:51 -0500")
References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBEFC0@CINMLVEM11.e2k.ad.ge.com>
Message-ID: <adaejr3qrxo.fsf@cisco.com>

 > - SRP is not supported on a 32-bit operating system running on a 64-bit
 >   platform.

Hmm, who wrote the release notes, and why was that put in?

I don't know of any reason why the SRP initiator wouldn't work in a
mixed 32-bit userspace / 64-bit kernel environment.

 - R.


From vuhuong at mellanox.com  Wed Dec 13 15:50:10 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 13 Dec 2006 15:50:10 -0800
Subject: [openib-general] nfsrdma release 7 issues,
In-Reply-To: <4580853F.9070907@mellanox.com>
References: <457F34B3.9060402@mellanox.com>
	<1165966574.8722.110.camel@trinity.ogc.int>
	<457F426B.7020104@mellanox.com>
	<1166049650.10873.9.camel@trinity.ogc.int>
	<4580853F.9070907@mellanox.com>
Message-ID: <458091B2.1030905@mellanox.com>

Tom,
   Here is the iozone output with the same hw configuration; 
however, the server is now running nfsrdma release 6, while the 
client is still running nfsrdma release 7

-vu

> 
>>>> 2. Can you please send me the iozone test parameters you're using?
>>>>
>>> server has 8GB of mem, client has 2GB of mem
>>>
>>> iozone -r 64KB -s 5g -i 0 -i 1
>>> and
>>> iozone -r 64KB -s 2g -i 0 -i 1 -t 3
>>>
>>
>> Can you please send me the iozone output you get from these commands?
> 
> Here it is
> 
> -vu
> 
> 
> 
> 
> ------------------------------------------------------------------------
> 
> [root at ibd001 ~]# cat /proc/meminfo 
> MemTotal:      2056688 kB
> MemFree:       1851248 kB
> Buffers:         12644 kB
> Cached:          91764 kB
> SwapCached:          0 kB
> Active:          69400 kB
> Inactive:        76536 kB
> HighTotal:           0 kB
> HighFree:            0 kB
> LowTotal:      2056688 kB
> LowFree:       1851248 kB
> SwapTotal:     4192924 kB
> SwapFree:      4192924 kB
> Dirty:            1048 kB
> Writeback:           4 kB
> AnonPages:       41584 kB
> Mapped:           6968 kB
> Slab:            26760 kB
> PageTables:       2072 kB
> NFS_Unstable:        0 kB
> Bounce:              0 kB
> CommitLimit:   5221268 kB
> Committed_AS:    71812 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed:      2500 kB
> VmallocChunk: 34359735671 kB
> HugePages_Total:     0
> HugePages_Free:      0
> HugePages_Rsvd:      0
> Hugepagesize:     2048 kB
> [root at ibd001 ~]# . /etc/nfsrdma-v7
> Doing nfs/rdma mount to 193.168.13.202, mount protocol to 193.168.13.202
> [root at ibd001 ~]# 
> [root at ibd001 ~]# 
> [root at ibd001 ~]# cd /vol-202
> [root at ibd001 vol-202]# iozone -r 64KB -s 5g -i 0 -i 1
>         Iozone: Performance Test of File I/O
>                 Version $Revision: 3.263 $
>                 Compiled for 32 bit mode.
>                 Build: linux 
> 
>         Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
>                      Al Slater, Scott Rhine, Mike Wisner, Ken Goss
>                      Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
>                      Randy Dunlap, Mark Montague, Dan Million, 
>                      Jean-Marc Zucconi, Jeff Blomberg,
>                      Erik Habbinga, Kris Strecker, Walter Wong.
> 
>         Run began: Wed Dec 13 14:36:18 2006
> 
>         Record Size 64 KB
>         File size set to 5242880 KB
>         Command line used: iozone -r 64KB -s 5g -i 0 -i 1
>         Output is in Kbytes/sec
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>                                                             random  random    bkwd  record  stride                                   
>               KB  reclen   write rewrite    read    reread    read   write    read rewrite    read   fwrite frewrite   fread  freread
>          5242880      64  179970  257954   441693   485204                                                                          
> 
> iozone test complete.
> [root at ibd001 vol-202]#
> [root at ibd001 vol-202]# iozone -r 64KB -s 2g -i 0 -i 1 -t 3
>         Iozone: Performance Test of File I/O
>                 Version $Revision: 3.263 $
>                 Compiled for 32 bit mode.
>                 Build: linux 
> 
>         Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
>                      Al Slater, Scott Rhine, Mike Wisner, Ken Goss
>                      Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
>                      Randy Dunlap, Mark Montague, Dan Million, 
>                      Jean-Marc Zucconi, Jeff Blomberg,
>                      Erik Habbinga, Kris Strecker, Walter Wong.
> 
>         Run began: Wed Dec 13 14:39:41 2006
> 
>         Record Size 64 KB
>         File size set to 2097152 KB
>         Command line used: iozone -r 64KB -s 2g -i 0 -i 1 -t 3
>         Output is in Kbytes/sec
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>         Throughput test with 3 processes
>         Each process writes a 2097152 Kbyte file in 64 Kbyte records
> 
>         Children see throughput for  3 initial writers  =  220949.31 KB/sec
>         Parent sees throughput for  3 initial writers   =  204066.05 KB/sec
>         Min throughput per process                      =   68142.53 KB/sec 
>         Max throughput per process                      =   82785.59 KB/sec
>         Avg throughput per process                      =   73649.77 KB/sec
>         Min xfer                                        = 1971712.00 KB
> 
>         Children see throughput for  3 rewriters        =  307993.49 KB/sec
>         Parent sees throughput for  3 rewriters         =  293288.28 KB/sec
>         Min throughput per process                      =   92883.50 KB/sec 
>         Max throughput per process                      =  119024.17 KB/sec
>         Avg throughput per process                      =  102664.50 KB/sec
>         Min xfer                                        = 1799616.00 KB
> 
>         Children see throughput for  3 readers          =  423371.39 KB/sec
>         Parent sees throughput for  3 readers           =  423168.28 KB/sec
>         Min throughput per process                      =  139781.50 KB/sec 
>         Max throughput per process                      =  142646.52 KB/sec
>         Avg throughput per process                      =  141123.80 KB/sec
>         Min xfer                                        = 2055232.00 KB
> 
>         Children see throughput for 3 re-readers        =  447745.98 KB/sec
>         Parent sees throughput for 3 re-readers         =  447678.57 KB/sec
>         Min throughput per process                      =  148235.48 KB/sec 
>         Max throughput per process                      =  149965.86 KB/sec
>         Avg throughput per process                      =  149248.66 KB/sec
>         Min xfer                                        = 2072512.00 KB
> 
> 
> 
> iozone test complete.
>  

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: iozone.v6.output
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061213/83b9b718/attachment.ksh>

From sweitzen at cisco.com  Wed Dec 13 16:14:45 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 13 Dec 2006 16:14:45 -0800
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B306D3@xmb-sjc-216.amer.cisco.com>

What problem are you seeing?  We have tested SRP on 32-bit SLES10
running on 64-bit Opteron hardware.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of Cain, 
> Brian (GE Healthcare)
> Sent: Wednesday, December 13, 2006 2:09 PM
> To: openib-general at openib.org
> Subject: [openib-general] [PATCH] install.sh: Cause less pain 
> to SRP users who didn't RTFM
> 
> There's gotta be a good way to let people know they're going down the
> wrong path on this one.
> 
> Signed-off-by: Brian Cain <Brian.Cain at ge.com>
> 
> --- ofed/openib/scripts/install.sh      2006-12-13 14:48:51.747995000 -0700
> +++ ofed_fix/openib/scripts/install.sh  2006-12-13 14:59:00.586574000 -0700
> @@ -1070,6 +1070,14 @@
>                          echo "# Load SDP module" >> ${IB_CONF_DIR}/openib.conf
>                          echo "# SDP_LOAD=no" >> ${IB_CONF_DIR}/openib.conf
>                  fi
> +
> +
> +                if [[ "$srp" == "y" || "$srp_target" == "y" ]] &&
> +                   [[ $(egrep 'flags.*lm' /proc/cpuinfo | wc -l) > 0 ]] &&
> +                   [[ $(uname -p | egrep 'i[3-9]86' | wc -l) > 0 ]]; then
> +                   echo '!!WARNING!! SRP is not supported for 32-bit OS running on 64-bit capable hardware'
> +                fi
> +
> 
>                  if [ "$srp" == "y" ]; then
>                          echo >> ${IB_CONF_DIR}/openib.conf
> 
> --
> -Brian 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> 
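
The condition tested by the quoted patch (a /proc/cpuinfo "flags" line
that advertises "lm", i.e. long mode, together with an i[3-9]86
processor string) can be written as two small predicates. This is only
a minimal sketch in C for illustration; the function names are
hypothetical and nothing here is part of install.sh. One difference
from the egrep pattern 'flags.*lm': that pattern also matches any flag
that merely contains the letters "lm", while the sketch matches "lm"
as a whole token.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if a /proc/cpuinfo line is a "flags" line
 * that lists "lm" (long mode) as a whole, whitespace-separated token. */
bool cpuinfo_line_has_lm(const char *line)
{
	const char *p;

	if (strncmp(line, "flags", 5) != 0)
		return false;
	p = strchr(line, ':');
	if (!p)
		return false;
	p++;
	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");      /* skip separators */
		len = strcspn(p, " \t\n");    /* length of next token */
		if (len == 2 && strncmp(p, "lm", 2) == 0)
			return true;
		p += len;
	}
	return false;
}

/* Hypothetical helper: true for "i386" through "i986", the same set
 * the patch selects with egrep 'i[3-9]86' (here as an exact match). */
bool machine_is_32bit_x86(const char *machine)
{
	return strlen(machine) == 4 && machine[0] == 'i' &&
	       machine[1] >= '3' && machine[1] <= '9' &&
	       machine[2] == '8' && machine[3] == '6';
}
```

A caller would feed cpuinfo_line_has_lm() each line read from
/proc/cpuinfo and pass machine_is_32bit_x86() the machine field
returned by uname(2); whether that field matches what "uname -p"
prints on a given distro is an assumption to verify.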


From akepner at sgi.com  Wed Dec 13 16:29:52 2006
From: akepner at sgi.com (akepner at sgi.com)
Date: Wed, 13 Dec 2006 16:29:52 -0800 (PST)
Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race
Message-ID: <Pine.LNX.4.61.0612131626250.24974@localhost.localdomain>


It appears that there are races between DMA and CQ updates
which can result in incorrect behavior when CQs are allocated
in user-space (via libibverbs).

This problem affects Altix in particular, though it may exist
on other platforms as well. (We haven't really seen this
particular bug yet but, based on previous experience, it's
something that we expect to be manifested on large NUMA
systems.)


Description of the race
-----------------------

On a system such as Altix, that supports "posted DMA", DMA
may complete out of order. (This is due to possible reordering
within the NUMA-interconnect. So it's not a PCI reordering
that's being described here.)

For example, if an HCA does a DMA write to host memory and
then updates a corresponding CQE, it's possible for
the CQE update to be visible before the DMA has actually
completed.

There are a couple of mechanisms to ensure synchronization.
Either: 1) an interrupt, or 2) a write to a "consistently"
(coherently) mapped DMA address will flush in-flight DMA.

When the CQ is allocated by the device driver, mechanism 2)
will prevent the race since "dma_alloc_coherent()" is used
there. But when the CQ allocation is done in user space (via
libibverbs) there's no protection.
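
To make the window concrete, here is a minimal sketch in C of the
consumer side. It is not libibverbs or mthca code: the struct layout
and the function name are hypothetical, and the "HCA" is whatever sets
the fields from outside. The point is only that the consumer trusts
the CQE and has no way to tell whether the payload DMA has landed yet.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified completion entry; real CQE layouts differ. */
struct sketch_cqe {
	volatile uint8_t valid;   /* set by the "HCA" when the entry is ready */
	uint32_t byte_len;        /* length of the received payload */
};

/*
 * Consumer side: if the CQE looks valid, copy the payload out of the
 * receive buffer and mark the entry consumed. Nothing here checks
 * that the payload write has completed; the CQE is simply trusted.
 * Returns the number of bytes copied, or 0 if no completion is pending.
 */
uint32_t sketch_poll_and_copy(struct sketch_cqe *cqe, const uint8_t *rx_buf,
			      uint8_t *out, uint32_t out_len)
{
	uint32_t n;

	if (!cqe->valid)
		return 0;
	n = cqe->byte_len < out_len ? cqe->byte_len : out_len;
	memcpy(out, rx_buf, n);
	cqe->valid = 0;
	return n;
}
```

If the CQE update becomes visible before the payload write does, this
function returns a non-zero length while "out" still holds the old
contents of the receive buffer.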


So what to do?
--------------

Obviously mechanism 1), generating an interrupt, is not
the right solution for performance reasons.

One proposal is to add a kernel API that enables "coherent
memory" allocation (via the in-kernel DMA interface) from
user-space. Then, CQs, e.g., could be allocated via this
interface, and the race could be avoided.

Any other ideas?


-- 
Arthur


From yhkim93 at keti.re.kr  Wed Dec 13 17:16:18 2006
From: yhkim93 at keti.re.kr (김영환)
Date: Thu, 14 Dec 2006 10:16:18 +0900
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
In-Reply-To: <adapsansauk.fsf@cisco.com>
Message-ID: <20061214011631.656303B0006@sentry-two.sandia.gov>

I am building an InfiniBand storage system based on PPC, using an AMCC
440SPe Yucca board. I cross-compiled the InfiniBand source for PPC and
applied a patch to work around a shortage of coherent DMA memory. After
compiling the patched kernel source, however, I got the following error
output.

What is the problem?

 
===========================================================================
Waiting for PHY auto negotiation to complete... done
ENET Speed is 1000 Mbps - FULL duplex connection
Using ppc_4xx_eth0 device
TFTP from server 192.168.1.1; our IP address is 192.168.1.10
Filename 'yucca/uImage'.
Load address: 0x200000
Loading: T #################################################################
         #################################################################
         #################################################################
         #########################################################
done
Bytes transferred = 1289218 (13ac02 hex)
## Booting image at 00200000 ...
   Image Name:   Linux-2.6.19
   Image Type:   PowerPC Linux Kernel Image (gzip compressed)
   Data Size:    1289154 Bytes =  1.2 MB
   Load Address: 00000000
   Entry Point:  00000000
   Verifying Checksum ... OK
   Uncompressing Kernel Image ... OK
Linux version 2.6.19 (root at yhkim-devpc) (gcc version 4.0.0) #14 Thu Dec 14 09:43:16 KST 2006
PCIE:1 successfully set as rootpoint
vendor-id 0xaaa1
device-id 0xbed1
Yucca port (Roland Dreier <rolandd at cisco.com>)
Zone PFN ranges:
  DMA             0 ->   196608
  Normal     196608 ->   196608
early_node_map[1] active PFN ranges
    0:        0 ->   196608
Built 1 zonelists.  Total pages: 195072
Kernel command line: root=/dev/nfs rw nfsroot=192.168.1.1:/tftpboot/yucca/ppc_4xx ip=192.168.1.10:192.168.1.1::255.250
PID hash table entries: 4096 (order: 12, 16384 bytes)
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 776704k available (1976k kernel code, 612k data, 124k init, 0k highmem)
Mount-cache hash table entries: 512
NET: Registered protocol family 16
PCI: Probing PCI hardware
NET: Registered protocol family 2
IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
TCP established hash table entries: 131072 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 131072 bind 65536)
TCP reno registered
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
Generic RTC Driver v1.07
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at MMIO 0x0 (irq = 0) is a 16550A
serial8250: ttyS1 at MMIO 0x0 (irq = 1) is a 16550A
serial8250: ttyS2 at MMIO 0x0 (irq = 37) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
PPC 4xx OCP EMAC driver, version 3.54
mal0: initialized, 1 TX channels, 1 RX channels
eth0: emac0, MAC 00:04:ac:01:ca:fe
eth0: found CIS8201 Gigabit Ethernet PHY (0x01)
IBM IIC driver v2.1
ibm-iic0: using standard (100 kHz) mode
ibm-iic1: using standard (100 kHz) mode
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
ib_mthca: Initializing 0001:01:01.0
ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting.
ib_mthca 0001:01:01.0: BIOS or ACPI interrupt routing problem?
ib_mthca: probe of 0001:01:01.0 failed with error -16
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
eth0: link is up, 1000 FDX
IP-Config: Complete:
      device=eth0, addr=192.168.1.10, mask=255.255.255.0, gw=255.255.255.255,
     host=yucca, domain=, nis-domain=(none),
     bootserver=192.168.1.1, rootserver=192.168.1.1, rootpath=
Looking up port of RPC 100003/2 on 192.168.1.1
Looking up port of RPC 100005/1 on 192.168.1.1
VFS: Mounted root (nfs filesystem).
Freeing unused kernel memory: 124k init

 

From Brian.Cain at ge.com  Wed Dec 13 17:45:31 2006
From: Brian.Cain at ge.com (Cain, Brian (GE Healthcare))
Date: Wed, 13 Dec 2006 20:45:31 -0500
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B306D3@xmb-sjc-216.amer.cisco.com>
Message-ID: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com>

> -----Original Message-----
> From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen at cisco.com] 
> Sent: Wednesday, December 13, 2006 6:15 PM
> To: Cain, Brian (GE Healthcare); openib-general at openib.org
> Subject: RE: [openib-general] [PATCH] install.sh: Cause less 
> pain to SRP users who didn't RTFM
> 
> What problem are you seeing?  We have tested SRP on 32-bit SLES10
> running on 64-bit Opteron hardware.

We seem to get panics during multithreaded IO on the initiator.  The
panics don't seem to always point to any SRP code, but might be a
symptom of memory corruption.  It only seems to show up on 32-bit
kernels.  Our distro is a derivative of Fedora.  There were a few more
things we wanted to consider, but we stopped debugging when we saw an
indication in the release notes that it's not a supported configuration.

-Brian


From rdreier at cisco.com  Wed Dec 13 18:23:39 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 18:23:39 -0800
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com>
	(Brian Cain's message of "Wed, 13 Dec 2006 20:45:31 -0500")
References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com>
Message-ID: <ada64cfqkpg.fsf@cisco.com>

 > We seem to get panics during multithreaded IO on the initiator.  The
 > panics don't seem to always point to any SRP code, but might be a
 > symptom of memory corruption.  It only seems to show up on 32-bit
 > kernels.  Our distro is a derivative of Fedora.  There were a few more
 > things we wanted to consider, but we stopped debugging when we saw an
 > indication in the release notes that it's not a supported configuration.

I'm not sure who declared it "unsupported" and I would really like to
know what issue(s) led to that declaration.  Your report is the first
I've heard of anything like this, and I have to say that it seems
pretty implausible that running a 32-bit kernel on 64-bit-capable
hardware would be the source of problems -- if there is an issue then
I would expect it to be something to do with the 32-bit kernel.

In any case I definitely consider 32-bit kernels as something I
support, so if you could post a real bug report (what specific kernel
version, if you are running out-of-tree drivers (like OFED), host
server details, SRP target details, how to reproduce, etc) for your
problems with 32-bit kernels then I will try to debug things.

 - R.


From rdreier at cisco.com  Wed Dec 13 18:40:33 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 18:40:33 -0800
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
In-Reply-To: <5bg6vi$345u6o@sj-inbound-f.cisco.com> (
	=?iso-8859-1?Q?=B1=E8?= =?iso-8859-1?Q?=BF=B5=C8=AF?=
	<yhkim93@keti.re.kr 's message of "Thu, 14 Dec 2006 10:16:18 +0900")
References: <5bg6vi$345u6o@sj-inbound-f.cisco.com>
Message-ID: <ada1wn3qjxa.fsf@cisco.com>

What kernel are you running?  I can't find the messages:

 > PCIE:1 successfully set as rootpoint
 > vendor-id 0xaaa1
 > device-id 0xbed1

anywhere in my kernel sources.

Also, it seems you have your HCA in PCIE slot 1.  And you don't seem
to have any other PCI Express cards installed.  Is that correct?

If that is correct, then IRQ 100 should be the right IRQ, so I can't
explain why you would see

 > ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting.

Do you have any other PCI Express cards you can try?  Do they work?

What revision of Yucca/440SPe do you have?  I only have used a rev A
CPU, although I am getting my Yucca board reworked with a rev B CPU
this week.

 - R.


From k_mahesh85 at yahoo.co.in  Wed Dec 13 19:49:56 2006
From: k_mahesh85 at yahoo.co.in (keshetti mahesh)
Date: Thu, 14 Dec 2006 03:49:56 +0000 (GMT)
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <1166010208.28709.59772.camel@hal.voltaire.com>
Message-ID: <2875.47466.qm@web8317.mail.in.yahoo.com>

Thanks for your reply.

>The driver is needed to obtain the information for the IB node to fill
>in the MADs for response to the SMA query. It may also issue some traps.
>Similarly for PMA as well.

Do you mean that the HCA driver is needed to pass HCA-related information
(such as GID, GUID, port_info, etc.) to the SMA so that it can reply to
query (Get) MADs? Isn't the SMA capable of doing the same by using the
"query_(gid,pkey,port)" verbs?

A final question: if it is really required to implement 'process_mad' in
the HCA driver, then why is it not specified in the IB specification?
Whose duty is it to reply to query MADs according to the IB spec? (It is
the duty of the SMA, right?)

I have observed that process_mad is not implemented in IBM's eHCA driver.
What is the case with it?

PS: I am considering only SMA in the host s/w here.

regards,
K.Mahesh.


Hal Rosenstock <halr at voltaire.com> wrote:
On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
> Hello all,
> 
> I want to know whether it is necessary to implement process_mad
> for an HCA.
> 
> After looking into the implementations of process_mad in the ipath and
> mthca drivers, I have found that they are used to reply to MADs with
> port_info, gid_info, sm_info, etc.
> 
> But isn't it handled by SMA in the host......

The SMA can either be in the host or in firmware (as is typical with the
Mellanox silicon).

> I am a little bit confused now.
> Please just tell me whether it is required to implement process_mad
> (say) for a new HCA driver....

It is. For an example of a host (software) SMA, see
drivers/infiniband/hw/ipath/ipath_mad.c

> if it is required  why?

The driver is needed to obtain the information for the IB node to fill
in the MADs for response to the SMA query. It may also issue some traps.
Similarly for PMA as well.

-- Hal
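
As a rough illustration of the point above (the driver is what knows
the node's facts, so it is the one that fills in the reply), here is a
minimal sketch in C of a process_mad-style handler. It is not the
ipath or mthca code: the structs are heavily simplified, every name
prefixed with sk_/SK_ is hypothetical, the constant values mirror the
IB management class/method/attribute numbers as I understand them, and
the reply payload layout is made up rather than the NodeInfo wire
format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names; values follow the IB management constants. */
enum {
	SK_CLASS_SUBN_LID_ROUTED = 0x01,
	SK_CLASS_SUBN_DIRECTED   = 0x81,
	SK_METHOD_GET            = 0x01,
	SK_METHOD_GET_RESP       = 0x81,
	SK_ATTR_NODE_INFO        = 0x0011
};

/* Result bits modelled on the kernel's ib_mad_result flags. */
enum {
	SK_RESULT_FAILURE = 0,
	SK_RESULT_SUCCESS = 1 << 0,
	SK_RESULT_REPLY   = 1 << 1
};

/* Heavily simplified MAD; a real one is 256 bytes with a full header. */
struct sk_mad {
	uint8_t  mgmt_class;
	uint8_t  method;
	uint16_t attr_id;     /* host byte order in this sketch */
	uint8_t  data[64];
};

/* Device facts that only the driver knows (hypothetical subset). */
struct sk_hca {
	uint64_t node_guid;
	uint8_t  num_ports;
};

/*
 * Answer Get(NodeInfo) on the subnet management classes from driver
 * state; anything else is passed over without generating a reply.
 */
int sk_process_mad(const struct sk_hca *hca, const struct sk_mad *in,
		   struct sk_mad *out)
{
	if (in->mgmt_class != SK_CLASS_SUBN_LID_ROUTED &&
	    in->mgmt_class != SK_CLASS_SUBN_DIRECTED)
		return SK_RESULT_SUCCESS;
	if (in->method != SK_METHOD_GET || in->attr_id != SK_ATTR_NODE_INFO)
		return SK_RESULT_SUCCESS;

	*out = *in;
	out->method = SK_METHOD_GET_RESP;
	memset(out->data, 0, sizeof(out->data));
	out->data[0] = hca->num_ports;                 /* made-up layout */
	memcpy(&out->data[8], &hca->node_guid, sizeof(hca->node_guid));
	return SK_RESULT_SUCCESS | SK_RESULT_REPLY;
}
```

When the SMA lives in firmware instead, a handler of this shape would
presumably forward the MAD to the device rather than build the reply
itself.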

> Please CC your replies to me.
> 
> regards,
> K.Mahesh.



From mst at mellanox.co.il  Wed Dec 13 22:04:57 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 08:04:57 +0200
Subject: [openib-general] [PATCH] mthca: move code from post send to
	post receive
In-Reply-To: <adazm9rqvja.fsf@cisco.com>
References: <adazm9rqvja.fsf@cisco.com>
Message-ID: <20061214060457.GG1689@mellanox.co.il>

>  > While unlikely to give a large gain, this makes sense to me.
> 
> Out of curiosity -- can you measure any difference at all with this?
> I would have guessed that the addition can be scheduled so that it
> costs nothing at all on any common CPU.

I didn't actually try to measure it.
But maybe it will all add up with time as small tuning adjustments are done.

> I guess it doesn't hurt though. Want to make a similar patch for libmthca?

Sure.

-- 
MST


From mst at mellanox.co.il  Wed Dec 13 22:19:51 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 08:19:51 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061213232638.GC14186@sashak.voltaire.com>
References: <20061213232638.GC14186@sashak.voltaire.com>
Message-ID: <20061214061951.GH1689@mellanox.co.il>

> > > It is still unclear to me how long we may need this - 1.1 is still in
> > > SVN, and the 1.1 git branch is updated there.
> > 
> > By the way, one can't actually build OFED 1.1 userspace from git
> > because OFED also applies some patches after checking things out
> > from svn. They are here:
> > https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes
> 
> I guess those patches should be committed in 1.1 svn branch (and imported
> to git's 1.1).

This could be done, but why invest the time?
And once we do touch the branch, who will test that the thing you
pull from there even works?

I would say that if you really want to mirror the OFED branch,
and make it buildable to some extent, the way to do this
would be to have a single git tree with all of OFED - patches,
scripts and all.

Oh, by the way, some tools in OFED try to read an svn version
in their code; this wouldn't work with git.
And I don't see git trees for a lot of OFED bits - look at
https://openib.org/svn/gen2/branches/1.1/ofed/

What I am trying to say is, let's just keep SVN around and
do OFED 1.1 maintenance there. You can't fix the history.

> Any reason why it is not committed?

This was discussed before OFED 1.1 and seems to have worked well so far.

We tried to keep our modifications to upstream as separate as possible -
this made transition to upstream in OFED 1.2 very easy as it was trivial
to check what was applied and what wasn't.

-- 
MST


From yhkim93 at keti.re.kr  Wed Dec 13 22:49:28 2006
From: yhkim93 at keti.re.kr (김영환)
Date: Thu, 14 Dec 2006 15:49:28 +0900
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
In-Reply-To: <ada1wn3qjxa.fsf@cisco.com>
Message-ID: <20061214065005.99D203B0028@sentry-two.sandia.gov>

I used the linux-2.6.19 kernel supported by AMCC. I only installed the HCA in PCIE slot 1.
I don't have any other PCIE adapter at hand, but I will test with another one. I don't know my exact CPU type; the chip is only marked 440SPe rev 2.0, p/n PPC440SPe-RGB533C. If so, what do I have to do to compile the official release linux-2.6.19 source?

Thank you for your help.


-----Original Message-----
From: Roland Dreier [mailto:rdreier at cisco.com] 
Sent: Thursday, December 14, 2006 11:41 AM
To: 김영환
Cc: openib-general at openib.org
Subject: Re: booting problem after cross compile to ppc in infiniband source of linux-2.6.19

What kernel are you running?  I can't find the messages:

 > PCIE:1 successfully set as rootpoint
 > vendor-id 0xaaa1
 > device-id 0xbed1

anywhere in my kernel sources.

Also, it seems you have your HCA in PCIE slot 1.  And you don't seem
to have any other PCI Express cards installed.  Is that correct?

If that is correct, then IRQ 100 should be the right IRQ, so I can't
explain why you would see

 > ib_mthca 0001:01:01.0: NOP command failed to generate interrupt (IRQ 100), aborting.

Do you have any other PCI Express cards you can try?  Do they work?

What revision of Yucca/440SPe do you have?  I only have used a rev A
CPU, although I am getting my Yucca board reworked with a rev B CPU
this week.

 - R.


From eitan at sw053.yok.mtl.com  Wed Dec 13 23:11:11 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Thu, 14 Dec 2006 09:11:11 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal completion
Message-ID: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = ____  
ibutils rev = ____  
Total=264 Pass=261 Fail=3

Pass:
36 Stability IS1-16.topo
36 Pkey IS1-16.topo
36 Multicast IS1-16.topo
36 LidMgr IS1-16.topo
35 OsmStress IS1-16.topo
12 Stability IS3-loop.topo
12 Stability IS3-128.topo
12 Pkey IS3-128.topo
12 OsmStress IS3-128.topo
12 Multicast IS3-loop.topo
11 Multicast IS3-128.topo
11 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo
1 Multicast IS3-128.topo
1 LidMgr IS3-128.topo


From rdreier at cisco.com  Wed Dec 13 23:32:49 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 23:32:49 -0800
Subject: [openib-general] booting problem after cross compile to ppc in
 infiniband source of linux-2.6.19
In-Reply-To: <5g725m$m2afa@sj-inbound-a.cisco.com> (
	=?iso-8859-1?Q?=EA=B9?= =?iso-8859-1?Q?=80=EC=98=81=ED=99=98?=
	<yhkim93@keti.re.kr 's message of "Thu, 14 Dec 2006 15:49:28 +0900")
References: <5g725m$m2afa@sj-inbound-a.cisco.com>
Message-ID: <adahcvzortq.fsf@cisco.com>


 > I used the linux-2.6.19 kernel supported by AMCC. I only installed the HCA in PCIE slot 1.
 > I don't have any other PCIE adapter at hand, but I will test with another one. I don't know my exact CPU type; the chip is only marked 440SPe rev 2.0, p/n PPC440SPe-RGB533C. If so, what do I have to do to compile the official release linux-2.6.19 source?

If AMCC is supporting your kernel you might want to ask them for help.

The part number "RGB" means you have a rev B part (the "B").  You can
also look in /proc/cpuinfo to see the PVR to find out what chip you
have.

To use the standard 2.6.19 kernel on a rev B chip you need the patches
I posted to linuxppc-embedded a few weeks ago.  Otherwise PCIe won't
work.  The patches aren't merged upstream yet because they still need
some cleanup.

 - R.


From rdreier at cisco.com  Wed Dec 13 23:41:41 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Wed, 13 Dec 2006 23:41:41 -0800
Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race
In-Reply-To: <Pine.LNX.4.61.0612131626250.24974@localhost.localdomain> (
	akepner@sgi.com's message of "Wed, 13 Dec 2006 16:29:52 -0800 (PST)")
References: <Pine.LNX.4.61.0612131626250.24974@localhost.localdomain>
Message-ID: <ada8xhaq5ze.fsf@cisco.com>

Are there other possible ordering problems involving user memory (not
in a CQ or QP)?  Something like a CPU on node A writing to memory on
node B and then posting a work request that makes the HCA DMA from
that memory on node B, and having the work request doorbell reach the
HCA before the write to node B actually happens, so the HCA DMAs the
old contents of node B's memory?

I guess the only feasible solution to the problem you're pointing out
is to have libmthca use some special mmap()-based allocator for queues
so that the kernel can give it memory that has the special
dma_alloc_coherent() treatment.

Ugh.

 - R.


From ogerlitz at voltaire.com  Thu Dec 14 00:19:13 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 14 Dec 2006 10:19:13 +0200
Subject: [openib-general] [was: GIT PULL]
In-Reply-To: <adavekfqvhd.fsf@cisco.com>
References: <ada8xhctztu.fsf@cisco.com> <457FB82B.4090902@voltaire.com>
	<adavekfqvhd.fsf@cisco.com>
Message-ID: <45810901.3090209@voltaire.com>

Roland Dreier wrote:
>> you have CC-ed lkml at cisco.com on this email, is there a chance you
>> wanted to CC linux-kernel at vger.kernel.org instead ...

> Yep, a typo caused by my auto-expand not triggering.  No big deal though...

Indeed, I see now that Linus has pulled it

>> May I ask what prevented v3 of the mthca profile patch (see
>> http://article.gmane.org/gmane.linux.drivers.openib/34005) from getting in?

> The patch as posted is both ugly and wrong.  I still plan to fix it up
> and merge it for 2.6.20, but I didn't get a chance yet.

Hmm, I understood that all the comments raised during the review were
addressed in the V3 post below, and now you say it's both wrong and
ugly... For example, what's wrong here?

Or.

> Adds module parameters that enable setting some of the HCA
> profile values
> Signed-off-by: Leonid Arsh <leonida at voltaire.com>
> Signed-off-by: Moni Shoua <monis at voltaire.com>
> ---
>  mthca_main.c |  115 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 104 insertions(+), 11 deletions(-)
> diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
> index 47ea021..deb0289 100644
> --- a/drivers/infiniband/hw/mthca/mthca_main.c
> +++ b/drivers/infiniband/hw/mthca/mthca_main.c
> @@ -82,21 +82,110 @@ MODULE_PARM_DESC(tune_pci, "increase PCI
>  
>  struct mutex mthca_device_mutex;
>  
> +#define MTHCA_DEFAULT_NUM_QP            (1 << 16)
> +#define MTHCA_DEFAULT_RDB_PER_QP        (1 << 2)
> +#define MTHCA_DEFAULT_NUM_CQ            (1 << 16)
> +#define MTHCA_DEFAULT_NUM_MCG           (1 << 13)
> +#define MTHCA_DEFAULT_NUM_MPT           (1 << 17)
> +#define MTHCA_DEFAULT_NUM_MTT           (1 << 20)
> +#define MTHCA_DEFAULT_NUM_UDAV          (1 << 15)
> +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18)
> +#define MTHCA_DEFAULT_NUM_UARC_SIZE     (1 << 18)
> +
> +static struct mthca_profile default_profile = {
> +	.num_qp             = MTHCA_DEFAULT_NUM_QP,
> +	.rdb_per_qp         = MTHCA_DEFAULT_RDB_PER_QP,
> +	.num_cq             = MTHCA_DEFAULT_NUM_CQ,
> +	.num_mcg            = MTHCA_DEFAULT_NUM_MCG,
> +	.num_mpt            = MTHCA_DEFAULT_NUM_MPT,
> +	.num_mtt            = MTHCA_DEFAULT_NUM_MTT,
> +	.num_udav           = MTHCA_DEFAULT_NUM_UDAV,          /* Tavor only */
> +	.fmr_reserved_mtts  = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */
> +	.uarc_size          = MTHCA_DEFAULT_NUM_UARC_SIZE,     /* Arbel only */
> +};
> +
> +module_param_named(num_qp, default_profile.num_qp, int, 0444);
> +MODULE_PARM_DESC(num_qp, "maximum number of available QPs per HCA");
> +
> +module_param_named(rdb_per_qp, default_profile.rdb_per_qp, int, 0444);
> +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP");
> +
> +module_param_named(num_cq, default_profile.num_cq, int, 0444);
> +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA");
> +
> +module_param_named(num_mcg, default_profile.num_mcg, int, 0444);
> +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA");
> +
> +module_param_named(num_mpt, default_profile.num_mpt, int, 0444);
> +MODULE_PARM_DESC(num_mpt, 
> +		"maximum number of memory protection table entries per HCA");
> +
> +module_param_named(num_mtt, default_profile.num_mtt, int, 0444);
> +MODULE_PARM_DESC(num_mtt,
> +		 "maximum number of memory translation table segments per HCA");
> +/* Tavor only */
> +module_param_named(num_udav, default_profile.num_udav, int, 0444);
> +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA");
> +
> +/* Tavor only */
> +module_param_named(fmr_reserved_mtts, default_profile.fmr_reserved_mtts, int, 0444);
> +MODULE_PARM_DESC(fmr_reserved_mtts,
> +		 "number of memory translation table segments reserved for FMR");
> +
>  static const char mthca_version[] __devinitdata =
>  	DRV_NAME ": Mellanox InfiniBand HCA driver v"
>  	DRV_VERSION " (" DRV_RELDATE ")\n";
>  
> -static struct mthca_profile default_profile = {
> -	.num_qp		   = 1 << 16,
> -	.rdb_per_qp	   = 4,
> -	.num_cq		   = 1 << 16,
> -	.num_mcg	   = 1 << 13,
> -	.num_mpt	   = 1 << 17,
> -	.num_mtt	   = 1 << 20,
> -	.num_udav	   = 1 << 15,	/* Tavor only */
> -	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
> -	.uarc_size	   = 1 << 18,	/* Arbel only */
> -};
> +
> +static int __devinit mthca_check_profile_value(int* pval, int pval_default){
> +	/* value must be positive and power of 2 */
> +	int old_pval = *pval;
> +
> +	if (old_pval <= 0)
> +		*pval = pval_default;
> +	else
> +		*pval = roundup_pow_of_two(old_pval);
> +
> +	return old_pval-*pval;
> +}
> +
> +#define mthca_check_profile_and_warn(name, var, defval) \
> +	if (mthca_check_profile_value(&var, defval)) \
> +		mthca_warn(mdev, "invalid %s passed. changed to %d.\n", #name, var); 
> +
> +static int __devinit mthca_validate_profile(struct mthca_dev *mdev,
> +                                            struct mthca_profile *profile)
> +{
> +
> +	mthca_check_profile_and_warn(num_qp, default_profile.num_qp,
> +						 MTHCA_DEFAULT_NUM_QP);
> +	mthca_check_profile_and_warn(rdb_per_qp, default_profile.rdb_per_qp,
> +						 MTHCA_DEFAULT_RDB_PER_QP);
> +	mthca_check_profile_and_warn(num_cq, default_profile.num_cq,
> +						 MTHCA_DEFAULT_NUM_CQ);
> +	mthca_check_profile_and_warn(num_mcg, default_profile.num_mcg,
> +						 MTHCA_DEFAULT_NUM_MCG);
> +	mthca_check_profile_and_warn(num_mpt, default_profile.num_mpt,
> +						 MTHCA_DEFAULT_NUM_MPT);
> +	mthca_check_profile_and_warn(num_mtt, default_profile.num_mtt,
> +						 MTHCA_DEFAULT_NUM_MTT);
> +
> +	if (!mthca_is_memfree(mdev)) {
> +		mthca_check_profile_and_warn(num_udav, default_profile.num_udav,
> +							 MTHCA_DEFAULT_NUM_UDAV);
> +		mthca_check_profile_and_warn(fmr_reserved_mtts, default_profile.fmr_reserved_mtts,
> +							 MTHCA_DEFAULT_NUM_RESERVED_MTTS);
> +
> +		if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) {
> +			mthca_err(mdev, "Invalid fmr_reserved_mtts parameter "
> +					  "value (%d). Must be lower than num_mtt (%d)\n",
> +					  default_profile.fmr_reserved_mtts,
> +					  default_profile.num_mtt ); 
> +			return -EINVAL;
> +		}
> +	}
> +	return 0;
> +}
>  
>  static int __devinit mthca_tune_pci(struct mthca_dev *mdev)
>  {
> @@ -1084,6 +1173,10 @@ static int __mthca_init_one(struct pci_d
>  	if (err)
>  		goto err_cmd;
>  
> +	err = mthca_validate_profile(mdev, &default_profile);
> +	if (err)
> +		goto err_cmd;
> +
>  	err = mthca_init_hca(mdev);
>  	if (err)
>  		goto err_cmd;


From yosefe at voltaire.com  Thu Dec 14 02:19:16 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Thu, 14 Dec 2006 12:19:16 +0200
Subject: [openib-general] ofed backports update
In-Reply-To: <20061211144813.GA15870@mellanox.co.il>
References: <20061211144813.GA15870@mellanox.co.il>
Message-ID: <1166091556.926.17.camel@muscida>

On Mon, 2006-12-11 at 16:48 +0200, Michael S. Tsirkin wrote:
> Here's a small update on OFED 1.2 backports. This describes a change
> I did a couple of weeks ago but never got to documenting.
> NOTE: This info is relevant only for people developing OFED kernel code,
> everything is transparent for others.
> 
> NOTE: This is by *no means* a comprehensive writeup of OFED build process -
> just a small update for people familiar with development in OFED 1.1.
> 
> Background:
> OFED 1.1 did all backports by applying patches under
> kernel_patches/backports/<kernel version>/ directory.
> To back-port a package, you just stuck a patch there
> and once OFED detected an appropriate kernel, it was applied before build.
> In many cases - where the kernel we are back-porting to was simply
> missing some macro - what patch actually did was just add a file
> under the include directory, and OFED build scripts knew to pick
> these up before standard linux includes.
> Managing these became somewhat of a pain as it is often hard to
> see the history of a patch: try git diff on a patch that sits in git tree
> and see what I mean.
> 
> Update:
> So for OFED 1.2 I've created a new directory kernel_addons, and converted
> all patches that created new files to plain files under the relevant
> kernel directory.  OFED scripts now look there for files before standard
> Linux headers.
> For an example, look at how backport to 2.6.18 looks:
> http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD
> Unfortunately, not all patches are of this form - some really tweak source
> inside the infiniband subtree - but we can strive to reduce the number of this
> and in this way make maintaining backports more of a seamless process.
> 
> Bottom line
> There are now 2 mechanisms for back-porting in OFED:
> - if you want to add a kernel-specific file, stick it under
>   kernel_addons/backport/<kernel-version>/.
> - if you must change an existing file depending on kernel version, stick
>   a patch in kernel_patches/backports/<kernel version>/.
> 

I was running the ‘configure’ script under ofed root.

In ofed 1.1, it is possible to run configure without flags to patch the
sources, and then run it again with --without-patches and the desired
flags.

In ofed 1.2 (Vlad's tree) this scenario causes compilation errors when
running 'make' afterwards (on 2.6.9-34ELsmp and on 2.6.16.21-0.8, but NOT
on 2.6.19).

However, when I just ran configure on a fresh source, with all the
desired flags, it worked just fine.

It seems to happen because configure adds the kernel_addons include path
only to the Makefiles of the selected components.

Maybe it should patch all Makefiles, or copy the files to ./include?


_______________________________________________________________
Yosef Etigin, ib-host-stack
Voltaire – The Grid Backbone
www.voltaire.com


From tziporet at dev.mellanox.co.il  Thu Dec 14 03:30:24 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 14 Dec 2006 13:30:24 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <45804021.9050209@hp.com>
References: <0b8901c71ed3$e9b9f740$0281a8c0@ebpc>
 <45804021.9050209@hp.com>
Message-ID: <458135D0.6090100@dev.mellanox.co.il>

Philippe Bernadat wrote:
> Roland,
>
> Attached are the two lspci outputs.
>
> The only differences I see are:
>
> [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
> 1d0
> < pcilib: Resource 5 in /sys/bus/pci/devices/0000:00:1f.1/resource has 
> a 64-bit address, ignoring
> 40c39
> < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> ---
> > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> [philippe at hamish o2ib]$
>
Have you tried running with

options ib_mthca tune_pci=1

Tziporet


From halr at voltaire.com  Thu Dec 14 04:12:04 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 07:12:04 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
	completion
In-Reply-To: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
Message-ID: <1166098306.28709.122104.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = ____  
> ibutils rev = ____  
> Total=264 Pass=261 Fail=3
> 
> Pass:
> 36 Stability IS1-16.topo
> 36 Pkey IS1-16.topo
> 36 Multicast IS1-16.topo
> 36 LidMgr IS1-16.topo
> 35 OsmStress IS1-16.topo
> 12 Stability IS3-loop.topo
> 12 Stability IS3-128.topo
> 12 Pkey IS3-128.topo
> 12 OsmStress IS3-128.topo
> 12 Multicast IS3-loop.topo
> 11 Multicast IS3-128.topo
> 11 LidMgr IS3-128.topo
> 
> Failures:
> 1 OsmStress IS1-16.topo
> 1 Multicast IS3-128.topo
> 1 LidMgr IS3-128.topo

There are now 2 more failures. You had previously explained OsmStress
failure as needing more investigation. Now there is a Multicast and
LidMgr failure yet nothing really changed since the previous run the
night before. Are these new tests ? What were the failures ?

The repetitions have also been reduced from previous reports. Are these
the same or different tests ?

-- Hal


From philippe_bernadat at hp.com  Thu Dec 14 04:24:04 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 13:24:04 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <458135D0.6090100@dev.mellanox.co.il>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net>


> Have you tried running with
> 
> options ib_mthca tune_pci=1
> 

My understanding is that this is not required anymore with OFED-1.1. It
used to make a significant difference with OFED-1.0, but I didn't
observe it with OFED-1.1.

And again, the user mode performance is comparable between VIB and OFED.

Philippe

> -----Original Message-----
> From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] 
> Sent: Thursday, December 14, 2006 12:30 PM
> To: Bernadat, Philippe
> Cc: Eric Barton; Roland Dreier; Matt Leininger; 
> openib-general at openib.org; Bernadat, Philippe
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire
> 
> Philippe Bernadat wrote:
> > Roland,
> >
> > Attached are the two lspci outputs.
> >
> > The only differences I see are:
> >
> > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
> > 1d0
> > < pcilib: Resource 5 in 
> /sys/bus/pci/devices/0000:00:1f.1/resource has 
> > a 64-bit address, ignoring
> > 40c39
> > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> > ---
> > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> > [philippe at hamish o2ib]$
> >
> Have you tried running with
> 
> options ib_mthca tune_pci=1
> 
> Tziporet
> 
> 


From ishai at dev.mellanox.co.il  Thu Dec 14 04:24:15 2006
From: ishai at dev.mellanox.co.il (ishai)
Date: Thu, 14 Dec 2006 14:24:15 +0200
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <ada64cfqkpg.fsf@cisco.com>
References: <2376B63A5AF8564F8A2A2D76BC6DB03301BBF03D@CINMLVEM11.e2k.ad.ge.com>
	<ada64cfqkpg.fsf@cisco.com>
Message-ID: <4581426F.2060106@dev.mellanox.co.il>

Hi Roland,

SRP was tested on a 32-bit operating system running on a 32-bit platform
and on a 64-bit OS, and there are no known problems.
In the interoperability tests done at UNH-IOL in September we found
that SRP on a 32-bit operating system running on a 64-bit platform
causes crashes. (It was tested on RHEL4-U3.)
Since we did not have enough time to solve this problem before the
release, and since we think that this combination (32-bit OS on 64-bit
platform) is less common, we treat this issue as low priority.

The remark in the release notes indicates that SRP does not work on this
combination.

Ishai

Roland Dreier wrote:

> > We seem to get panics during multithreaded IO on the initiator.  The
> > panics don't seem to always point to any SRP code, but might be a
> > symptom of memory corruption.  It only seems to show up on 32-bit
> > kernels.  Our distro is a derivative of Fedora.  There were a few more
> > things we wanted to consider, but we stopped debugging when we saw an
> > indication in the release notes that it's not a supported configuration.
>
>I'm not sure who declared it "unsupported" and I would really like to
>know what issue(s) led to that declaration.  Your report is the first
>I've heard of anything like this, and I have to say that it seems
>pretty implausible that running a 32-bit kernel on 64-bit-capable
>hardware would be the source of problems -- if there is an issue then
>I would expect it to be something to do with the 32-bit kernel.
>
>In any case I definitely consider 32-bit kernels as something I
>support, so if you could post a real bug report (what specific kernel
>version, if you are running out-of-tree drivers (like OFED), host
>server details, SRP target details, how to reproduce, etc) for your
>problems with 32-bit kernels then I will try to debug things.
>
> - R.
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>  
>


From monil at voltaire.com  Thu Dec 14 04:35:24 2006
From: monil at voltaire.com (Moni Levy)
Date: Thu, 14 Dec 2006 14:35:24 +0200
Subject: [openib-general] [openfabrics-ewg] OFED release testing Task
	Force
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C19AA561@mtiexch01.mti.com>
References: <AccOZDPhtR1IamyBTDWbLBwb21Hvbw==>
	<1E3DCD1C63492545881FACB6063A57C19AA561@mtiexch01.mti.com>
Message-ID: <6a122cc00612140435k55b4e177se9c58279d7444603@mail.gmail.com>

Nimrod,
On 11/22/06, Nimrod Gindi <nimrodg at mellanox.com> wrote:
>
>
>
> Hi,
>
> As a follow-up on the presentation prepared and presented by Amit Krig and
> my-self in the OFA Meeting during SC06 I'm sending out this e-mail as a call
> for participation.
>
> The targets of the Ad-hoc task force will be (as agreed upon in the session
> we had): unify the test results formats, define release quality criteria,
> define/assign ULP verification owners and enhance interoperability
> finger-print in the release process.
>
>
>
> We would like to have a participant from each contributing company and would
> appreciate any response sent to me with a name of a person from the company
> to attend and take action on behalf of this task force.

 I'm sorry for the late reply. Yosi (yosefe at voltaire.com) and I will
be happy to join.

-- Moni

>
> BTW: I've also attached the presentation that was given in the OFA meeting.
>
>  <<OFED testing session.pps>>
>
> Happy Holidays to every one,
>
>
>
> Nimrod  Gindi
>
> Mellanox Technologies Ltd.
>
> mail  :  nimrodg at mellanox.com
>
> Cell  :  +1-408-750-4801
>
> Office:  +1-347-342-0011
>
> Fax   :  +1-212-987-0275
>
>
> _______________________________________________
> openfabrics-ewg mailing list
> openfabrics-ewg at openib.org
> http://openib.org/mailman/listinfo/openfabrics-ewg
>
>
>
>


From mst at mellanox.co.il  Thu Dec 14 04:46:29 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 14:46:29 +0200
Subject: [openib-general] [PATCH] mthca: save low memory used for reserved
	objects
Message-ID: <20061214124629.GB24840@mellanox.co.il>

We never need to allocate memory for reserved objects in low memory.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

I noticed this obvious optimization when going over the icm allocation code.

Index: linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ linux-2.6/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -313,8 +313,7 @@ struct mthca_icm_table *mthca_alloc_icm_
 			chunk_size = nobj * obj_size - i * MTHCA_TABLE_CHUNK_SIZE;
 
 		table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT,
-						(use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
-						__GFP_NOWARN);
+						 GFP_HIGHUSER | __GFP_NOWARN);
 		if (!table->icm[i])
 			goto err;
 		if (mthca_MAP_ICM(dev, table->icm[i], virt + i * MTHCA_TABLE_CHUNK_SIZE,

-- 
MST


From eitan at mellanox.co.il  Thu Dec 14 05:32:12 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Dec 2006 15:32:12 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
 completion
In-Reply-To: <1166098306.28709.122104.camel@hal.voltaire.com>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
	<1166098306.28709.122104.camel@hal.voltaire.com>
Message-ID: <4581525C.9060104@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>   
>> OSM Simulation Regression Summary
>> OpenSM rev = ____  
>> ibutils rev = ____  
>> Total=264 Pass=261 Fail=3
>>
>> Pass:
>> 36 Stability IS1-16.topo
>> 36 Pkey IS1-16.topo
>> 36 Multicast IS1-16.topo
>> 36 LidMgr IS1-16.topo
>> 35 OsmStress IS1-16.topo
>> 12 Stability IS3-loop.topo
>> 12 Stability IS3-128.topo
>> 12 Pkey IS3-128.topo
>> 12 OsmStress IS3-128.topo
>> 12 Multicast IS3-loop.topo
>> 11 Multicast IS3-128.topo
>> 11 LidMgr IS3-128.topo
>>
>> Failures:
>> 1 OsmStress IS1-16.topo
>> 1 Multicast IS3-128.topo
>> 1 LidMgr IS3-128.topo
>>     
>
> There are now 2 more failures. You had previously explained OsmStress
> failure as needing more investigation. Now there is a Multicast and
> LidMgr failure yet nothing really changed since the previous run the
> night before. Are these new tests ? What were the failures ?
>   
The tests use random seeds and thus can catch other bugs in each run.
I am investigating these failures. Some might be due to bugs in the 
checker code too.

Please note that the failure rate is low (LidMgr passed 36+11 runs and
failed 1 test).
This implies the bug is a hard one to find.
> The repetitions have also been reduced from previous reports. Are these
> the same or different tests ?
>   
The number of repetitions depends on runtime. The regression started later
and thus ran fewer iterations.
I run the "same" tests ("same" means the same code, not the same random sequence).
> -- Hal
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From swise at opengridcomputing.com  Thu Dec 14 05:52:33 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:52:33 -0600
Subject: [openib-general] [PATCH  v4 00/13] 2.6.20 Chelsio T3 RDMA Driver
Message-ID: <20061214135233.21159.78613.stgit@dell3.ogc.int>


Roland, 

I think this is ready to go once the ethernet driver is pulled in.

Version 4 changes:

- Cleaned up spacing in the Kconfig file
- Remove locking.txt file - it's not needed
- Remove -O1 from the debug config option
- BugFix: support new LLD interface for dual-port adapters

Version 3 changes:

- BugFix: Don't use mutex inside of the mmap function.
- BugFix: Move QP to TERMINATE when TERMINATE AE is processed
- Support the new work queue design
- Merged up to linus's tree as of 12/8/2006
- Misc nits

Version 2 changes:

- Make code sparse endian clean
- Use IDRs for mapping QP and CQ IDs to structure pointers instead
  of arrays
- Clean up confusing bitfields
- Use random32() instead of local random function
- Use krefs to track endpoint reference counts
- Misc nits

-----

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
Ethernet driver which is also under review now for 2.6.20. 

The latest Chelsio T3 Ethernet driver patch can be pulled from:

	http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

	git://staging.openfabrics.org/~swise/cxgb3.git

Thanks,

Steve.


From swise at opengridcomputing.com  Thu Dec 14 05:53:05 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:53:05 -0600
Subject: [openib-general] [PATCH  v4 01/13] Linux RDMA Core Changes
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135303.21159.61880.stgit@dell3.ogc.int>


Support provider-specific data in ib_uverbs_cmd_req_notify_cq().
The Chelsio iwarp provider library needs to pass information to the
kernel verb for re-arming the CQ.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/core/uverbs_cmd.c      |    9 +++++++--
 drivers/infiniband/hw/amso1100/c2.h       |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c    |    3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c    |    3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c    |    4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c    |    6 ++++--
 drivers/infiniband/hw/mthca/mthca_dev.h   |    4 ++--
 include/rdma/ib_verbs.h                   |    5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 				int out_len)
 {
 	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_udata		      udata;
 	struct ib_cq                  *cq;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 	if (!cq)
 		return -EINVAL;
 
-	ib_req_notify_cq(cq, cmd.solicited_only ?
-			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+	INIT_UDATA(&udata, buf + sizeof cmd, 0,
+		   in_len - sizeof cmd, 0); 
+
+	cq->device->req_notify_cq(cq, cmd.solicited_only ?
+				  IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+				  &udata);
 
 	put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..9a76869 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+	      struct ib_udata *udata)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: user data 
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index 8039f6e..0d39960 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..15cbd49 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -722,7 +722,8 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
 {
 	__be32 doorbell[2];
 
@@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index fe5cecf..6b9ccf6 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8eacc35..e3e1a2c 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -941,7 +941,8 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify cq_notify,
+						    struct ib_udata *udata);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, cq_notify, NULL);
 }
 
 /**
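
For readers following the interface change, here is a self-contained
userspace sketch of the calling convention the patch introduces. The types
below are simplified stand-ins, not the real kernel definitions, and
example_arm_cq() and user_req_notify_cq() are hypothetical names. It shows
the one thing a provider has to keep in mind: in-kernel callers go through
the wrapper and pass NULL, so udata must be checked before use.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for struct ib_udata, enum ib_cq_notify, etc. */
struct udata {
	const void *inbuf;
	size_t inlen;
};

enum cq_notify { CQ_SOLICITED, CQ_NEXT_COMP };

struct cq;

struct device_ops {
	int (*req_notify_cq)(struct cq *cq, enum cq_notify notify,
			     struct udata *udata);
};

struct cq {
	const struct device_ops *ops;
	int armed;
	size_t last_udata_len;
};

/* Hypothetical provider hook: tolerates a NULL udata. */
static int example_arm_cq(struct cq *cq, enum cq_notify notify,
			  struct udata *udata)
{
	(void)notify;
	cq->armed = 1;
	cq->last_udata_len = udata ? udata->inlen : 0;
	return 0;
}

/* In-kernel style wrapper: no provider-specific data, so pass NULL. */
static int req_notify_cq(struct cq *cq, enum cq_notify notify)
{
	assert(cq && cq->ops && cq->ops->req_notify_cq);
	return cq->ops->req_notify_cq(cq, notify, NULL);
}

/* Userspace-command style path: hand the trailing bytes to the provider. */
static int user_req_notify_cq(struct cq *cq, int solicited_only,
			      const void *extra, size_t extra_len)
{
	struct udata udata = { extra, extra_len };

	assert(cq && cq->ops && cq->ops->req_notify_cq);
	return cq->ops->req_notify_cq(cq, solicited_only ?
				      CQ_SOLICITED : CQ_NEXT_COMP, &udata);
}
```

Providers that do not care about the extra data can simply ignore the new
argument, which is what the existing drivers in the patch do.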


From swise at opengridcomputing.com  Thu Dec 14 05:53:35 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:53:35 -0600
Subject: [openib-general] [PATCH v4 02/13] Device Discovery and ULLD Linkage
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135335.21159.79371.stgit@dell3.ogc.int>


Code to discover all the T3 devices and register them 
with the T3 RDMA Core and the Linux RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch.c |  189 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +++++++++++++++++++++++++++++++++
 2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 0000000..acbe449
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+#include "iwch_user.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+
+#define DRV_VERSION "1.1"
+
+MODULE_AUTHOR("Boyd Faulkner, Steve Wise");
+MODULE_DESCRIPTION("Chelsio T3 RDMA Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+
+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+
+static void open_rnic_dev(struct t3cdev *);
+static void close_rnic_dev(struct t3cdev *);
+
+struct cxgb3_client t3c_client = {
+	.name = "iw_cxgb3",
+	.add = open_rnic_dev,
+	.remove = close_rnic_dev,
+	.handlers = t3c_handlers,
+	.redirect = iwch_ep_redirect
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_mutex);
+
+static void rnic_init(struct iwch_dev *rnicp)
+{
+	PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
+	idr_init(&rnicp->cqidr);
+	idr_init(&rnicp->qpidr);
+	idr_init(&rnicp->mmidr);
+	spin_lock_init(&rnicp->lock);
+
+	rnicp->attr.vendor_id = 0x168;
+	rnicp->attr.vendor_part_id = 7;
+	rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
+	rnicp->attr.max_wrs = (1UL << 24) - 1;
+	rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
+	rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
+	rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
+	rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
+	rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev);
+	rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
+	rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
+	rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;	/* 4KB-128MB */
+	rnicp->attr.can_resize_wq = 0;
+	rnicp->attr.max_rdma_reads_per_qp = 8;
+	rnicp->attr.max_rdma_read_resources =
+	    rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
+	rnicp->attr.max_rdma_read_qp_depth = 8;	/* IRD */
+	rnicp->attr.max_rdma_read_depth =
+	    rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
+	rnicp->attr.rq_overflow_handled = 0;
+	rnicp->attr.can_modify_ird = 0;
+	rnicp->attr.can_modify_ord = 0;
+	rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
+	rnicp->attr.stag0_value = 1;
+	rnicp->attr.zbva_support = 1;
+	rnicp->attr.local_invalidate_fence = 1;
+	rnicp->attr.cq_overflow_detection = 1;
+	return;
+}
+
+static void open_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *rnicp;
+	static int vers_printed;
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	if (!vers_printed++) 
+		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+		       DRV_VERSION);
+	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
+	if (!rnicp) {
+		printk(KERN_ERR MOD "Cannot allocate ib device\n");
+		return;
+	}
+	rnicp->rdev.ulp = rnicp;
+	rnicp->rdev.t3cdev_p = tdev;
+
+	if (cxio_rdev_open(&rnicp->rdev)) {
+		printk(KERN_ERR MOD "Unable to open CXIO rdev\n");
+		ib_dealloc_device(&rnicp->ibdev);
+		return;
+	}
+
+	rnic_init(rnicp);
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&rnicp->entry, &dev_list);
+	mutex_unlock(&dev_mutex);
+
+	if (iwch_register_device(rnicp)) {
+		printk(KERN_ERR MOD "Unable to register device\n");
+		close_rnic_dev(tdev);
+		return;
+	}
+	printk(KERN_INFO MOD "Initialized device %s\n",
+	       pci_name(rnicp->rdev.rnic_info.pdev));
+}
+
+static void close_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *dev, *tmp;
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	mutex_lock(&dev_mutex);
+	list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
+		if (dev->rdev.t3cdev_p == tdev) {
+			list_del(&dev->entry);
+			iwch_unregister_device(dev);
+			cxio_rdev_close(&dev->rdev);
+			idr_destroy(&dev->cqidr);
+			idr_destroy(&dev->qpidr);
+			idr_destroy(&dev->mmidr);
+			ib_dealloc_device(&dev->ibdev);
+			break;
+		}
+	}
+	mutex_unlock(&dev_mutex);
+}
+
+extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb);
+
+static int __init iwch_init_module(void)
+{
+	int err;
+
+	err = cxio_hal_init();
+	if (err) 
+		return err;
+	err = iwch_cm_init();
+	if (err) 
+		return err;
+	cxio_register_ev_cb(iwch_ev_dispatch);
+	cxgb3_register_client(&t3c_client);
+	return 0;
+}
+
+static void __exit iwch_exit_module(void)
+{
+	cxgb3_unregister_client(&t3c_client);
+	cxio_unregister_ev_cb(iwch_ev_dispatch);
+	iwch_cm_term();
+	cxio_hal_exit();
+}
+
+module_init(iwch_init_module);
+module_exit(iwch_exit_module);
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
new file mode 100644
index 0000000..752b6ad
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -0,0 +1,175 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_H__
+#define __IWCH_H__
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/idr.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+
+struct iwch_pd;
+struct iwch_cq;
+struct iwch_qp;
+struct iwch_mr;
+
+struct iwch_rnic_attributes {
+	u32 vendor_id;
+	u32 vendor_part_id;
+	u32 max_qps;
+	u32 max_wrs;				/* Max for any SQ/RQ */
+	u32 max_sge_per_wr;
+	u32 max_sge_per_rdma_write_wr;	/* for RDMA Write WR */
+	u32 max_cqs;
+	u32 max_cqes_per_cq;
+	u32 max_mem_regs;
+	u32 max_phys_buf_entries;		/* for phys buf list */
+	u32 max_pds;
+
+	/*
+	 * The memory page sizes supported by this RNIC.
+	 * Bit position i in bitmap indicates page of
+	 * size 4KB << i.  Phys block list mode unsupported.
+	 */
+	u32 mem_pgsizes_bitmask;
+	u8 can_resize_wq;
+
+	/*
+	 * The maximum number of RDMA Reads that can be outstanding 
+	 * per QP with this RNIC as the target. 
+	 */
+	u32 max_rdma_reads_per_qp;
+
+	/*
+	 * The maximum number of resources used for RDMA Reads
+	 * by this RNIC with this RNIC as the target. 
+	 */
+	u32 max_rdma_read_resources;
+
+	/*
+	 * The max depth per QP for initiation of RDMA Read
+	 * by this RNIC.  
+	 */
+	u32 max_rdma_read_qp_depth;
+
+	/*
+	 * The maximum depth for initiation of RDMA Read 
+	 * operations by this RNIC on all QPs 
+	 */
+	u32 max_rdma_read_depth;
+	u8 rq_overflow_handled;
+	u32 can_modify_ird;
+	u32 can_modify_ord;
+	u32 max_mem_windows;
+	u32 stag0_value;
+	u8 zbva_support;
+	u8 local_invalidate_fence;
+	u32 cq_overflow_detection;
+};
+
+struct iwch_dev {
+	struct ib_device ibdev;
+	struct cxio_rdev rdev;
+	u32 device_cap_flags;
+	struct iwch_rnic_attributes attr;
+	struct idr cqidr;
+	struct idr qpidr;
+	struct idr mmidr;
+	spinlock_t lock;
+	struct list_head entry;
+};
+
+static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct iwch_dev, ibdev);
+}
+
+static inline int t3b_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3B);
+}
+
+static inline int t3a_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3A);
+}
+
+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid)
+{
+	return idr_find(&rhp->cqidr, cqid);
+}
+
+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid)
+{
+	return idr_find(&rhp->qpidr, qpid);
+}
+
+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid)
+{
+	return idr_find(&rhp->mmidr, mmid);
+}
+
+static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr, 
+				void *handle, u32 id)
+{
+	int ret;
+	int newid;
+
+	do {
+		if (!idr_pre_get(idr, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		spin_lock_irq(&rhp->lock);
+		ret = idr_get_new_above(idr, handle, id, &newid);
+		BUG_ON(!ret && newid != id);
+		spin_unlock_irq(&rhp->lock);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
+{
+	spin_lock_irq(&rhp->lock);
+	idr_remove(idr, id);
+	spin_unlock_irq(&rhp->lock);
+}
+
+extern struct cxgb3_client t3c_client;
+extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+#endif


From swise at opengridcomputing.com  Thu Dec 14 05:54:05 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:54:05 -0600
Subject: [openib-general] [PATCH v4 03/13] Provider Methods and Data
	Structures
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135405.21159.5811.stgit@dell3.ogc.int>


Provider methods to support the Linux RDMA verbs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  363 ++++++++
 drivers/infiniband/hw/cxgb3/iwch_user.h     |   68 ++
 3 files changed, 1602 insertions(+), 0 deletions(-)
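
Note for reviewers: the queue-depth arithmetic in iwch_create_cq() and
iwch_create_qp(), and the permission-bit reversal used when filling in
attr.perms, can be checked outside the kernel.  The sketch below is only
an illustration and is not part of the patch.  roundup_pow2(),
log2_pow2(), cq_depth(), rq_depth() and tpt_perms() are hypothetical
userspace stand-ins written for this note; they assume the kernel's
roundup_pow_of_two() and ilog2() behave as documented for nonzero
inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's roundup_pow_of_two(); n must be nonzero. */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	assert(n > 0 && n <= 0x80000000u);
	while (p < n)
		p <<= 1;
	return p;
}

/* Stand-in for ilog2() when n is an exact power of two. */
static unsigned int log2_pow2(uint32_t n)
{
	unsigned int l = 0;

	while (n > 1) {
		n >>= 1;
		l++;
	}
	return l;
}

/* CQ depth as in iwch_create_cq(): T3A reserves 16 extra CQEs. */
static uint32_t cq_depth(uint32_t entries, int is_t3a)
{
	if (is_t3a)
		entries += 16;
	return roundup_pow2(entries);
}

/* RQ depth as in iwch_create_qp(): power of two > max_recv_wr, min 16. */
static uint32_t rq_depth(uint32_t max_recv_wr)
{
	uint32_t rqsize = roundup_pow2(max_recv_wr);

	if (rqsize == max_recv_wr)
		rqsize = roundup_pow2(max_recv_wr + 1);
	if (rqsize < 16)
		rqsize = 16;
	return rqsize;
}

/* Reverse the low four access bits, as done for mhp->attr.perms. */
static uint32_t tpt_perms(uint32_t acc)
{
	return ((acc & 0x1) << 3) | ((acc & 0x2) << 1) |
	       ((acc & 0x4) >> 1) | ((acc & 0x8) >> 3);
}
```

For example, a request for 128 CQEs yields a depth of 128 on T3B but 256
on T3A, and a request for 16 receive WRs yields an RQ depth of 32, of
which rqsize - 1 entries are reported back in cap.max_recv_wr.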

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
new file mode 100644
index 0000000..e9721b1
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ethtool.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include <cxio_hal.h>
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+#include "iwch_user.h"
+
+static int iwch_modify_port(struct ib_device *ibdev,
+			    u8 port, int port_modify_mask,
+			    struct ib_port_modify *props)
+{
+	return -ENOSYS;
+}
+
+static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
+				    struct ib_ah_attr *ah_attr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static int iwch_ah_destroy(struct ib_ah *ah)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_process_mad(struct ib_device *ibdev,
+			    int mad_flags,
+			    u8 port_num,
+			    struct ib_wc *in_wc,
+			    struct ib_grh *in_grh,
+			    struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	return -ENOSYS;
+}
+
+static int iwch_dealloc_ucontext(struct ib_ucontext *context)
+{
+	struct iwch_dev *rhp = to_iwch_dev(context->device);
+	struct iwch_ucontext *ucontext = to_iwch_ucontext(context);
+	PDBG("%s context %p\n", __FUNCTION__, context);
+	cxio_release_ucontext(&rhp->rdev, &ucontext->uctx);
+	kfree(ucontext);
+	return 0;
+}
+
+static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev,
+					struct ib_udata *udata)
+{
+	struct iwch_ucontext *context;
+	struct iwch_dev *rhp = to_iwch_dev(ibdev);
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+	cxio_init_ucontext(&rhp->rdev, &context->uctx);
+	INIT_LIST_HEAD(&context->mmaps);
+	spin_lock_init(&context->mmap_lock);
+	return &context->ibucontext;
+}
+
+static int iwch_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct iwch_cq *chp;
+
+	PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq);
+	chp = to_iwch_cq(ib_cq);
+
+	remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid);
+	atomic_dec(&chp->refcnt);
+	wait_event(chp->wait, !atomic_read(&chp->refcnt));
+
+	cxio_destroy_cq(&chp->rhp->rdev, &chp->cq);
+	kfree(chp);
+	return 0;
+}
+
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+			     struct ib_ucontext *context,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	struct iwch_create_cq_resp uresp;
+
+	PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries);
+	rhp = to_iwch_dev(ibdev);
+	chp = kzalloc(sizeof(*chp), GFP_KERNEL);
+	if (!chp)
+		return ERR_PTR(-ENOMEM);
+
+	if (t3a_device(rhp)) {
+
+		/*
+		 * T3A: Add some fluff to handle extra CQEs inserted 
+	 	 * for various errors.
+		 * Additional CQE possibilities:
+		 *      TERMINATE,
+		 *      incoming RDMA WRITE Failures
+		 *      incoming RDMA READ REQUEST FAILUREs
+		 * NOTE: We cannot ensure the CQ won't overflow.
+		 */
+		entries += 16; 
+	}
+	entries = roundup_pow_of_two(entries);
+	chp->cq.size_log2 = ilog2(entries);
+
+	if (cxio_create_cq(&rhp->rdev, &chp->cq)) {
+		kfree(chp);
+		return ERR_PTR(-ENOMEM);
+	}
+	chp->rhp = rhp;
+	chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1;
+	spin_lock_init(&chp->lock);
+	atomic_set(&chp->refcnt, 1);
+	init_waitqueue_head(&chp->wait);
+	insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid);
+
+	if (context) {
+		struct iwch_mm_entry *mm;
+
+		mm = kmalloc(sizeof *mm, GFP_KERNEL);
+		if (!mm) {
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-ENOMEM);
+		}
+		uresp.cqid = chp->cq.cqid;
+		uresp.size_log2 = chp->cq.size_log2;
+		uresp.physaddr = virt_to_phys(chp->cq.queue);
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm);
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-EFAULT);
+		}
+		mm->addr = uresp.physaddr;
+		mm->len = PAGE_ALIGN((1UL << uresp.size_log2) * 
+					     sizeof (struct t3_cqe));
+		insert_mmap(to_iwch_ucontext(context), mm);
+	}
+	PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n",
+	     chp->cq.cqid, chp, (1 << chp->cq.size_log2), 
+	     (u64)chp->cq.dma_addr);
+	return &chp->ibcq;
+}
+
+static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
+{
+	struct iwch_cq *chp = to_iwch_cq(cq);
+	struct t3_cq oldcq, newcq;
+	int ret;
+
+	PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe);
+
+	/* We don't downsize... */
+	if (cqe <= cq->cqe)
+		return 0;
+
+	/* create new t3_cq with new size */
+	cqe = roundup_pow_of_two(cqe+1);
+	newcq.size_log2 = ilog2(cqe);
+
+	/* Don't allow resize to less than the current CQE count */
+	if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) {
+		return -ENOMEM;
+	}
+
+	/* Quiesce all QPs using this CQ */
+	ret = iwch_quiesce_qps(chp);
+	if (ret) {
+		return ret;
+	}
+
+	ret = cxio_create_cq(&chp->rhp->rdev, &newcq);
+	if (ret) {
+		/* chp still owns the live CQ, so it must not be freed here */
+		return ret;
+	}
+	
+	/* copy CQEs */
+	memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) * 
+				        sizeof(struct t3_cqe));
+
+	/* old iwch_cq gets new t3_cq but keeps old cqid */
+	oldcq = chp->cq;
+	chp->cq = newcq;
+	chp->cq.cqid = oldcq.cqid;
+
+	/* resize new t3_cq to update the HW context */
+	ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq);
+	if (ret) {
+		chp->cq = oldcq;
+		return ret;
+	}
+	chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1;
+
+	/* destroy old t3_cq */
+	oldcq.cqid = newcq.cqid;
+	ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq);
+	if (ret) {
+		printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n", 
+			__FUNCTION__, ret);
+	}
+	
+	/* add user hooks here */
+
+	/* resume qps */
+	ret = iwch_resume_qps(chp);
+	return ret;
+}
+
+static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, 
+		       struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	enum t3_cq_opcode cq_op;
+	int err;
+	unsigned long flag;
+	struct iwch_req_notify_cq ucmd;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+	if (notify == IB_CQ_SOLICITED)
+		cq_op = CQ_ARM_SE;
+	else
+		cq_op = CQ_ARM_AN;
+	if (udata && t3b_device(rhp)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+			return -EFAULT;
+		spin_lock_irqsave(&chp->lock, flag);
+		chp->cq.rptr = ucmd.rptr;
+	} else
+		spin_lock_irqsave(&chp->lock, flag);
+	PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr);
+	err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0);
+	spin_unlock_irqrestore(&chp->lock, flag);
+	if (err) 
+		printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err, 
+		       chp->cq.cqid);
+	return err;
+}
+
+static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	int len = vma->vm_end - vma->vm_start;
+	u64 pgaddr = (u64) vma->vm_pgoff << PAGE_SHIFT;
+	struct cxio_rdev *rdev_p;
+	int ret = 0;
+	struct iwch_mm_entry *mm;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff, 
+	     pgaddr, len);
+
+	if (vma->vm_start & (PAGE_SIZE-1)) {
+		return -EINVAL;
+	}
+
+	rdev_p = &(to_iwch_dev(context->device)->rdev);
+	ucontext = to_iwch_ucontext(context);
+
+	mm = remove_mmap(ucontext, pgaddr, len);
+	if (!mm)
+		return -EINVAL;
+	kfree(mm);
+
+	if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) && 
+	    (pgaddr < (rdev_p->rnic_info.udbell_physbase + 
+		       rdev_p->rnic_info.udbell_len))) {
+
+		/*
+		 * Map T3 DB register.
+		 */
+		if (vma->vm_flags & VM_READ) {
+			return -EPERM;
+		}
+
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+		vma->vm_flags &= ~VM_MAYREAD;
+		ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	} else {
+
+		/*
+		 * Map WQ or CQ contig dma memory...
+		 */
+		ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	}
+	
+	return ret;
+}
+
+static int iwch_deallocate_pd(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid);
+	cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid);
+	kfree(php);
+	return 0;
+}
+
+static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev,
+			       struct ib_ucontext *context,
+			       struct ib_udata *udata)
+{
+	struct iwch_pd *php;
+	u32 pdid;
+	struct iwch_dev *rhp;
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	rhp = to_iwch_dev(ibdev);
+	pdid = cxio_hal_get_pdid(rhp->rdev.rscp);
+	if (!pdid)
+		return ERR_PTR(-EINVAL);
+	php = kzalloc(sizeof(*php), GFP_KERNEL);
+	if (!php) {
+		cxio_hal_put_pdid(rhp->rdev.rscp, pdid);
+		return ERR_PTR(-ENOMEM);
+	}
+	php->pdid = pdid;
+	php->rhp = rhp;
+	if (context) {
+		if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) {
+			iwch_deallocate_pd(&php->ibpd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+	PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php);
+	return &php->ibpd;
+}
+ 
+static int iwch_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mr *mhp;
+	u32 mmid;
+
+	PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr);
+	/* There can be no memory windows */
+	if (atomic_read(&ib_mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(ib_mr);
+	rhp = mhp->rhp;
+	mmid = mhp->attr.stag >> 8;
+	cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size, 
+		       mhp->attr.pbl_addr);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	if (mhp->kva)
+		kfree((void *) (unsigned long) mhp->kva);
+	PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd,
+					struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					int acc,
+					u64 *iova_start)
+{
+	__be64 *page_list;
+	int shift;
+	u64 total_size;
+	int npages;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	int ret;
+		
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+
+	acc = iwch_convert_access(acc);
+
+	
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start,
+			 	   &total_size, &npages, &shift, &page_list);
+	if (ret) 
+		goto err;
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+
+	/* NOTE: TPT perms are backwards from BIND WR perms! */
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+
+	mhp->attr.va_fbo = *iova_start;
+	mhp->attr.page_size = shift - 12;
+
+	mhp->attr.len = (u32) total_size;
+	mhp->attr.pbl_size = npages;
+	ret = iwch_register_mem(rhp, php, mhp, shift, page_list);
+	kfree(page_list);
+	if (ret) {
+		goto err;
+	}
+	return &mhp->ibmr;
+err:
+	kfree(mhp);
+	return ERR_PTR(ret);
+	
+}
+
+static int iwch_reregister_phys_mem(struct ib_mr *mr, 
+				     int mr_rereg_mask,
+				     struct ib_pd *pd,
+				     struct ib_phys_buf *buffer_list,
+				     int num_phys_buf,
+				     int acc, u64 *iova_start)
+{
+
+	struct iwch_mr mh, *mhp;
+	struct iwch_pd *php;
+	struct iwch_dev *rhp;
+	int new_acc;
+	__be64 *page_list = NULL;
+	int shift = 0;
+	u64 total_size;
+	int npages = 0;
+	int ret;
+
+	PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd);
+
+	/* There can be no memory windows */
+	if (atomic_read(&mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(mr);
+	rhp = mhp->rhp;
+	php = to_iwch_pd(mr->pd);
+
+	/* make sure we are on the same adapter */
+	if (rhp != php->rhp)
+		return -EINVAL;
+
+	new_acc = mhp->attr.perms;
+
+	memcpy(&mh, mhp, sizeof *mhp);
+
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		php = to_iwch_pd(pd);
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mh.attr.perms = iwch_convert_access(acc);
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		ret = build_phys_page_list(buffer_list, num_phys_buf,
+					   iova_start, &total_size,
+					   &npages, &shift, &page_list);
+		if (ret)
+			return ret;
+	}
+	ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages);
+	kfree(page_list);
+	if (ret)
+		return ret;
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		mhp->attr.pdid = php->pdid;
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mhp->attr.perms = mh.attr.perms;
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		mhp->attr.zbva = 0;
+		mhp->attr.va_fbo = *iova_start;
+		mhp->attr.page_size = shift - 12;
+		mhp->attr.len = (u32) total_size;
+		mhp->attr.pbl_size = npages;
+	}
+
+	return 0;	
+}
+
+
+struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				      int acc, struct ib_udata *udata)
+{
+	__be64 *pages;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	struct iwch_reg_user_mr_resp uresp;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	shift = ffs(region->page_size) - 1;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	acc = iwch_convert_access(acc);
+
+	i = n = 0;
+
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] = cpu_to_be64(sg_dma_address(
+					&chunk->page_list[j]) +
+					region->page_size * k);
+			}
+		}
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+	mhp->attr.va_fbo = region->virt_base;
+	mhp->attr.page_size = shift - 12;
+	mhp->attr.len = (u32) region->length;
+	mhp->attr.pbl_size = i;
+	err = iwch_register_mem(rhp, php, mhp, shift, pages);
+	kfree(pages);
+	if (err)
+		goto err;
+
+	if (udata && t3b_device(rhp)) {
+		uresp.pbl_addr = (mhp->attr.pbl_addr -
+				  rhp->rdev.rnic_info.pbl_base) >> 3;
+		PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__, 
+		     uresp.pbl_addr);
+			
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			iwch_dereg_mr(&mhp->ibmr);
+			/* iwch_dereg_mr() has already freed mhp */
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
+	return &mhp->ibmr;
+
+err:
+	kfree(mhp);
+	return ERR_PTR(err);
+}
+
+struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva;
+	struct ib_mr *ibmr;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+
+	/*
+	 * T3 only supports 32 bits of size.
+	 */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	kva = 0;
+	ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva);
+	return ibmr;
+}
+
+struct ib_mw *iwch_alloc_mw(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mw *mhp;
+	u32 mmid;
+	u32 stag = 0;
+	int ret;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+	ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid);
+	if (ret) {
+		kfree(mhp);
+		return ERR_PTR(ret);
+	}
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.type = TPT_MW;
+	mhp->attr.stag = stag;
+	mmid = (stag) >> 8;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag);
+	return &(mhp->ibmw);
+}
+
+int iwch_dealloc_mw(struct ib_mw *mw)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	u32 mmid;
+
+	mhp = to_iwch_mw(mw);
+	rhp = mhp->rhp;
+	mmid = (mw->rkey) >> 8;
+	cxio_deallocate_window(&rhp->rdev, mhp->attr.stag);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static int iwch_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_qp_attributes attrs;
+	struct iwch_ucontext *ucontext;
+
+	qhp = to_iwch_qp(ib_qp);
+	rhp = qhp->rhp;
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
+	}
+	wait_event(qhp->wait, !qhp->ep);
+
+	remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid);
+
+	atomic_dec(&qhp->refcnt);
+	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
+
+	ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context) 
+				  : NULL;
+	cxio_destroy_qp(&rhp->rdev, &qhp->wq, 
+			ucontext ? &ucontext->uctx : &rhp->rdev.uctx);
+
+	PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__, 
+	     ib_qp, qhp->wq.qpid, qhp);
+	kfree(qhp);
+	return 0;
+}
+
+static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
+			     struct ib_qp_init_attr *attrs,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_pd *php;
+	struct iwch_cq *schp;
+	struct iwch_cq *rchp;
+	struct iwch_create_qp_resp uresp;
+	int wqsize, sqsize, rqsize;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	if (attrs->qp_type != IB_QPT_RC) 
+		return ERR_PTR(-EINVAL);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
+	rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid);
+	if (!schp || !rchp)
+		return ERR_PTR(-EINVAL);
+
+	/* The RQT size must be # of entries + 1 rounded up to a power of two */
+	rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr);
+	if (rqsize == attrs->cap.max_recv_wr)
+		rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1);
+
+	/* T3 doesn't support RQT depth < 16 */
+	if (rqsize < 16)
+		rqsize = 16;
+
+	if (rqsize > T3_MAX_RQ_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	/* 
+	 * NOTE: The SQ and total WQ sizes don't need to be
+	 * a power of two.  However, all the code assumes 
+	 * they are. EG: Q_FREECNT() and friends.
+	 */
+	sqsize = roundup_pow_of_two(attrs->cap.max_send_wr);
+	wqsize = roundup_pow_of_two(rqsize + sqsize);
+	PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__, 
+	     wqsize, sqsize, rqsize);
+	qhp = kzalloc(sizeof(*qhp), GFP_KERNEL);
+	if (!qhp)
+		return ERR_PTR(-ENOMEM);
+	qhp->wq.size_log2 = ilog2(wqsize);
+	qhp->wq.rq_size_log2 = ilog2(rqsize);
+	qhp->wq.sq_size_log2 = ilog2(sqsize);
+	ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL;
+	if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq,
+			   ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) {
+		kfree(qhp);
+		return ERR_PTR(-ENOMEM);
+	}
+	attrs->cap.max_recv_wr = rqsize - 1;
+	attrs->cap.max_send_wr = sqsize;
+	qhp->rhp = rhp;
+	qhp->attr.pd = php->pdid;
+	qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid;
+	qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid;
+	qhp->attr.sq_num_entries = attrs->cap.max_send_wr;
+	qhp->attr.rq_num_entries = attrs->cap.max_recv_wr;
+	qhp->attr.sq_max_sges = attrs->cap.max_send_sge;
+	qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge;
+	qhp->attr.rq_max_sges = attrs->cap.max_recv_sge;
+	qhp->attr.state = IWCH_QP_STATE_IDLE;
+	qhp->attr.next_state = IWCH_QP_STATE_IDLE;
+
+	/* 
+	 * XXX - These don't get passed in from the openib user
+ 	 * at create time.  The CM sets them via a QP modify.
+	 * Need to fix...  I think the CM should 
+	 */
+	qhp->attr.enable_rdma_read = 1;
+	qhp->attr.enable_rdma_write = 1;
+	qhp->attr.enable_bind = 1;
+	qhp->attr.max_ord = 1;
+	qhp->attr.max_ird = 1;
+
+	spin_lock_init(&qhp->lock);
+	init_waitqueue_head(&qhp->wait);
+	atomic_set(&qhp->refcnt, 1);
+	insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid);
+
+	if (udata) {
+
+		struct iwch_mm_entry *mm1, *mm2;
+
+		mm1 = kmalloc(sizeof *mm1, GFP_KERNEL);
+		if (!mm1) {
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		mm2 = kmalloc(sizeof *mm2, GFP_KERNEL);
+		if (!mm2) {
+			kfree(mm1);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+			
+		uresp.qpid = qhp->wq.qpid;
+		uresp.size_log2 = qhp->wq.size_log2;
+		uresp.sq_size_log2 = qhp->wq.sq_size_log2;
+		uresp.rq_size_log2 = qhp->wq.rq_size_log2;
+		uresp.physaddr = virt_to_phys(qhp->wq.queue);
+		uresp.doorbell = qhp->wq.udb;
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm1);
+			kfree(mm2);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-EFAULT);
+		}
+		mm1->addr = uresp.physaddr;
+		mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr));
+		insert_mmap(ucontext, mm1);
+		mm2->addr = uresp.doorbell & PAGE_MASK;
+		mm2->len = PAGE_SIZE;
+		insert_mmap(ucontext, mm2);
+	}
+	qhp->ibqp.qp_num = qhp->wq.qpid;
+	init_timer(&(qhp->timer));
+	PDBG("%s sq_num_entries %d, rq_num_entries %d "
+	     "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n",
+	     __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries,
+	     qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2);
+	return (&qhp->ibqp);
+}
+
+static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	enum iwch_qp_attr_mask mask = 0;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp);
+
+	/* iwarp does not support the RTR state */
+	if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR))
+		attr_mask &= ~IB_QP_STATE;
+
+	/* Make sure we still have something left to do */
+	if (!attr_mask)
+		return 0;
+
+	memset(&attrs, 0, sizeof attrs);
+	qhp = to_iwch_qp(ibqp);
+	rhp = qhp->rhp;
+
+	attrs.next_state = iwch_convert_state(attr->qp_state);
+	attrs.enable_rdma_read = (attr->qp_access_flags & 
+			       IB_ACCESS_REMOTE_READ) ?  1 : 0;
+	attrs.enable_rdma_write = (attr->qp_access_flags & 
+				IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
+	attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0;
+
+
+	mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0;
+	mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ? 
+			(IWCH_QP_ATTR_ENABLE_RDMA_READ |
+			 IWCH_QP_ATTR_ENABLE_RDMA_WRITE | 
+			 IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0;
+
+	return iwch_modify_qp(rhp, qhp, mask, &attrs, 0);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	atomic_inc(&(to_iwch_qp(qp)->refcnt));
+}
+
+void iwch_qp_rem_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt)))
+		wake_up(&(to_iwch_qp(qp)->wait));
+}
+
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn)
+{
+	PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn);
+	return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn);
+}
+
+
+static int iwch_query_pkey(struct ib_device *ibdev,
+			   u8 port, u16 index, u16 * pkey)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	*pkey = 0;
+	return 0;
+}
+
+static int iwch_query_gid(struct ib_device *ibdev, u8 port,
+			  int index, union ib_gid *gid)
+{
+	struct iwch_dev *dev;
+
+	PDBG("%s ibdev %p, port %d, index %d, gid %p\n",
+	       __FUNCTION__, ibdev, port, index, gid);
+	dev = to_iwch_dev(ibdev);
+	BUG_ON(port == 0 || port > 2);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6);
+	return 0;
+}
+
+static int iwch_query_device(struct ib_device *ibdev,
+			     struct ib_device_attr *props)
+{
+
+	struct iwch_dev *dev;
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+
+	dev = to_iwch_dev(ibdev);
+	memset(props, 0, sizeof *props);
+	memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	props->device_cap_flags = dev->device_cap_flags;
+	props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor;
+	props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device;
+	props->max_mr_size = ~0ull;
+	props->max_qp = dev->attr.max_qps;
+	props->max_qp_wr = dev->attr.max_wrs;
+	props->max_sge = dev->attr.max_sge_per_wr;
+	props->max_sge_rd = 1;
+	props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp;
+	props->max_cq = dev->attr.max_cqs;
+	props->max_cqe = dev->attr.max_cqes_per_cq;
+	props->max_mr = dev->attr.max_mem_regs;
+	props->max_pd = dev->attr.max_pds;
+	props->local_ca_ack_delay = 0;
+
+	return 0;
+}
+
+static int iwch_query_port(struct ib_device *ibdev,
+			   u8 port, struct ib_port_attr *props)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_SNMP_TUNNEL_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_DEVICE_MGMT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 2;
+	props->active_speed = 2;
+	props->max_msg_sz = -1;
+
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.fw_version);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.driver);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor,
+		                       dev->rdev.rnic_info.pdev->device);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *iwch_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+int iwch_register_device(struct iwch_dev *dev)
+{
+	int ret;
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->device_cap_flags =
+	    (IB_DEVICE_ZERO_STAG |
+	     IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW);
+
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
+	dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
+	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.query_device = iwch_query_device;
+	dev->ibdev.query_port = iwch_query_port;
+	dev->ibdev.modify_port = iwch_modify_port;
+	dev->ibdev.query_pkey = iwch_query_pkey;
+	dev->ibdev.query_gid = iwch_query_gid;
+	dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext;
+	dev->ibdev.mmap = iwch_mmap;
+	dev->ibdev.alloc_pd = iwch_allocate_pd;
+	dev->ibdev.dealloc_pd = iwch_deallocate_pd;
+	dev->ibdev.create_ah = iwch_ah_create;
+	dev->ibdev.destroy_ah = iwch_ah_destroy;
+	dev->ibdev.create_qp = iwch_create_qp;
+	dev->ibdev.modify_qp = iwch_ib_modify_qp;
+	dev->ibdev.destroy_qp = iwch_destroy_qp;
+	dev->ibdev.create_cq = iwch_create_cq;
+	dev->ibdev.destroy_cq = iwch_destroy_cq;
+	dev->ibdev.resize_cq = iwch_resize_cq;
+	dev->ibdev.poll_cq = iwch_poll_cq;
+	dev->ibdev.get_dma_mr = iwch_get_dma_mr;
+	dev->ibdev.reg_phys_mr = iwch_register_phys_mem;
+	dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem;
+	dev->ibdev.reg_user_mr = iwch_reg_user_mr;
+	dev->ibdev.dereg_mr = iwch_dereg_mr;
+	dev->ibdev.alloc_mw = iwch_alloc_mw;
+	dev->ibdev.bind_mw = iwch_bind_mw;
+	dev->ibdev.dealloc_mw = iwch_dealloc_mw;
+
+	dev->ibdev.attach_mcast = iwch_multicast_attach;
+	dev->ibdev.detach_mcast = iwch_multicast_detach;
+	dev->ibdev.process_mad = iwch_process_mad;
+
+	dev->ibdev.req_notify_cq = iwch_arm_cq;
+	dev->ibdev.post_send = iwch_post_send;
+	dev->ibdev.post_recv = iwch_post_receive;
+
+
+	dev->ibdev.iwcm = kmalloc(sizeof(struct iw_cm_verbs), GFP_KERNEL);
+	if (!dev->ibdev.iwcm)
+		return -ENOMEM;
+	dev->ibdev.iwcm->connect = iwch_connect;
+	dev->ibdev.iwcm->accept = iwch_accept_cr;
+	dev->ibdev.iwcm->reject = iwch_reject_cr;
+	dev->ibdev.iwcm->create_listen = iwch_create_listen;
+	dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen;
+	dev->ibdev.iwcm->add_ref = iwch_qp_add_ref;
+	dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref;
+	dev->ibdev.iwcm->get_qp = iwch_get_qp;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		goto bail1;
+
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       iwch_class_attributes[i]);
+		if (ret)
+			goto bail2;
+	}
+	return 0;
+bail2:
+	ib_unregister_device(&dev->ibdev);
+bail1:
+	kfree(dev->ibdev.iwcm);
+	return ret;
+}
+
+void iwch_unregister_device(struct iwch_dev *dev)
+{
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i)
+		class_device_remove_file(&dev->ibdev.class_dev,
+					 iwch_class_attributes[i]);
+	ib_unregister_device(&dev->ibdev);
+	kfree(dev->ibdev.iwcm);
+}
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h
new file mode 100644
index 0000000..4d98886
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h
@@ -0,0 +1,363 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_PROVIDER_H__
+#define __IWCH_PROVIDER_H__
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <rdma/ib_verbs.h>
+#include <asm/types.h>
+#include "t3cdev.h"
+#include "iwch.h"
+#include "cxio_wr.h"
+#include "cxio_hal.h"
+
+struct iwch_pd {
+	struct ib_pd ibpd;
+	u32 pdid;
+	struct iwch_dev *rhp;
+};
+
+static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct iwch_pd, ibpd);
+}
+
+struct tpt_attributes {
+	u32 stag;
+	u32 state:1;
+	u32 type:2;
+	u32 rsvd:1;
+	enum tpt_mem_perm perms;
+	u32 remote_invaliate_disable:1;
+	u32 zbva:1;
+	u32 mw_bind_enable:1;
+	u32 page_size:5;
+
+	u32 pdid;
+	u32 qpid;
+	u32 pbl_addr;
+	u32 len;
+	u64 va_fbo;
+	u32 pbl_size;
+};
+
+struct iwch_mr {
+	struct ib_mr ibmr;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+typedef struct iwch_mw iwch_mw_handle;
+
+static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct iwch_mr, ibmr);
+}
+
+struct iwch_mw {
+	struct ib_mw ibmw;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw)
+{
+	return container_of(ibmw, struct iwch_mw, ibmw);
+}
+
+struct iwch_cq {
+	struct ib_cq ibcq;
+	struct iwch_dev *rhp;
+	struct t3_cq cq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+};
+
+static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct iwch_cq, ibcq);
+}
+
+enum IWCH_QP_FLAGS {
+	QP_QUIESCED = 0x01
+};
+
+struct iwch_mpa_attributes {
+	u8 recv_marker_enabled;
+	u8 xmit_marker_enabled;	/* iWARP: enable inbound Read Resp. */
+	u8 crc_enabled;
+	u8 version;	/* 0 or 1 */
+};
+
+struct iwch_qp_attributes {
+	u32 scq;
+	u32 rcq;
+	u32 sq_num_entries;
+	u32 rq_num_entries;
+	u32 sq_max_sges;
+	u32 sq_max_sges_rdma_write;
+	u32 rq_max_sges;
+	u32 state;
+	u8 enable_rdma_read;
+	u8 enable_rdma_write;	/* enable inbound Read Resp. */
+	u8 enable_bind;
+	u8 enable_mmid0_fastreg;	/* Enable STAG0 + Fast-register */
+	/*
+	 * next_state holds the next QP state. If it equals the current
+	 * state, only the QP attributes will be modified.
+	 */
+	u32 max_ord;
+	u32 max_ird;
+	u32 pd;	/* IN */
+	u32 next_state;
+	char terminate_buffer[52];
+	u32 terminate_msg_len;
+	u8 is_terminate_local;
+	struct iwch_mpa_attributes mpa_attr;	/* IN-OUT */
+	struct iwch_ep *llp_stream_handle;
+	char *stream_msg_buf;	/* Last stream msg. before Idle -> RTS */
+	u32 stream_msg_buf_len;	/* Only on Idle -> RTS */
+};
+
+struct iwch_qp {
+	struct ib_qp ibqp;
+	struct iwch_dev *rhp;
+	struct iwch_ep *ep;
+	struct iwch_qp_attributes attr;
+	struct t3_wq wq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+	enum IWCH_QP_FLAGS flags;
+	struct timer_list timer;
+};
+
+static inline int qp_quiesced(struct iwch_qp *qhp)
+{
+	return (qhp->flags & QP_QUIESCED);
+}
+
+static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct iwch_qp, ibqp);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp);
+void iwch_qp_rem_ref(struct ib_qp *qp);
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn);
+
+struct iwch_ucontext {
+	struct ib_ucontext ibucontext;
+	struct cxio_ucontext uctx;
+	spinlock_t mmap_lock;
+	struct list_head mmaps;
+};
+
+static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c)
+{
+	return container_of(c, struct iwch_ucontext, ibucontext);
+}
+
+struct iwch_mm_entry {
+	struct list_head entry;
+	u64 addr;
+	unsigned len;
+};
+
+static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext,
+						u64 addr, unsigned len)
+{
+	struct list_head *pos, *nxt;
+	struct iwch_mm_entry *mm;
+
+	spin_lock_irq(&ucontext->mmap_lock);
+	list_for_each_safe(pos, nxt, &ucontext->mmaps) {
+
+		mm = list_entry(pos, struct iwch_mm_entry, entry);
+		if (mm->addr == addr && mm->len == len) {
+			list_del_init(&mm->entry);
+			spin_unlock_irq(&ucontext->mmap_lock);
+			PDBG("%s addr 0x%llx len %u\n", __FUNCTION__, mm->addr,
+			     mm->len);
+			return mm;
+		}
+	}
+	spin_unlock_irq(&ucontext->mmap_lock);
+	return NULL;
+}
+
+static inline void insert_mmap(struct iwch_ucontext *ucontext,
+			       struct iwch_mm_entry *mm)
+{
+	spin_lock_irq(&ucontext->mmap_lock);
+	PDBG("%s addr 0x%llx len %u\n", __FUNCTION__, mm->addr, mm->len);
+	list_add_tail(&mm->entry, &ucontext->mmaps);
+	spin_unlock_irq(&ucontext->mmap_lock);
+}
+
+enum iwch_qp_attr_mask {
+	IWCH_QP_ATTR_NEXT_STATE = 1 << 0,
+	IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7,
+	IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8,
+	IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9,
+	IWCH_QP_ATTR_MAX_ORD = 1 << 11,
+	IWCH_QP_ATTR_MAX_IRD = 1 << 12,
+	IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22,
+	IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23,
+	IWCH_QP_ATTR_MPA_ATTR = 1 << 24,
+	IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25,
+	IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ |
+				     IWCH_QP_ATTR_ENABLE_RDMA_WRITE |
+				     IWCH_QP_ATTR_MAX_ORD |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+				     IWCH_QP_ATTR_STREAM_MSG_BUFFER |
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE)
+};
+
+int iwch_modify_qp(struct iwch_dev *rhp,
+				struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal);
+
+enum iwch_qp_state {
+	IWCH_QP_STATE_IDLE,
+	IWCH_QP_STATE_RTS,
+	IWCH_QP_STATE_ERROR,
+	IWCH_QP_STATE_TERMINATE,
+	IWCH_QP_STATE_CLOSING,
+	IWCH_QP_STATE_TOT
+};
+
+static inline int iwch_convert_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+	case IB_QPS_INIT:
+		return IWCH_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return IWCH_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return IWCH_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return IWCH_QP_STATE_TERMINATE;
+	case IB_QPS_ERR:
+		return IWCH_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+enum iwch_mem_perms {
+	IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0,
+	IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1,
+	IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2,
+	IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3,
+	IWCH_MEM_ACCESS_ATOMICS = 1 << 4,
+	IWCH_MEM_ACCESS_BINDING = 1 << 5,
+	IWCH_MEM_ACCESS_LOCAL =
+	    (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE),
+	IWCH_MEM_ACCESS_REMOTE =
+	    (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ)
+	    /* cannot go beyond 1 << 31 */
+} __attribute__ ((packed));
+
+static inline u32 iwch_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0)
+	    | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) |
+	    (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) |
+	    IWCH_MEM_ACCESS_LOCAL_READ;
+}
+
+enum iwch_mmid_state {
+	IWCH_STAG_STATE_VALID,
+	IWCH_STAG_STATE_INVALID
+};
+
+enum iwch_qp_query_flags {
+	IWCH_QP_QUERY_CONTEXT_NONE = 0x0,	/* No ctx; Only attrs */
+	IWCH_QP_QUERY_CONTEXT_GET = 0x1,	/* Get ctx + attrs */
+	IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2,	/* Not Supported */
+
+	/*
+	 * Quiesce QP context; Consumer
+	 * will NOT replay outstanding WR
+	 */
+	IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4,
+	IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8,
+	IWCH_QP_QUERY_TEST_USERWRITE = 0x32	/* Test special */
+};
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr);
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr);
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind);
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list);
+
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif
diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h
new file mode 100644
index 0000000..4e4b9c9
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_user.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_USER_H__
+#define __IWCH_USER_H__
+
+#define IWCH_UVERBS_ABI_VERSION	1
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct iwch_create_cq_resp {
+	__u64 physaddr;
+	__u32 cqid;
+	__u32 size_log2;
+};
+
+struct iwch_create_qp_resp {
+	__u64 physaddr;
+	__u64 doorbell;
+	__u32 qpid;
+	__u32 size_log2;
+	__u32 sq_size_log2;
+	__u32 rq_size_log2;
+};
+
+struct iwch_reg_user_mr_resp {
+	__u32 pbl_addr;
+};
+
+struct iwch_req_notify_cq {
+	__u32 rptr;
+};
+#endif


From swise at opengridcomputing.com  Thu Dec 14 05:54:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:54:36 -0600
Subject: [openib-general] [PATCH  v4 04/13] Connection Manager
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135435.21159.92185.stgit@dell3.ogc.int>


This code implements the iWARP CM provider methods for the Chelsio driver.
The Chelsio ULLD is used to set up and tear down TCP connections, and the
T3 RDMA Core is used to move the connections in and out of RDMA mode.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cm.c | 2058 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  223 ++++
 drivers/infiniband/hw/cxgb3/tcb.h     |  603 ++++++++++
 3 files changed, 2884 insertions(+), 0 deletions(-)
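
Note for reviewers: the connection setup path in this patch picks an MTU
table index (find_best_mtu) and later derives the effective MSS from the
negotiated TCP options (set_emss). The arithmetic can be checked in
userspace with a sketch like the one below. It is only an illustration:
example_mtus, pick_mtu_index and effective_mss are hypothetical names, and
the table values are made up here; the driver reads the real table from
the adapter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical MTU table, for illustration only (not the adapter's table). */
const unsigned short example_mtus[] = {
	88, 256, 512, 576, 1024, 1280, 1492, 1500, 9000
};

/*
 * Return the index of the largest table entry that does not exceed mtu,
 * or 0 if even the first entry is too large. Same loop as find_best_mtu().
 */
unsigned int pick_mtu_index(const unsigned short *mtus, size_t nmtus,
			    unsigned short mtu)
{
	size_t i = 0;

	assert(nmtus > 0);
	while (i < nmtus - 1 && mtus[i + 1] <= mtu)
		++i;
	return (unsigned int)i;
}

/*
 * Effective MSS as computed in set_emss(): the MTU minus 40 bytes of
 * IPv4 + TCP base headers, minus 12 more when TCP timestamps are in use,
 * and never less than 128 bytes.
 */
int effective_mss(unsigned short mtu, int tstamp)
{
	int emss = (int)mtu - 40;

	if (tstamp)
		emss -= 12;
	if (emss < 128)
		emss = 128;
	return emss;
}
```

With the illustrative table, a 1500-byte path MTU selects index 7 and gives
an effective MSS of 1460 bytes, or 1448 bytes when timestamps are enabled.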

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
new file mode 100644
index 0000000..962618f
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -0,0 +1,2058 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/skbuff.h>
+#include <linux/timer.h>
+#include <linux/notifier.h>
+
+#include <net/neighbour.h>
+#include <net/netevent.h>
+#include <net/route.h>
+
+#include "tcb.h"
+#include "cxgb3_offload.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+
+char *states[] = {
+	"idle",
+	"listen",
+	"connecting",
+	"mpa_wait_req",
+	"mpa_req_sent",
+	"mpa_req_rcvd",
+	"mpa_rep_sent",
+	"fpdu_mode",
+	"aborting",
+	"closing",
+	"moribund",
+	"dead",
+	NULL,
+};
+
+static int ep_timeout_secs = 10;
+module_param(ep_timeout_secs, int, 0444);
+MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
+				   "in seconds (default=10)");
+
+static int mpa_rev = 1;
+module_param(mpa_rev, int, 0444);
+MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
+		 "1 is spec compliant. (default=1)");
+
+static int markers_enabled = 0;
+module_param(markers_enabled, int, 0444);
+MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
+
+static int crc_enabled = 1;
+module_param(crc_enabled, int, 0444);
+MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
+
+static int rcv_win = 512 * 1024;
+module_param(rcv_win, int, 0444);
+MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)");
+
+static int snd_win = 512 * 1024;
+module_param(snd_win, int, 0444);
+MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)");
+
+static unsigned int nocong = 1;
+module_param(nocong, uint, 0444);
+MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)");
+
+static void process_work(struct work_struct *work);
+static struct workqueue_struct *workq;
+DECLARE_WORK(skb_work, process_work);
+
+static struct sk_buff_head rxq;
+static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS];
+
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp);
+static void ep_timeout(unsigned long arg);
+static void connect_reply_upcall(struct iwch_ep *ep, int status);
+
+static void start_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	if (timer_pending(&ep->timer)) {
+		PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep);
+		del_timer_sync(&ep->timer);
+	} else
+		get_ep(&ep->com);
+	ep->timer.expires = jiffies + ep_timeout_secs * HZ;
+	ep->timer.data = (unsigned long)ep;
+	ep->timer.function = ep_timeout;
+	add_timer(&ep->timer);
+}
+
+static void stop_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	del_timer_sync(&ep->timer);
+	put_ep(&ep->com);
+}
+
+static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
+{
+	struct cpl_tid_release *req;
+
+	skb = get_skb(skb, sizeof *req, GFP_KERNEL);
+	if (!skb)
+		return;
+	req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
+	skb->priority = CPL_PRIORITY_SETUP;
+	tdev->send(tdev, skb);
+	return;
+}
+
+int iwch_quiesce_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+int iwch_resume_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = 0;
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static void set_emss(struct iwch_ep *ep, u16 opt)
+{
+	PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt);
+	ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40;
+	if (G_TCPOPT_TSTAMP(opt))
+		ep->emss -= 12;
+	if (ep->emss < 128)
+		ep->emss = 128;
+	PDBG("emss=%d\n", ep->emss);
+}
+
+static int state_comp_exch(struct iwch_ep_common *epc,
+			   enum iwch_ep_state comp,
+			   enum iwch_ep_state exch)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	ret = (epc->state == comp);
+	if (ret)
+		epc->state = exch;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return ret;
+}
+
+static enum iwch_ep_state state_read(struct iwch_ep_common *epc)
+{
+	unsigned long flags;
+	enum iwch_ep_state state;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	state = epc->state;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return state;
+}
+
+static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state],
+		states[new]);
+	epc->state = new;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return;
+}
+
+static void *alloc_ep(int size, gfp_t gfp)
+{
+	struct iwch_ep_common *epc;
+
+	epc = kmalloc(size, gfp);
+	if (epc) {
+		memset(epc, 0, size);
+		kref_init(&epc->kref);
+		spin_lock_init(&epc->lock);
+		init_waitqueue_head(&epc->waitq);
+	}
+	PDBG("%s alloc ep %p\n", __FUNCTION__, epc);
+	return (void *) epc;
+}
+
+void __free_ep(struct kref *kref)
+{
+	struct iwch_ep_common *epc;
+	epc = container_of(kref, struct iwch_ep_common, kref);
+	PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]);
+	kfree(epc);
+}
+
+static void release_ep_resources(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	state_set(&ep->com, DEAD);
+	cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, ep->hwtid, NULL);
+	put_ep(&ep->com);
+}
+
+static void process_work(struct work_struct *work)
+{
+	struct sk_buff *skb = NULL;
+	void *ep;
+	struct t3cdev *tdev;
+	int ret;
+
+	while ((skb = skb_dequeue(&rxq))) {
+		ep = *((void **) (skb->cb));
+		tdev = *((struct t3cdev **) (skb->cb + sizeof(void *)));
+		ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep);
+		if (ret & CPL_RET_BUF_DONE)
+			kfree_skb(skb);
+
+		/*
+		 * ep was referenced in sched(), and is freed here.
+		 */
+		put_ep((struct iwch_ep_common *)ep);
+	}
+}
+
+static int status2errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_NONE:
+		return 0;
+	case CPL_ERR_CONN_RESET:
+		return -ECONNRESET;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Try and reuse skbs already allocated...
+ */
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp)
+{
+	if (skb) {
+		BUG_ON(skb_cloned(skb));
+		skb_trim(skb, 0);
+		skb_get(skb);
+	} else {
+		skb = alloc_skb(len, gfp);
+	}
+	return skb;
+}
+
+static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip,
+				 __be32 peer_ip, __be16 local_port,
+				 __be16 peer_port, u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = 0,
+		.nl_u = {
+			 .ip4_u = {
+				   .daddr = peer_ip,
+				   .saddr = local_ip,
+				   .tos = tos}
+			 },
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			  .ports = {
+				    .sport = local_port,
+				    .dport = peer_port}
+			  }
+	};
+
+	if (ip_route_output_flow(&rt, &fl, NULL, 0))
+		return NULL;
+	return rt;
+}
+
+static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu)
+		++i;
+	return i;
+}
+
+static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for an active open.
+ */
+static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	printk(KERN_ERR MOD "ARP failure during connect\n");
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for a CPL_ABORT_REQ.  Change it into a no RST variant
+ * and send it along.
+ */
+static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	struct cpl_abort_req *req = cplhdr(skb);
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb3_ofld_send(dev, skb);
+}
+
+static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
+{
+	struct cpl_close_con_req *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
+{
+	struct cpl_abort_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(skb, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, abort_arp_failure);
+	req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
+	req->cmd = CPL_ABORT_SEND_RST;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_connect(struct iwch_ep *ep)
+{
+	struct cpl_act_open_req *req;
+	struct sk_buff *skb;
+	u32 opt0h, opt0l, opt2;
+	unsigned int mtu_idx;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+	skb->priority = CPL_PRIORITY_SETUP;
+	set_arp_failure_handler(skb, act_open_req_arp_failure);
+
+	req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->peer_port = ep->com.remote_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_ip = ep->com.remote_addr.sin_addr.s_addr;
+	req->opt0h = htonl(opt0h);
+	req->opt0l = htonl(opt0l);
+	req->params = 0;
+	req->opt2 = htonl(opt2);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+
+	PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen);
+
+	BUG_ON(skb_cloned(skb));
+
+	mpalen = sizeof(*mpa) + ep->plen;
+	if (skb->data + mpalen + sizeof(*req) > skb->end) {
+		kfree_skb(skb);
+		skb = alloc_skb(mpalen + sizeof(*req), GFP_KERNEL);
+		if (!skb) {
+			connect_reply_upcall(ep, -ENOMEM);
+			return;
+		}
+	}
+	skb_trim(skb, 0);
+	skb_reserve(skb, sizeof(*req));
+	skb_put(skb, mpalen);
+	skb->priority = CPL_PRIORITY_DATA;
+	mpa = (struct mpa_message *) skb->data;
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key));
+	mpa->flags = (crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->private_data_size = htons(ep->plen);
+	mpa->revision = mpa_rev;
+
+	if (ep->plen)
+		memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen);
+
+	/*
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	start_ep_timer(ep);
+	state_set(&ep->com, MPA_REQ_SENT);
+	return;
+}
+
+static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = MPA_REJECT;
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/*
+	 * Reference the mpa skb again.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(mpalen);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) | 
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/*
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	ep->mpa_skb = skb;
+	state_set(&ep->com, MPA_REP_SENT);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_establish *req = cplhdr(skb);
+	unsigned int tid = GET_TID(req);
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid);
+
+	dst_confirm(ep->dst);
+
+	/* setup the hwtid for this connection */
+	ep->hwtid = tid;
+	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
+
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	/* dealloc the atid */
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+
+	/* start MPA negotiation */
+	send_mpa_req(ep, skb);
+
+	return 0;
+}
+
+static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	state_set(&ep->com, ABORTING);
+	send_abort(ep, skb, GFP_KERNEL);
+}
+
+static void close_complete_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	if (ep->com.cm_id) {
+		PDBG("close complete delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void peer_close_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_DISCONNECT;
+	if (ep->com.cm_id) {
+		PDBG("peer close delivered ep %p cm_id %p tid %d\n", 
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static void peer_abort_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	event.status = -ECONNRESET;
+	if (ep->com.cm_id) {
+		PDBG("abort delivered ep %p cm_id %p tid %d\n", ep,
+		     ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_reply_upcall(struct iwch_ep *ep, int status)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REPLY;
+	event.status = status;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+
+	if ((status == 0) || (status == -ECONNREFUSED)) {
+		event.private_data_len = ep->plen;
+		event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	}
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep, 
+		     ep->hwtid, status);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+	if (status < 0 && ep->com.cm_id) {
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_request_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REQUEST;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+	event.private_data_len = ep->plen;
+	event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	event.provider_data = ep;
+	if (state_read(&ep->parent_ep->com) != DEAD)
+		ep->parent_ep->com.cm_id->event_handler(
+						ep->parent_ep->com.cm_id,
+						&event);
+	put_ep(&ep->parent_ep->com);
+	ep->parent_ep = NULL;
+}
+
+static void established_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_ESTABLISHED;
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static int update_rx_credits(struct iwch_ep *ep, u32 credits)
+{
+	struct cpl_rx_data_ack *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n");
+		return 0;
+	}
+
+	req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
+	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
+	skb->priority = CPL_PRIORITY_ACK;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return credits;
+}
+
+static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	int err;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/*
+	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+	state_set(&ep->com, FPDU_MODE);
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * Copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/*
+	 * If we don't even have the mpa message, then bail.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* Validate MPA header. */
+	if (mpa->revision != mpa_rev) {
+		err = -EPROTO;
+		goto err;
+	}
+	if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	/*
+	 * Fail if we received more data than the header and plen cover.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	if (mpa->flags & MPA_REJECT) {
+		err = -ECONNREFUSED;
+		goto err;
+	}
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start reply message including private data. And
+	 * the MPA header is valid.
+	 */
+
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+	    IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR |
+	    IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD;
+
+	/* bind QP and TID with INIT_WR */
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+	if (!err)
+		goto out;
+err:
+	abort_connection(ep, skb);
+out:
+	connect_reply_upcall(ep, err);
+	return;
+}
+
+static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/*
+	 * Stop mpa timer.  If it expired, then the state is
+	 * CLOSING and we bail since ep_timeout already aborted
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) == CLOSING)
+		return;
+
+	/* 
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+
+	/*
+	 * Copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/*
+	 * If we don't even have the mpa message, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* 
+	 * Validate MPA Header.
+	 */
+	if (mpa->revision != mpa_rev) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/* 
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		abort_connection(ep, skb);
+		return;
+	}
+
+	/*
+	 * Fail if we received more data than the header and plen cover.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		abort_connection(ep, skb);
+		return;
+	}
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start request message including private data.
+	 */
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	state_set(&ep->com, MPA_REQ_RCVD);
+
+	/* drive upcall */
+	connect_request_upcall(ep);
+	return;
+}
+
+static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_rx_data *hdr = cplhdr(skb);
+	unsigned int dlen = ntohs(hdr->len);
+
+	PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen);
+
+	skb_pull(skb, sizeof(*hdr));
+	skb_trim(skb, dlen);
+
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_SENT:
+		process_mpa_reply(ep, skb);
+		break;
+	case MPA_REQ_WAIT:
+		process_mpa_request(ep, skb);
+		break;
+	case MPA_REP_SENT:
+		break;
+	default:
+		printk(KERN_ERR MOD "%s Unexpected streaming data."
+		       " ep %p state %d tid %d\n",
+		       __FUNCTION__, ep, state_read(&ep->com), ep->hwtid);
+
+		/*
+		 * The ep will time out and inform the ULP of the failure.
+		 * See ep_timeout().
+		 */
+		break;
+	}
+
+	/* update RX credits */
+	update_rx_credits(ep, dlen);
+
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Upcall from the adapter indicating data has been transmitted.
+ * For us it's just the single MPA request or reply.  We can now free
+ * the skb holding the mpa message.
+ */
+static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_wr_ack *hdr = cplhdr(skb);
+	unsigned int credits = ntohs(hdr->credits);
+	enum iwch_qp_attr_mask mask;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+
+	if (credits == 0)
+		return CPL_RET_BUF_DONE;
+	BUG_ON(credits != 1);
+	BUG_ON(ep->mpa_skb == NULL);
+	kfree_skb(ep->mpa_skb);
+	ep->mpa_skb = NULL;
+	dst_confirm(ep->dst);
+	if (state_read(&ep->com) == MPA_REP_SENT) {
+		struct iwch_qp_attributes attrs;
+
+		/* bind QP to EP and move to RTS */
+		attrs.mpa_attr = ep->mpa_attr;
+		attrs.max_ird = ep->ird;
+		attrs.max_ord = ep->ord;
+		attrs.llp_stream_handle = ep;
+		attrs.next_state = IWCH_QP_STATE_RTS;
+
+		/* bind QP and TID with INIT_WR */
+		mask = IWCH_QP_ATTR_NEXT_STATE |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE | 
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_MAX_ORD;
+
+		ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, mask, &attrs, 1);
+
+		if (!ep->com.rpl_err) {
+			state_set(&ep->com, FPDU_MODE);
+			established_upcall(ep);
+		}
+
+		ep->com.rpl_done = 1;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	close_complete_upcall(ep);
+	release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status,
+	     status2errno(rpl->status));
+	connect_reply_upcall(ep, status2errno(rpl->status));
+	state_set(&ep->com, DEAD);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, GET_TID(rpl), NULL);
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	put_ep(&ep->com);
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_start(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_pass_open_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+
+	req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_port = 0;
+	req->peer_ip = 0;
+	req->peer_netmask = 0;
+	req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS);
+	req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10));
+	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
+
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_pass_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep, 
+	     rpl->status, status2errno(rpl->status));
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_stop(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_close_listserv_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
+			     void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_close_listserv_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+	return CPL_RET_BUF_DONE;
+}
+
+static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
+{
+	struct cpl_pass_accept_rpl *rpl;
+	unsigned int mtu_idx;
+	u32 opt0h, opt0l, opt2;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(*rpl));
+	skb_get(skb);
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+
+	rpl = cplhdr(skb);
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid));
+	rpl->peer_ip = peer_ip;
+	rpl->opt0h = htonl(opt0h);
+	rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT);
+	rpl->opt2 = htonl(opt2);
+	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
+	skb->priority = CPL_PRIORITY_SETUP;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+
+	return;
+}
+
+static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
+		      struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid, 
+	     peer_ip);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(struct cpl_tid_release));
+	skb_get(skb);
+
+	if (tdev->type == T3B)
+		release_tid(tdev, hwtid, skb);
+	else {
+		struct cpl_pass_accept_rpl *rpl;
+
+		rpl = cplhdr(skb);
+		skb->priority = CPL_PRIORITY_SETUP;
+		rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+		OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, 
+						      hwtid));
+		rpl->peer_ip = peer_ip;
+		rpl->opt0h = htonl(F_TCAM_BYPASS);
+		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
+		rpl->opt2 = 0;
+		rpl->rsvd = rpl->opt2;
+		tdev->send(tdev, skb);
+	}
+}
+
+static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *child_ep, *parent_ep = ctx;
+	struct cpl_pass_accept_req *req = cplhdr(skb);
+	unsigned int hwtid = GET_TID(req);
+	struct dst_entry *dst;
+	struct l2t_entry *l2t;
+	struct rtable *rt;
+	struct iff_mac tim;
+
+	PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid);
+
+	if (state_read(&parent_ep->com) != LISTEN) {
+		printk(KERN_ERR MOD "%s - listening ep not in LISTEN\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+
+	/*
+	 * Find the netdev for this connection request.
+	 */
+	tim.mac_addr = req->dst_mac;
+	tim.vlan_tag = ntohs(req->vlan_tag);
+	if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) {
+		printk(KERN_ERR MOD
+			"%s bad dst mac %02x %02x %02x %02x %02x %02x\n",
+			__FUNCTION__,
+			req->dst_mac[0],
+			req->dst_mac[1],
+			req->dst_mac[2],
+			req->dst_mac[3],
+			req->dst_mac[4],
+			req->dst_mac[5]);
+		goto reject;
+	}
+
+	/* Find output route */
+	rt = find_route(tdev,
+			req->local_ip,
+			req->peer_ip,
+			req->local_port,
+			req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid)));
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - failed to find dst entry!\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+	dst = &rt->u.dst;
+	l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev);
+	if (!l2t) {
+		printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n",
+		       __FUNCTION__);
+		dst_release(dst);
+		goto reject;
+	}
+	child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL);
+	if (!child_ep) {
+		printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n",
+		       __FUNCTION__);
+		l2t_release(L2DATA(tdev), l2t);
+		dst_release(dst);
+		goto reject;
+	}
+	state_set(&child_ep->com, CONNECTING);
+	child_ep->com.tdev = tdev;
+	child_ep->com.cm_id = NULL;
+	child_ep->com.local_addr.sin_family = PF_INET;
+	child_ep->com.local_addr.sin_port = req->local_port;
+	child_ep->com.local_addr.sin_addr.s_addr = req->local_ip;
+	child_ep->com.remote_addr.sin_family = PF_INET;
+	child_ep->com.remote_addr.sin_port = req->peer_port;
+	child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip;
+	get_ep(&parent_ep->com);
+	child_ep->parent_ep = parent_ep;
+	child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid));
+	child_ep->l2t = l2t;
+	child_ep->dst = dst;
+	child_ep->hwtid = hwtid;
+	init_timer(&child_ep->timer);
+	cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid);
+	accept_cr(child_ep, req->peer_ip, skb);
+	goto out;
+reject:
+	reject_cr(tdev, hwtid, req->peer_ip, skb);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_pass_establish *req = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	dst_confirm(ep->dst);
+	state_set(&ep->com, MPA_REQ_WAIT);
+	start_ep_timer(ep);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int abort = 0;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	dst_confirm(ep->dst);
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_WAIT:
+		state_set(&ep->com, CLOSING);
+		break;
+	case MPA_REQ_SENT:
+		state_set(&ep->com, CLOSING);
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REQ_RCVD:
+
+		/*
+		 * We're gonna mark this puppy CLOSING, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		state_set(&ep->com, CLOSING);
+		get_ep(&ep->com);
+		break;
+	case MPA_REP_SENT:
+		state_set(&ep->com, CLOSING);
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case FPDU_MODE:
+		state_set(&ep->com, CLOSING);
+		peer_close_upcall(ep);
+		attrs.next_state = IWCH_QP_STATE_CLOSING;
+		ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		if (ret) {
+			printk(KERN_ERR MOD "%s - qp <- closing err!\n",
+			       __FUNCTION__);
+			abort = 1;
+		}
+		break;
+	case ABORTING:
+		goto out;
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		goto out;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+				       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				       &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		goto out;
+	case DEAD:
+		goto out;
+	default:
+		BUG();
+	}
+	iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+	       status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_abort_req_rss *req = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+	struct cpl_abort_rpl *rpl;
+	struct sk_buff *rpl_skb;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int state;
+
+	if (is_neg_adv_abort(req->status)) {
+		PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep, 
+		     ep->hwtid);
+		t3_l2t_send_event(ep->com.tdev, ep->l2t);
+		return CPL_RET_BUF_DONE;
+	}
+
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state);
+	switch (state) {
+	case CONNECTING:
+		break;
+	case MPA_REQ_WAIT:
+		break;
+	case MPA_REQ_SENT:
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REP_SENT:
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case MPA_REQ_RCVD:
+	
+		/* 
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		get_ep(&ep->com);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+	case FPDU_MODE:
+	case CLOSING:
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+			if (ret)
+				printk(KERN_ERR MOD 
+				       "%s - qp <- error failed!\n",
+				       __FUNCTION__);
+		}
+		peer_abort_upcall(ep);
+		break;
+	case ABORTING:
+		break;
+	case DEAD:
+		PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__);
+		return CPL_RET_BUF_DONE;
+	default:
+		BUG();
+		break;
+	}
+	dst_confirm(ep->dst);
+	
+	rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL);
+	if (!rpl_skb) {
+		printk(KERN_ERR MOD "%s - cannot allocate skb!\n",
+		       __FUNCTION__);
+		dst_release(ep->dst);
+		l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+		put_ep(&ep->com);
+		return CPL_RET_BUF_DONE;
+	}
+	rpl_skb->priority = CPL_PRIORITY_DATA;
+	rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl));
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL));
+	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
+	rpl->cmd = CPL_ABORT_NO_RST;
+	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	if (state != ABORTING)
+		release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(!ep);
+
+	/* The cm_id may be null if we failed to connect */
+	switch (state_read(&ep->com)) {
+	case CLOSING:
+		start_ep_timer(ep);
+		state_set(&ep->com, MORIBUND);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if ((ep->com.cm_id) && (ep->com.qp)) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+					     ep->com.qp, 
+					     IWCH_QP_ATTR_NEXT_STATE,
+					     &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		release_ep_resources(ep);
+		break;
+	case DEAD:
+	default:
+		BUG_ON(1);
+		break;
+	}
+	
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * T3A does 3 things when a TERM is received:
+ * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet
+ * 2) generate an async event on the QP with the TERMINATE opcode
+ * 3) post a TERMINATE opcode CQE into the associated CQ.
+ *
+ * For (1), we save the message in the qp so the consumer can retrieve it later.
+ * For (2), we move the QP into TERMINATE, post a QP event and disconnect.
+ * For (3), we toss the CQE in cxio_poll_cq().
+ * 
+ * terminate() handles case (1)...
+ */
+static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb_pull(skb, sizeof(struct cpl_rdma_terminate));
+	PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len);
+	memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len);
+	ep->com.qp->attr.terminate_msg_len = skb->len;
+	ep->com.qp->attr.is_terminate_local = 0;
+	return CPL_RET_BUF_DONE;
+}
+
+static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_rdma_ec_status *rep = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid, 
+	     rep->status);
+	if (rep->status) {
+		struct iwch_qp_attributes attrs;
+
+		printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n",
+		       __FUNCTION__, ep->hwtid);
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(ep->com.qp->rhp,
+			       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+			       &attrs, 1);
+		abort_connection(ep, NULL);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static void ep_timeout(unsigned long arg)
+{
+	struct iwch_ep *ep = (struct iwch_ep *)arg;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_comp_exch(&ep->com, MPA_REQ_SENT, CLOSING)) {
+		struct sk_buff *skb;
+
+		connect_reply_upcall(ep, -ETIMEDOUT);
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MPA_REQ_WAIT, CLOSING)) {
+		struct sk_buff *skb;
+
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	if (state_comp_exch(&ep->com, MORIBUND, ABORTING)) {
+		struct sk_buff *skb;
+
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		}
+		skb = alloc_skb(sizeof(struct cpl_abort_req), GFP_ATOMIC);
+		if (skb)
+			abort_connection(ep, skb);
+	}
+	put_ep(&ep->com);
+}
+
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+	struct iwch_ep *ep = to_ep(cm_id);
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	state_set(&ep->com, CLOSING);
+	if (mpa_rev == 0)
+		abort_connection(ep, NULL);
+	else {
+		err = send_mpa_reject(ep, pdata, pdata_len);
+		err = send_halfclose(ep, GFP_KERNEL);
+	}
+	return 0;
+}
+
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	struct iwch_ep *ep = to_ep(cm_id);
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(!qp);
+
+	if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) ||
+	    (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) {
+		abort_connection(ep, NULL);
+		return -EINVAL;
+	}
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = qp;
+
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord);
+	get_ep(&ep->com);
+	err = send_mpa_reply(ep, conn_param->private_data, 
+			     conn_param->private_data_len);
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+		put_ep(&ep->com);
+		return err;
+	}
+	
+	/* bind QP to EP and move to RTS */
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	/* bind QP and TID with INIT_WR */
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+			     IWCH_QP_ATTR_LLP_STREAM_HANDLE | 
+			     IWCH_QP_ATTR_MPA_ATTR |
+			     IWCH_QP_ATTR_MAX_IRD |
+			     IWCH_QP_ATTR_MAX_ORD;
+
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL);
+	} else {
+		state_set(&ep->com, FPDU_MODE);
+		established_upcall(ep);
+	}
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_ep *ep;
+	struct rtable *rt;
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto out;
+	}
+	init_timer(&ep->timer);
+	ep->plen = conn_param->private_data_len;
+	if (ep->plen)
+		memcpy(ep->mpa_pkt + sizeof(struct mpa_message), 
+		       conn_param->private_data, ep->plen);
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	ep->com.tdev = h->rdev.t3cdev_p;
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = get_qhp(h, conn_param->qpn);
+	BUG_ON(!ep->com.qp);
+	PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn, 
+	     ep->com.qp, cm_id);
+
+	/* 
+	 * Allocate an active TID to initiate a TCP connection. 
+	 */
+	ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->atid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	/* find a route */
+	rt = find_route(h->rdev.t3cdev_p,
+			cm_id->local_addr.sin_addr.s_addr,
+			cm_id->remote_addr.sin_addr.s_addr,
+			cm_id->local_addr.sin_port,
+			cm_id->remote_addr.sin_port, IPTOS_LOWDELAY);
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__);
+		err = -EHOSTUNREACH;
+		goto fail3;
+	}
+	ep->dst = &rt->u.dst;
+
+	/* get a l2t entry */
+	ep->l2t = t3_l2t_get(ep->com.tdev, ep->dst->neighbour,
+			     ep->dst->neighbour->dev);
+	if (!ep->l2t) {
+		printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail4;
+	}
+
+	state_set(&ep->com, CONNECTING);
+	ep->tos = IPTOS_LOWDELAY;
+	ep->com.local_addr = cm_id->local_addr;
+	ep->com.remote_addr = cm_id->remote_addr;
+
+	/* send connect request to rnic */
+	err = send_connect(ep);
+	if (!err)
+		goto out;
+
+	l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t);
+fail4:
+	dst_release(ep->dst);
+fail3:
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+fail2:
+	put_ep(&ep->com);
+out:
+	return err;
+}
+
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_listen_ep *ep;
+
+
+	might_sleep();
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail1;
+	}
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.tdev = h->rdev.t3cdev_p;
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->backlog = backlog;
+	ep->com.local_addr = cm_id->local_addr;
+
+	/* 
+	 * Allocate a server TID.
+	 */
+	ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->stid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	state_set(&ep->com, LISTEN);
+	err = listen_start(ep);
+	if (err)
+		goto fail3;
+
+	/* wait for pass_open_rpl */
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	err = ep->com.rpl_err;
+	if (!err) {
+		cm_id->provider_data = ep;
+		goto out;
+	}
+fail3:
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+fail2:
+	put_ep(&ep->com);
+fail1:
+out:
+	return err;
+}
+
+int iwch_destroy_listen(struct iw_cm_id *cm_id)
+{
+	int err;
+	struct iwch_listen_ep *ep = to_listen_ep(cm_id);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	might_sleep();
+	state_set(&ep->com, DEAD);
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	err = listen_stop(ep);
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+	err = ep->com.rpl_err;
+	cm_id->rem_ref(cm_id);
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
+{
+	int ret = 0;
+	int state;
+
+	
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep, 
+	     states[state], abrupt);
+	if (state == DEAD) {
+		PDBG("%s already dead ep %p\n", __FUNCTION__, ep);
+		return 0;
+	}
+	if (abrupt) {
+		if (state != ABORTING) {
+			state_set(&ep->com, ABORTING);
+			ret = send_abort(ep, NULL, gfp);
+		}
+	} else {
+
+		if (state != CLOSING)
+			state_set(&ep->com, CLOSING);
+		else {
+			start_ep_timer(ep);
+			state_set(&ep->com, MORIBUND);
+		}
+
+		ret = send_halfclose(ep, gfp);
+	}
+	return ret;
+}
+
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, 
+		     struct l2t_entry *l2t)
+{
+	struct iwch_ep *ep = ctx;
+	
+	if (ep->dst != old)
+		return 0;
+
+	PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new, 
+	     l2t);
+	dst_hold(new);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	ep->l2t = l2t;
+	dst_release(old);
+	ep->dst = new;
+	return 1;
+}
+
+/* 
+ * All the CM events are handled on a work queue to have a safe context.
+ */
+static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep_common *epc = ctx;
+
+	get_ep(epc);
+
+	/*
+	 * Save ctx and tdev in the skb->cb area.
+	 */
+	*((void **) skb->cb) = ctx;
+	*((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev;
+
+	/* 
+	 * Queue the skb and schedule the worker thread.
+	 */
+	skb_queue_tail(&rxq, skb);
+	queue_work(workq, &skb_work);
+	return 0;
+}
+
+int __init iwch_cm_init(void)
+{
+	skb_queue_head_init(&rxq);
+
+	workq = create_singlethread_workqueue("iw_cxgb3");
+	if (!workq)
+		return -ENOMEM;
+
+	/*
+	 * All upcalls from the T3 Core go to sched() to 
+	 * schedule the processing on a work queue.
+	 */
+	t3c_handlers[CPL_ACT_ESTABLISH] = sched;
+	t3c_handlers[CPL_ACT_OPEN_RPL] = sched;
+	t3c_handlers[CPL_RX_DATA] = sched;
+	t3c_handlers[CPL_TX_DMA_ACK] = sched;
+	t3c_handlers[CPL_ABORT_RPL_RSS] = sched;
+	t3c_handlers[CPL_ABORT_RPL] = sched;
+	t3c_handlers[CPL_PASS_OPEN_RPL] = sched;
+	t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched;
+	t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched;
+	t3c_handlers[CPL_PASS_ESTABLISH] = sched;
+	t3c_handlers[CPL_PEER_CLOSE] = sched;
+	t3c_handlers[CPL_CLOSE_CON_RPL] = sched;
+	t3c_handlers[CPL_ABORT_REQ_RSS] = sched;
+	t3c_handlers[CPL_RDMA_TERMINATE] = sched;
+	t3c_handlers[CPL_RDMA_EC_STATUS] = sched;
+
+	/*
+	 * These are the real handlers that are called from a 
+	 * work queue.
+	 */
+	work_handlers[CPL_ACT_ESTABLISH] = act_establish;
+	work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl;
+	work_handlers[CPL_RX_DATA] = rx_data;
+	work_handlers[CPL_TX_DMA_ACK] = tx_ack;
+	work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl;
+	work_handlers[CPL_ABORT_RPL] = abort_rpl;
+	work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl;
+	work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl;
+	work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req;
+	work_handlers[CPL_PASS_ESTABLISH] = pass_establish;
+	work_handlers[CPL_PEER_CLOSE] = peer_close;
+	work_handlers[CPL_ABORT_REQ_RSS] = peer_abort;
+	work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl;
+	work_handlers[CPL_RDMA_TERMINATE] = terminate;
+	work_handlers[CPL_RDMA_EC_STATUS] = ec_status;
+	return 0;
+}
+
+void __exit iwch_cm_term(void)
+{
+	flush_workqueue(workq);
+	destroy_workqueue(workq);
+}
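
The two handler tables set up in iwch_cm_init() split each CPL message into a
cheap "record and defer" step (sched) and the real handler that runs later from
the work queue. The following is only a userspace sketch of that pattern, not
driver code: the opcodes, struct msg, sched_msg(), process_work() and
count_hit() are all hypothetical stand-ins, and a plain linked list replaces
the skb queue and the kernel workqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes standing in for the CPL_* values. */
enum { OP_ESTABLISH, OP_CLOSE, OP_MAX };

struct msg {
	int opcode;
	void *ctx;		/* saved context, like skb->cb in the patch */
	struct msg *next;
};

typedef int (*handler_fn)(struct msg *m);

static struct msg *q_head, *q_tail;		/* stand-in for rxq */
static handler_fn work_handlers[OP_MAX];	/* real handlers, run later */

/* Fast path: stash the context, queue the message, do no real work. */
static int sched_msg(struct msg *m, void *ctx)
{
	assert(m != NULL);
	m->ctx = ctx;
	m->next = NULL;
	if (q_tail)
		q_tail->next = m;
	else
		q_head = m;
	q_tail = m;
	return 0;
}

/* Worker: drain the queue in FIFO order and dispatch on the opcode.
 * Returns the number of messages taken off the queue. */
static int process_work(void)
{
	int n = 0;

	while (q_head) {
		struct msg *m = q_head;

		q_head = m->next;
		if (!q_head)
			q_tail = NULL;
		if (m->opcode >= 0 && m->opcode < OP_MAX &&
		    work_handlers[m->opcode])
			work_handlers[m->opcode](m);
		n++;
	}
	return n;
}

/* Example handler: bump the counter that the context pointer refers to. */
static int count_hit(struct msg *m)
{
	(*(int *)m->ctx)++;
	return 0;
}
```

Nothing is handled at the time sched_msg() is called; the registered handler
only runs once process_work() drains the queue, which mirrors why the driver
can take sleeping paths (GFP_KERNEL, wait_event) in its real handlers.
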
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
new file mode 100644
index 0000000..893f9d0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _IWCH_CM_H_
+#define _IWCH_CM_H_
+
+#include <linux/inet.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/iw_cm.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+
+#define MPA_KEY_REQ "MPA ID Req Frame"
+#define MPA_KEY_REP "MPA ID Rep Frame"
+
+#define MPA_MAX_PRIVATE_DATA 	256
+#define MPA_REV 		0	/* XXX - amso1100 uses rev 0 ! */
+#define MPA_REJECT 		0x20
+#define MPA_CRC			0x40
+#define MPA_MARKERS		0x80
+#define MPA_FLAGS_MASK		0xE0
+
+#define put_ep(ep) { \
+	PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__,  \
+	     ep, atomic_read(&((ep)->kref.refcount))); \
+	kref_put(&((ep)->kref), __free_ep); \
+}
+
+#define get_ep(ep) { \
+	PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \
+	     ep, atomic_read(&((ep)->kref.refcount))); \
+	kref_get(&((ep)->kref));  \
+}
+
+struct mpa_message {
+	u8 key[16];
+	u8 flags;
+	u8 revision;
+	__be16 private_data_size;
+	u8 private_data[0];
+};
+
+struct terminate_message {
+	u8 layer_etype;
+	u8 ecode;
+	__be16 hdrct_rsvd;
+	u8 len_hdrs[0];
+};
+
+#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28)
+
+enum iwch_layers_types {
+	LAYER_RDMAP 		= 0x00,
+	LAYER_DDP		= 0x10,
+	LAYER_MPA		= 0x20,
+	RDMAP_LOCAL_CATA	= 0x00,
+	RDMAP_REMOTE_PROT	= 0x01,
+	RDMAP_REMOTE_OP		= 0x02,
+	DDP_LOCAL_CATA		= 0x00,
+	DDP_TAGGED_ERR		= 0x01,
+	DDP_UNTAGGED_ERR	= 0x02,
+	DDP_LLP			= 0x03
+};
+
+enum iwch_rdma_ecodes {
+	RDMAP_INV_STAG		= 0x00,
+	RDMAP_BASE_BOUNDS	= 0x01,
+	RDMAP_ACC_VIOL		= 0x02,
+	RDMAP_STAG_NOT_ASSOC	= 0x03,
+	RDMAP_TO_WRAP		= 0x04,
+	RDMAP_INV_VERS		= 0x05,
+	RDMAP_INV_OPCODE	= 0x06,
+	RDMAP_STREAM_CATA	= 0x07,
+	RDMAP_GLOBAL_CATA	= 0x08,
+	RDMAP_CANT_INV_STAG	= 0x09,
+	RDMAP_UNSPECIFIED	= 0xff	
+};
+
+enum iwch_ddp_ecodes {
+	DDPT_INV_STAG		= 0x00,
+	DDPT_BASE_BOUNDS	= 0x01,
+	DDPT_STAG_NOT_ASSOC	= 0x02,
+	DDPT_TO_WRAP		= 0x03,
+	DDPT_INV_VERS		= 0x04,
+	DDPU_INV_QN		= 0x01,
+	DDPU_INV_MSN_NOBUF	= 0x02,
+	DDPU_INV_MSN_RANGE	= 0x03,
+	DDPU_INV_MO		= 0x04,
+	DDPU_MSG_TOOBIG		= 0x05,
+	DDPU_INV_VERS		= 0x06
+};
+
+enum iwch_mpa_ecodes {
+	MPA_CRC_ERR		= 0x02,
+	MPA_MARKER_ERR		= 0x03
+};
+
+enum iwch_ep_state {
+	IDLE = 0,
+	LISTEN,	
+	CONNECTING,
+	MPA_REQ_WAIT,
+	MPA_REQ_SENT,
+	MPA_REQ_RCVD,
+	MPA_REP_SENT,
+	FPDU_MODE,
+	ABORTING,
+	CLOSING,
+	MORIBUND,
+	DEAD,
+};
+
+struct iwch_ep_common {
+	struct iw_cm_id *cm_id;
+	struct iwch_qp *qp;
+	struct t3cdev *tdev;
+	enum iwch_ep_state state;
+	struct kref kref;
+	spinlock_t lock;
+	struct sockaddr_in local_addr;
+	struct sockaddr_in remote_addr;
+	wait_queue_head_t waitq;
+	int rpl_done;
+	int rpl_err;
+};
+
+struct iwch_listen_ep {
+	struct iwch_ep_common com;
+	unsigned int stid;
+	int backlog;
+};
+
+struct iwch_ep {
+	struct iwch_ep_common com;
+	struct iwch_ep *parent_ep;
+	struct timer_list timer;
+	unsigned int atid;
+	u32 hwtid;
+	u32 snd_seq;
+	struct l2t_entry *l2t;
+	struct dst_entry *dst;
+	struct sk_buff *mpa_skb;
+	struct iwch_mpa_attributes mpa_attr;
+	unsigned int mpa_pkt_len;
+	u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA];
+	u8 tos;
+	u16 emss;
+	u16 plen;
+	u32 ird;
+	u32 ord;
+};
+
+static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_ep *)cm_id->provider_data;
+}
+
+static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_listen_ep *)cm_id->provider_data;
+}
+
+static inline int compute_wscale(int win)
+{
+	int wscale = 0;
+
+	while (wscale < 14 && (65535<<wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+/* CM prototypes */
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog);
+int iwch_destroy_listen(struct iw_cm_id *cm_id);
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp);
+int iwch_quiesce_tid(struct iwch_ep *ep);
+int iwch_resume_tid(struct iwch_ep *ep);
+void __free_ep(struct kref *kref);
+void iwch_rearp(struct iwch_ep *ep);
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t);
+
+int __init iwch_cm_init(void);
+void __exit iwch_cm_term(void);
+
+#endif				/* _IWCH_CM_H_ */
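
For readers checking compute_wscale() from iwch_cm.h: it picks the smallest
TCP window-scale shift (capped at 14) such that a 16-bit window field can
cover the requested window. Below is a userspace copy for experimentation
only; max_window() is a hypothetical helper added here for illustration and is
not part of the patch.

```c
#include <assert.h>

/* Userspace copy of the helper from iwch_cm.h. */
static int compute_wscale(int win)
{
	int wscale = 0;

	/* Grow the shift until 65535 << wscale is at least 'win'. */
	while (wscale < 14 && (65535 << wscale) < win)
		wscale++;
	return wscale;
}

/* Hypothetical helper: largest window advertisable at a given scale. */
static long max_window(int wscale)
{
	assert(wscale >= 0 && wscale <= 14);
	return 65535L << wscale;
}
```

A 256 KB window (262144 bytes) needs a shift of 3, because 65535 << 2 is
262140, which is four bytes short.
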
diff --git a/drivers/infiniband/hw/cxgb3/tcb.h b/drivers/infiniband/hw/cxgb3/tcb.h
new file mode 100644
index 0000000..f287a7c
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/tcb.h
@@ -0,0 +1,603 @@
+/* This file is automatically generated --- do not edit */
+
+#ifndef _TCB_DEFS_H
+#define _TCB_DEFS_H
+
+#define W_TCB_T_STATE    0
+#define S_TCB_T_STATE    0
+#define M_TCB_T_STATE    0xfULL
+#define V_TCB_T_STATE(x) ((x) << S_TCB_T_STATE)
+
+#define W_TCB_TIMER    0
+#define S_TCB_TIMER    4
+#define M_TCB_TIMER    0x1ULL
+#define V_TCB_TIMER(x) ((x) << S_TCB_TIMER)
+
+#define W_TCB_DACK_TIMER    0
+#define S_TCB_DACK_TIMER    5
+#define M_TCB_DACK_TIMER    0x1ULL
+#define V_TCB_DACK_TIMER(x) ((x) << S_TCB_DACK_TIMER)
+
+#define W_TCB_DEL_FLAG    0
+#define S_TCB_DEL_FLAG    6
+#define M_TCB_DEL_FLAG    0x1ULL
+#define V_TCB_DEL_FLAG(x) ((x) << S_TCB_DEL_FLAG)
+
+#define W_TCB_L2T_IX    0
+#define S_TCB_L2T_IX    7
+#define M_TCB_L2T_IX    0x7ffULL
+#define V_TCB_L2T_IX(x) ((x) << S_TCB_L2T_IX)
+
+#define W_TCB_SMAC_SEL    0
+#define S_TCB_SMAC_SEL    18
+#define M_TCB_SMAC_SEL    0x3ULL
+#define V_TCB_SMAC_SEL(x) ((x) << S_TCB_SMAC_SEL)
+
+#define W_TCB_TOS    0
+#define S_TCB_TOS    20
+#define M_TCB_TOS    0x3fULL
+#define V_TCB_TOS(x) ((x) << S_TCB_TOS)
+
+#define W_TCB_MAX_RT    0
+#define S_TCB_MAX_RT    26
+#define M_TCB_MAX_RT    0xfULL
+#define V_TCB_MAX_RT(x) ((x) << S_TCB_MAX_RT)
+
+#define W_TCB_T_RXTSHIFT    0
+#define S_TCB_T_RXTSHIFT    30
+#define M_TCB_T_RXTSHIFT    0xfULL
+#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT)
+
+#define W_TCB_T_DUPACKS    1
+#define S_TCB_T_DUPACKS    2
+#define M_TCB_T_DUPACKS    0xfULL
+#define V_TCB_T_DUPACKS(x) ((x) << S_TCB_T_DUPACKS)
+
+#define W_TCB_T_MAXSEG    1
+#define S_TCB_T_MAXSEG    6
+#define M_TCB_T_MAXSEG    0xfULL
+#define V_TCB_T_MAXSEG(x) ((x) << S_TCB_T_MAXSEG)
+
+#define W_TCB_T_FLAGS1    1
+#define S_TCB_T_FLAGS1    10
+#define M_TCB_T_FLAGS1    0xffffffffULL
+#define V_TCB_T_FLAGS1(x) ((x) << S_TCB_T_FLAGS1)
+
+#define W_TCB_T_MIGRATION    1
+#define S_TCB_T_MIGRATION    20
+#define M_TCB_T_MIGRATION    0x1ULL
+#define V_TCB_T_MIGRATION(x) ((x) << S_TCB_T_MIGRATION)
+
+#define W_TCB_T_FLAGS2    2
+#define S_TCB_T_FLAGS2    10
+#define M_TCB_T_FLAGS2    0x7fULL
+#define V_TCB_T_FLAGS2(x) ((x) << S_TCB_T_FLAGS2)
+
+#define W_TCB_SND_SCALE    2
+#define S_TCB_SND_SCALE    17
+#define M_TCB_SND_SCALE    0xfULL
+#define V_TCB_SND_SCALE(x) ((x) << S_TCB_SND_SCALE)
+
+#define W_TCB_RCV_SCALE    2
+#define S_TCB_RCV_SCALE    21
+#define M_TCB_RCV_SCALE    0xfULL
+#define V_TCB_RCV_SCALE(x) ((x) << S_TCB_RCV_SCALE)
+
+#define W_TCB_SND_UNA_RAW    2
+#define S_TCB_SND_UNA_RAW    25
+#define M_TCB_SND_UNA_RAW    0x7ffffffULL
+#define V_TCB_SND_UNA_RAW(x) ((x) << S_TCB_SND_UNA_RAW)
+
+#define W_TCB_SND_NXT_RAW    3
+#define S_TCB_SND_NXT_RAW    20
+#define M_TCB_SND_NXT_RAW    0x7ffffffULL
+#define V_TCB_SND_NXT_RAW(x) ((x) << S_TCB_SND_NXT_RAW)
+
+#define W_TCB_RCV_NXT    4
+#define S_TCB_RCV_NXT    15
+#define M_TCB_RCV_NXT    0xffffffffULL
+#define V_TCB_RCV_NXT(x) ((x) << S_TCB_RCV_NXT)
+
+#define W_TCB_RCV_ADV    5
+#define S_TCB_RCV_ADV    15
+#define M_TCB_RCV_ADV    0xffffULL
+#define V_TCB_RCV_ADV(x) ((x) << S_TCB_RCV_ADV)
+
+#define W_TCB_SND_MAX_RAW    5
+#define S_TCB_SND_MAX_RAW    31
+#define M_TCB_SND_MAX_RAW    0x7ffffffULL
+#define V_TCB_SND_MAX_RAW(x) ((x) << S_TCB_SND_MAX_RAW)
+
+#define W_TCB_SND_CWND    6
+#define S_TCB_SND_CWND    26
+#define M_TCB_SND_CWND    0x7ffffffULL
+#define V_TCB_SND_CWND(x) ((x) << S_TCB_SND_CWND)
+
+#define W_TCB_SND_SSTHRESH    7
+#define S_TCB_SND_SSTHRESH    21
+#define M_TCB_SND_SSTHRESH    0x7ffffffULL
+#define V_TCB_SND_SSTHRESH(x) ((x) << S_TCB_SND_SSTHRESH)
+
+#define W_TCB_T_RTT_TS_RECENT_AGE    8
+#define S_TCB_T_RTT_TS_RECENT_AGE    16
+#define M_TCB_T_RTT_TS_RECENT_AGE    0xffffffffULL
+#define V_TCB_T_RTT_TS_RECENT_AGE(x) ((x) << S_TCB_T_RTT_TS_RECENT_AGE)
+
+#define W_TCB_T_RTSEQ_RECENT    9
+#define S_TCB_T_RTSEQ_RECENT    16
+#define M_TCB_T_RTSEQ_RECENT    0xffffffffULL
+#define V_TCB_T_RTSEQ_RECENT(x) ((x) << S_TCB_T_RTSEQ_RECENT)
+
+#define W_TCB_T_SRTT    10
+#define S_TCB_T_SRTT    16
+#define M_TCB_T_SRTT    0xffffULL
+#define V_TCB_T_SRTT(x) ((x) << S_TCB_T_SRTT)
+
+#define W_TCB_T_RTTVAR    11
+#define S_TCB_T_RTTVAR    0
+#define M_TCB_T_RTTVAR    0xffffULL
+#define V_TCB_T_RTTVAR(x) ((x) << S_TCB_T_RTTVAR)
+
+#define W_TCB_TS_LAST_ACK_SENT_RAW    11
+#define S_TCB_TS_LAST_ACK_SENT_RAW    16
+#define M_TCB_TS_LAST_ACK_SENT_RAW    0x7ffffffULL
+#define V_TCB_TS_LAST_ACK_SENT_RAW(x) ((x) << S_TCB_TS_LAST_ACK_SENT_RAW)
+
+#define W_TCB_DIP    12
+#define S_TCB_DIP    11
+#define M_TCB_DIP    0xffffffffULL
+#define V_TCB_DIP(x) ((x) << S_TCB_DIP)
+
+#define W_TCB_SIP    13
+#define S_TCB_SIP    11
+#define M_TCB_SIP    0xffffffffULL
+#define V_TCB_SIP(x) ((x) << S_TCB_SIP)
+
+#define W_TCB_DP    14
+#define S_TCB_DP    11
+#define M_TCB_DP    0xffffULL
+#define V_TCB_DP(x) ((x) << S_TCB_DP)
+
+#define W_TCB_SP    14
+#define S_TCB_SP    27
+#define M_TCB_SP    0xffffULL
+#define V_TCB_SP(x) ((x) << S_TCB_SP)
+
+#define W_TCB_TIMESTAMP    15
+#define S_TCB_TIMESTAMP    11
+#define M_TCB_TIMESTAMP    0xffffffffULL
+#define V_TCB_TIMESTAMP(x) ((x) << S_TCB_TIMESTAMP)
+
+#define W_TCB_TIMESTAMP_OFFSET    16
+#define S_TCB_TIMESTAMP_OFFSET    11
+#define M_TCB_TIMESTAMP_OFFSET    0xfULL
+#define V_TCB_TIMESTAMP_OFFSET(x) ((x) << S_TCB_TIMESTAMP_OFFSET)
+
+#define W_TCB_TX_MAX    16
+#define S_TCB_TX_MAX    15
+#define M_TCB_TX_MAX    0xffffffffULL
+#define V_TCB_TX_MAX(x) ((x) << S_TCB_TX_MAX)
+
+#define W_TCB_TX_HDR_PTR_RAW    17
+#define S_TCB_TX_HDR_PTR_RAW    15
+#define M_TCB_TX_HDR_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_HDR_PTR_RAW(x) ((x) << S_TCB_TX_HDR_PTR_RAW)
+
+#define W_TCB_TX_LAST_PTR_RAW    18
+#define S_TCB_TX_LAST_PTR_RAW    0
+#define M_TCB_TX_LAST_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_LAST_PTR_RAW(x) ((x) << S_TCB_TX_LAST_PTR_RAW)
+
+#define W_TCB_TX_COMPACT    18
+#define S_TCB_TX_COMPACT    17
+#define M_TCB_TX_COMPACT    0x1ULL
+#define V_TCB_TX_COMPACT(x) ((x) << S_TCB_TX_COMPACT)
+
+#define W_TCB_RX_COMPACT    18
+#define S_TCB_RX_COMPACT    18
+#define M_TCB_RX_COMPACT    0x1ULL
+#define V_TCB_RX_COMPACT(x) ((x) << S_TCB_RX_COMPACT)
+
+#define W_TCB_RCV_WND    18
+#define S_TCB_RCV_WND    19
+#define M_TCB_RCV_WND    0x7ffffffULL
+#define V_TCB_RCV_WND(x) ((x) << S_TCB_RCV_WND)
+
+#define W_TCB_RX_HDR_OFFSET    19
+#define S_TCB_RX_HDR_OFFSET    14
+#define M_TCB_RX_HDR_OFFSET    0x7ffffffULL
+#define V_TCB_RX_HDR_OFFSET(x) ((x) << S_TCB_RX_HDR_OFFSET)
+
+#define W_TCB_RX_FRAG0_START_IDX_RAW    20
+#define S_TCB_RX_FRAG0_START_IDX_RAW    9
+#define M_TCB_RX_FRAG0_START_IDX_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG0_START_IDX_RAW(x) ((x) << S_TCB_RX_FRAG0_START_IDX_RAW)
+
+#define W_TCB_RX_FRAG1_START_IDX_OFFSET    21
+#define S_TCB_RX_FRAG1_START_IDX_OFFSET    4
+#define M_TCB_RX_FRAG1_START_IDX_OFFSET    0x7ffffffULL
+#define V_TCB_RX_FRAG1_START_IDX_OFFSET(x) ((x) << S_TCB_RX_FRAG1_START_IDX_OFFSET)
+
+#define W_TCB_RX_FRAG0_LEN    21
+#define S_TCB_RX_FRAG0_LEN    31
+#define M_TCB_RX_FRAG0_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG0_LEN(x) ((x) << S_TCB_RX_FRAG0_LEN)
+
+#define W_TCB_RX_FRAG1_LEN    22
+#define S_TCB_RX_FRAG1_LEN    26
+#define M_TCB_RX_FRAG1_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG1_LEN(x) ((x) << S_TCB_RX_FRAG1_LEN)
+
+#define W_TCB_NEWRENO_RECOVER    23
+#define S_TCB_NEWRENO_RECOVER    21
+#define M_TCB_NEWRENO_RECOVER    0x7ffffffULL
+#define V_TCB_NEWRENO_RECOVER(x) ((x) << S_TCB_NEWRENO_RECOVER)
+
+#define W_TCB_PDU_HAVE_LEN    24
+#define S_TCB_PDU_HAVE_LEN    16
+#define M_TCB_PDU_HAVE_LEN    0x1ULL
+#define V_TCB_PDU_HAVE_LEN(x) ((x) << S_TCB_PDU_HAVE_LEN)
+
+#define W_TCB_PDU_LEN    24
+#define S_TCB_PDU_LEN    17
+#define M_TCB_PDU_LEN    0xffffULL
+#define V_TCB_PDU_LEN(x) ((x) << S_TCB_PDU_LEN)
+
+#define W_TCB_RX_QUIESCE    25
+#define S_TCB_RX_QUIESCE    1
+#define M_TCB_RX_QUIESCE    0x1ULL
+#define V_TCB_RX_QUIESCE(x) ((x) << S_TCB_RX_QUIESCE)
+
+#define W_TCB_RX_PTR_RAW    25
+#define S_TCB_RX_PTR_RAW    2
+#define M_TCB_RX_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_PTR_RAW(x) ((x) << S_TCB_RX_PTR_RAW)
+
+#define W_TCB_CPU_NO    25
+#define S_TCB_CPU_NO    19
+#define M_TCB_CPU_NO    0x7fULL
+#define V_TCB_CPU_NO(x) ((x) << S_TCB_CPU_NO)
+
+#define W_TCB_ULP_TYPE    25
+#define S_TCB_ULP_TYPE    26
+#define M_TCB_ULP_TYPE    0xfULL
+#define V_TCB_ULP_TYPE(x) ((x) << S_TCB_ULP_TYPE)
+
+#define W_TCB_RX_FRAG1_PTR_RAW    25
+#define S_TCB_RX_FRAG1_PTR_RAW    30
+#define M_TCB_RX_FRAG1_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    26
+#define S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    15
+#define M_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW)
+
+#define W_TCB_RX_FRAG2_PTR_RAW    27
+#define S_TCB_RX_FRAG2_PTR_RAW    10
+#define M_TCB_RX_FRAG2_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG2_PTR_RAW(x) ((x) << S_TCB_RX_FRAG2_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_LEN_RAW    27
+#define S_TCB_RX_FRAG2_LEN_RAW    27
+#define M_TCB_RX_FRAG2_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_LEN_RAW(x) ((x) << S_TCB_RX_FRAG2_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_PTR_RAW    28
+#define S_TCB_RX_FRAG3_PTR_RAW    22
+#define M_TCB_RX_FRAG3_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG3_PTR_RAW(x) ((x) << S_TCB_RX_FRAG3_PTR_RAW)
+
+#define W_TCB_RX_FRAG3_LEN_RAW    29
+#define S_TCB_RX_FRAG3_LEN_RAW    7
+#define M_TCB_RX_FRAG3_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_LEN_RAW(x) ((x) << S_TCB_RX_FRAG3_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    30
+#define S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    2
+#define M_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW)
+
+#define W_TCB_PDU_HDR_LEN    30
+#define S_TCB_PDU_HDR_LEN    29
+#define M_TCB_PDU_HDR_LEN    0xffULL
+#define V_TCB_PDU_HDR_LEN(x) ((x) << S_TCB_PDU_HDR_LEN)
+
+#define W_TCB_SLUSH1    31
+#define S_TCB_SLUSH1    5
+#define M_TCB_SLUSH1    0x7ffffULL
+#define V_TCB_SLUSH1(x) ((x) << S_TCB_SLUSH1)
+
+#define W_TCB_ULP_RAW    31
+#define S_TCB_ULP_RAW    24
+#define M_TCB_ULP_RAW    0xffULL
+#define V_TCB_ULP_RAW(x) ((x) << S_TCB_ULP_RAW)
+
+#define W_TCB_DDP_RDMAP_VERSION    25
+#define S_TCB_DDP_RDMAP_VERSION    30
+#define M_TCB_DDP_RDMAP_VERSION    0x1ULL
+#define V_TCB_DDP_RDMAP_VERSION(x) ((x) << S_TCB_DDP_RDMAP_VERSION)
+
+#define W_TCB_MARKER_ENABLE_RX    25
+#define S_TCB_MARKER_ENABLE_RX    31
+#define M_TCB_MARKER_ENABLE_RX    0x1ULL
+#define V_TCB_MARKER_ENABLE_RX(x) ((x) << S_TCB_MARKER_ENABLE_RX)
+
+#define W_TCB_MARKER_ENABLE_TX    26
+#define S_TCB_MARKER_ENABLE_TX    0
+#define M_TCB_MARKER_ENABLE_TX    0x1ULL
+#define V_TCB_MARKER_ENABLE_TX(x) ((x) << S_TCB_MARKER_ENABLE_TX)
+
+#define W_TCB_CRC_ENABLE    26
+#define S_TCB_CRC_ENABLE    1
+#define M_TCB_CRC_ENABLE    0x1ULL
+#define V_TCB_CRC_ENABLE(x) ((x) << S_TCB_CRC_ENABLE)
+
+#define W_TCB_IRS_ULP    26
+#define S_TCB_IRS_ULP    2
+#define M_TCB_IRS_ULP    0x1ffULL
+#define V_TCB_IRS_ULP(x) ((x) << S_TCB_IRS_ULP)
+
+#define W_TCB_ISS_ULP    26
+#define S_TCB_ISS_ULP    11
+#define M_TCB_ISS_ULP    0x1ffULL
+#define V_TCB_ISS_ULP(x) ((x) << S_TCB_ISS_ULP)
+
+#define W_TCB_TX_PDU_LEN    26
+#define S_TCB_TX_PDU_LEN    20
+#define M_TCB_TX_PDU_LEN    0x3fffULL
+#define V_TCB_TX_PDU_LEN(x) ((x) << S_TCB_TX_PDU_LEN)
+
+#define W_TCB_TX_PDU_OUT    27
+#define S_TCB_TX_PDU_OUT    2
+#define M_TCB_TX_PDU_OUT    0x1ULL
+#define V_TCB_TX_PDU_OUT(x) ((x) << S_TCB_TX_PDU_OUT)
+
+#define W_TCB_CQ_IDX_SQ    27
+#define S_TCB_CQ_IDX_SQ    3
+#define M_TCB_CQ_IDX_SQ    0xffffULL
+#define V_TCB_CQ_IDX_SQ(x) ((x) << S_TCB_CQ_IDX_SQ)
+
+#define W_TCB_CQ_IDX_RQ    27
+#define S_TCB_CQ_IDX_RQ    19
+#define M_TCB_CQ_IDX_RQ    0xffffULL
+#define V_TCB_CQ_IDX_RQ(x) ((x) << S_TCB_CQ_IDX_RQ)
+
+#define W_TCB_QP_ID    28
+#define S_TCB_QP_ID    3
+#define M_TCB_QP_ID    0xffffULL
+#define V_TCB_QP_ID(x) ((x) << S_TCB_QP_ID)
+
+#define W_TCB_PD_ID    28
+#define S_TCB_PD_ID    19
+#define M_TCB_PD_ID    0xffffULL
+#define V_TCB_PD_ID(x) ((x) << S_TCB_PD_ID)
+
+#define W_TCB_STAG    29
+#define S_TCB_STAG    3
+#define M_TCB_STAG    0xffffffffULL
+#define V_TCB_STAG(x) ((x) << S_TCB_STAG)
+
+#define W_TCB_RQ_START    30
+#define S_TCB_RQ_START    3
+#define M_TCB_RQ_START    0x3ffffffULL
+#define V_TCB_RQ_START(x) ((x) << S_TCB_RQ_START)
+
+#define W_TCB_RQ_MSN    30
+#define S_TCB_RQ_MSN    29
+#define M_TCB_RQ_MSN    0x3ffULL
+#define V_TCB_RQ_MSN(x) ((x) << S_TCB_RQ_MSN)
+
+#define W_TCB_RQ_MAX_OFFSET    31
+#define S_TCB_RQ_MAX_OFFSET    7
+#define M_TCB_RQ_MAX_OFFSET    0xfULL
+#define V_TCB_RQ_MAX_OFFSET(x) ((x) << S_TCB_RQ_MAX_OFFSET)
+
+#define W_TCB_RQ_WRITE_PTR    31
+#define S_TCB_RQ_WRITE_PTR    11
+#define M_TCB_RQ_WRITE_PTR    0x3ffULL
+#define V_TCB_RQ_WRITE_PTR(x) ((x) << S_TCB_RQ_WRITE_PTR)
+
+#define W_TCB_INB_WRITE_PERM    31
+#define S_TCB_INB_WRITE_PERM    21
+#define M_TCB_INB_WRITE_PERM    0x1ULL
+#define V_TCB_INB_WRITE_PERM(x) ((x) << S_TCB_INB_WRITE_PERM)
+
+#define W_TCB_INB_READ_PERM    31
+#define S_TCB_INB_READ_PERM    22
+#define M_TCB_INB_READ_PERM    0x1ULL
+#define V_TCB_INB_READ_PERM(x) ((x) << S_TCB_INB_READ_PERM)
+
+#define W_TCB_ORD_L_BIT_VLD    31
+#define S_TCB_ORD_L_BIT_VLD    23
+#define M_TCB_ORD_L_BIT_VLD    0x1ULL
+#define V_TCB_ORD_L_BIT_VLD(x) ((x) << S_TCB_ORD_L_BIT_VLD)
+
+#define W_TCB_RDMAP_OPCODE    31
+#define S_TCB_RDMAP_OPCODE    24
+#define M_TCB_RDMAP_OPCODE    0xfULL
+#define V_TCB_RDMAP_OPCODE(x) ((x) << S_TCB_RDMAP_OPCODE)
+
+#define W_TCB_TX_FLUSH    31
+#define S_TCB_TX_FLUSH    28
+#define M_TCB_TX_FLUSH    0x1ULL
+#define V_TCB_TX_FLUSH(x) ((x) << S_TCB_TX_FLUSH)
+
+#define W_TCB_TX_OOS_RXMT    31
+#define S_TCB_TX_OOS_RXMT    29
+#define M_TCB_TX_OOS_RXMT    0x1ULL
+#define V_TCB_TX_OOS_RXMT(x) ((x) << S_TCB_TX_OOS_RXMT)
+
+#define W_TCB_TX_OOS_TXMT    31
+#define S_TCB_TX_OOS_TXMT    30
+#define M_TCB_TX_OOS_TXMT    0x1ULL
+#define V_TCB_TX_OOS_TXMT(x) ((x) << S_TCB_TX_OOS_TXMT)
+
+#define W_TCB_SLUSH_AUX2    31
+#define S_TCB_SLUSH_AUX2    31
+#define M_TCB_SLUSH_AUX2    0x1ULL
+#define V_TCB_SLUSH_AUX2(x) ((x) << S_TCB_SLUSH_AUX2)
+
+#define W_TCB_RX_FRAG1_PTR_RAW2    25
+#define S_TCB_RX_FRAG1_PTR_RAW2    30
+#define M_TCB_RX_FRAG1_PTR_RAW2    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW2(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW2)
+
+#define W_TCB_RX_DDP_FLAGS    26
+#define S_TCB_RX_DDP_FLAGS    15
+#define M_TCB_RX_DDP_FLAGS    0x3ffULL
+#define V_TCB_RX_DDP_FLAGS(x) ((x) << S_TCB_RX_DDP_FLAGS)
+
+#define W_TCB_SLUSH_AUX3    26
+#define S_TCB_SLUSH_AUX3    31
+#define M_TCB_SLUSH_AUX3    0x1ffULL
+#define V_TCB_SLUSH_AUX3(x) ((x) << S_TCB_SLUSH_AUX3)
+
+#define W_TCB_RX_DDP_BUF0_OFFSET    27
+#define S_TCB_RX_DDP_BUF0_OFFSET    8
+#define M_TCB_RX_DDP_BUF0_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF0_OFFSET)
+
+#define W_TCB_RX_DDP_BUF0_LEN    27
+#define S_TCB_RX_DDP_BUF0_LEN    30
+#define M_TCB_RX_DDP_BUF0_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_LEN(x) ((x) << S_TCB_RX_DDP_BUF0_LEN)
+
+#define W_TCB_RX_DDP_BUF1_OFFSET    28
+#define S_TCB_RX_DDP_BUF1_OFFSET    20
+#define M_TCB_RX_DDP_BUF1_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF1_OFFSET)
+
+#define W_TCB_RX_DDP_BUF1_LEN    29
+#define S_TCB_RX_DDP_BUF1_LEN    10
+#define M_TCB_RX_DDP_BUF1_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_LEN(x) ((x) << S_TCB_RX_DDP_BUF1_LEN)
+
+#define W_TCB_RX_DDP_BUF0_TAG    30
+#define S_TCB_RX_DDP_BUF0_TAG    0
+#define M_TCB_RX_DDP_BUF0_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF0_TAG(x) ((x) << S_TCB_RX_DDP_BUF0_TAG)
+
+#define W_TCB_RX_DDP_BUF1_TAG    31
+#define S_TCB_RX_DDP_BUF1_TAG    0
+#define M_TCB_RX_DDP_BUF1_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF1_TAG(x) ((x) << S_TCB_RX_DDP_BUF1_TAG)
+
+#define S_TF_DACK    10
+#define V_TF_DACK(x) ((x) << S_TF_DACK)
+
+#define S_TF_NAGLE    11
+#define V_TF_NAGLE(x) ((x) << S_TF_NAGLE)
+
+#define S_TF_RECV_SCALE    12
+#define V_TF_RECV_SCALE(x) ((x) << S_TF_RECV_SCALE)
+
+#define S_TF_RECV_TSTMP    13
+#define V_TF_RECV_TSTMP(x) ((x) << S_TF_RECV_TSTMP)
+
+#define S_TF_RECV_SACK    14
+#define V_TF_RECV_SACK(x) ((x) << S_TF_RECV_SACK)
+
+#define S_TF_TURBO    15
+#define V_TF_TURBO(x) ((x) << S_TF_TURBO)
+
+#define S_TF_KEEPALIVE    16
+#define V_TF_KEEPALIVE(x) ((x) << S_TF_KEEPALIVE)
+
+#define S_TF_TCAM_BYPASS    17
+#define V_TF_TCAM_BYPASS(x) ((x) << S_TF_TCAM_BYPASS)
+
+#define S_TF_CORE_FIN    18
+#define V_TF_CORE_FIN(x) ((x) << S_TF_CORE_FIN)
+
+#define S_TF_CORE_MORE    19
+#define V_TF_CORE_MORE(x) ((x) << S_TF_CORE_MORE)
+
+#define S_TF_MIGRATING    20
+#define V_TF_MIGRATING(x) ((x) << S_TF_MIGRATING)
+
+#define S_TF_ACTIVE_OPEN    21
+#define V_TF_ACTIVE_OPEN(x) ((x) << S_TF_ACTIVE_OPEN)
+
+#define S_TF_ASK_MODE    22
+#define V_TF_ASK_MODE(x) ((x) << S_TF_ASK_MODE)
+
+#define S_TF_NON_OFFLOAD    23
+#define V_TF_NON_OFFLOAD(x) ((x) << S_TF_NON_OFFLOAD)
+
+#define S_TF_MOD_SCHD    24
+#define V_TF_MOD_SCHD(x) ((x) << S_TF_MOD_SCHD)
+
+#define S_TF_MOD_SCHD_REASON0    25
+#define V_TF_MOD_SCHD_REASON0(x) ((x) << S_TF_MOD_SCHD_REASON0)
+
+#define S_TF_MOD_SCHD_REASON1    26
+#define V_TF_MOD_SCHD_REASON1(x) ((x) << S_TF_MOD_SCHD_REASON1)
+
+#define S_TF_MOD_SCHD_RX    27
+#define V_TF_MOD_SCHD_RX(x) ((x) << S_TF_MOD_SCHD_RX)
+
+#define S_TF_CORE_PUSH    28
+#define V_TF_CORE_PUSH(x) ((x) << S_TF_CORE_PUSH)
+
+#define S_TF_RCV_COALESCE_ENABLE    29
+#define V_TF_RCV_COALESCE_ENABLE(x) ((x) << S_TF_RCV_COALESCE_ENABLE)
+
+#define S_TF_RCV_COALESCE_PUSH    30
+#define V_TF_RCV_COALESCE_PUSH(x) ((x) << S_TF_RCV_COALESCE_PUSH)
+
+#define S_TF_RCV_COALESCE_LAST_PSH    31
+#define V_TF_RCV_COALESCE_LAST_PSH(x) ((x) << S_TF_RCV_COALESCE_LAST_PSH)
+
+#define S_TF_RCV_COALESCE_HEARTBEAT    32
+#define V_TF_RCV_COALESCE_HEARTBEAT(x) ((x) << S_TF_RCV_COALESCE_HEARTBEAT)
+
+#define S_TF_HALF_CLOSE    33
+#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE)
+
+#define S_TF_DACK_MSS    34
+#define V_TF_DACK_MSS(x) ((x) << S_TF_DACK_MSS)
+
+#define S_TF_CCTRL_SEL0    35
+#define V_TF_CCTRL_SEL0(x) ((x) << S_TF_CCTRL_SEL0)
+
+#define S_TF_CCTRL_SEL1    36
+#define V_TF_CCTRL_SEL1(x) ((x) << S_TF_CCTRL_SEL1)
+
+#define S_TF_TCP_NEWRENO_FAST_RECOVERY    37
+#define V_TF_TCP_NEWRENO_FAST_RECOVERY(x) ((x) << S_TF_TCP_NEWRENO_FAST_RECOVERY)
+
+#define S_TF_TX_PACE_AUTO    38
+#define V_TF_TX_PACE_AUTO(x) ((x) << S_TF_TX_PACE_AUTO)
+
+#define S_TF_PEER_FIN_HELD    39
+#define V_TF_PEER_FIN_HELD(x) ((x) << S_TF_PEER_FIN_HELD)
+
+#define S_TF_CORE_URG    40
+#define V_TF_CORE_URG(x) ((x) << S_TF_CORE_URG)
+
+#define S_TF_RDMA_ERROR    41
+#define V_TF_RDMA_ERROR(x) ((x) << S_TF_RDMA_ERROR)
+
+#define S_TF_SSWS_DISABLED    42
+#define V_TF_SSWS_DISABLED(x) ((x) << S_TF_SSWS_DISABLED)
+
+#define S_TF_DUPACK_COUNT_ODD    43
+#define V_TF_DUPACK_COUNT_ODD(x) ((x) << S_TF_DUPACK_COUNT_ODD)
+
+#define S_TF_TX_CHANNEL    44
+#define V_TF_TX_CHANNEL(x) ((x) << S_TF_TX_CHANNEL)
+
+#define S_TF_RX_CHANNEL    45
+#define V_TF_RX_CHANNEL(x) ((x) << S_TF_RX_CHANNEL)
+
+#define S_TF_TX_PACE_FIXED    46
+#define V_TF_TX_PACE_FIXED(x) ((x) << S_TF_TX_PACE_FIXED)
+
+#define S_TF_RDMA_FLM_ERROR    47
+#define V_TF_RDMA_FLM_ERROR(x) ((x) << S_TF_RDMA_FLM_ERROR)
+
+#define S_TF_RX_FLOW_CONTROL_DISABLE    48
+#define V_TF_RX_FLOW_CONTROL_DISABLE(x) ((x) << S_TF_RX_FLOW_CONTROL_DISABLE)
+
+#endif /* _TCB_DEFS_H */
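
The W_/S_/M_/V_ macro families above follow one pattern per TCB field. W_ appears to be the index of the 32-bit TCB word where the field starts, S_ the bit shift within that word, M_ the unshifted field mask, and V_() places a value at the shift. Below is a minimal sketch of how a mask/value pair for one field might be assembled, assuming that reading of the prefixes. struct tcb_update, make_tcb_update() and apply_tcb_update() are hypothetical names used for illustration and are not part of the patch.

```c
#include <stdint.h>

/* Copied from the header above. */
#define W_TCB_TX_PDU_LEN    26
#define S_TCB_TX_PDU_LEN    20
#define M_TCB_TX_PDU_LEN    0x3fffULL
#define V_TCB_TX_PDU_LEN(x) ((x) << S_TCB_TX_PDU_LEN)

#define S_TF_HALF_CLOSE    33
#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE)

/* Hypothetical container for one read-modify-write of a TCB field. */
struct tcb_update {
	unsigned int word;	/* starting 32-bit word (W_*) */
	uint64_t mask;		/* bits to replace, already shifted */
	uint64_t val;		/* new bits, already shifted */
};

static struct tcb_update make_tcb_update(unsigned int word, unsigned int shift,
					 uint64_t field_mask, uint64_t value)
{
	struct tcb_update u;

	u.word = word;
	u.mask = field_mask << shift;
	u.val = (value & field_mask) << shift;	/* drop bits outside the field */
	return u;
}

/* Apply an update to a 64-bit window that starts at word u->word. */
static uint64_t apply_tcb_update(uint64_t old, const struct tcb_update *u)
{
	return (old & ~u->mask) | u->val;
}
```

For example, make_tcb_update(W_TCB_TX_PDU_LEN, S_TCB_TX_PDU_LEN, M_TCB_TX_PDU_LEN, 0x123) yields mask 0x3fff00000 and value 0x12300000. Several fields and flags reach bit 32 or above (a 14-bit field at shift 20 ends at bit 33), so the V_ macros need a 64-bit operand: V_TF_HALF_CLOSE(1ULL) is well defined, whereas V_TF_HALF_CLOSE(1) shifts a 32-bit int by 33 bits, which is undefined behavior in C.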


From swise at opengridcomputing.com  Thu Dec 14 05:55:06 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:55:06 -0600
Subject: [openib-general] [PATCH  v4 05/13] Queue Pairs
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135506.21159.2723.stgit@dell3.ogc.int>


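Code to manipulate the QP: build and post send, receive, and bind work
requests, post TERMINATE messages, flush the queues, and drive QP state
transitions.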

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
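Both iwch_build_rdma_send() and iwch_build_rdma_write() reject a work request whose scatter/gather lengths would wrap the 32-bit payload length, using the (plen + length) < plen test. Below is a minimal userspace sketch of that check, assuming ordinary unsigned wraparound. sum_sge_lengths() is a hypothetical helper used for illustration and is not part of the patch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum num lengths into *plen.  Returns 0 on success, or -EMSGSIZE if
 * the 32-bit total would wrap, mirroring the test in the send and
 * write builders.  *plen is only written on success.
 */
static int sum_sge_lengths(const uint32_t *len, size_t num, uint32_t *plen)
{
	uint32_t total = 0;
	size_t i;

	for (i = 0; i < num; i++) {
		/* unsigned addition wraps, so a smaller sum means overflow */
		if ((uint32_t)(total + len[i]) < total)
			return -EMSGSIZE;
		total += len[i];
	}
	*plen = total;
	return 0;
}
```

Two lengths of 0xffffffff and 1 return -EMSGSIZE and leave *plen untouched; lengths of 100, 200 and 300 return 0 with *plen set to 600.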

 drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++
 1 files changed, 1007 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
new file mode 100644
index 0000000..9f6b251
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -0,0 +1,1007 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+
+#define NO_SUPPORT -1
+
+static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 * flit_cnt)
+{
+	int i;
+	u32 plen;
+
+	switch (wr->opcode) {
+	case IB_WR_SEND:
+	case IB_WR_SEND_WITH_IMM:
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			wqe->send.rdmaop = T3_SEND_WITH_SE;
+		else
+			wqe->send.rdmaop = T3_SEND;
+		wqe->send.rem_stag = 0;
+		break;
+#if 0				/* Not currently supported */
+	case TYPE_SEND_INVALIDATE:
+	case TYPE_SEND_INVALIDATE_IMMEDIATE:
+		wqe->send.rdmaop = T3_SEND_WITH_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+	case TYPE_SEND_SE_INVALIDATE:
+		wqe->send.rdmaop = T3_SEND_WITH_SE_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+#endif
+	default:
+		break;
+	}
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->send.reserved[0] = 0;
+	wqe->send.reserved[1] = 0;
+	wqe->send.reserved[2] = 0;
+	if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+		plen = 4;
+		wqe->send.sgl[0].stag = wr->imm_data;
+		wqe->send.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->send.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 5;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->send.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->send.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->send.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 4 + ((wr->num_sge) << 1);
+	}
+	wqe->send.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr,
+					u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->write.rdmaop = T3_RDMA_WRITE;
+	wqe->write.reserved[0] = 0;
+	wqe->write.reserved[1] = 0;
+	wqe->write.reserved[2] = 0;
+	wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+	if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+		plen = 4;
+		wqe->write.sgl[0].stag = wr->imm_data;
+		wqe->write.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->write.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 6;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->write.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->write.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->write.sgl[i].to =
+			    cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 5 + ((wr->num_sge) << 1);
+	}
+	wqe->write.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	if (wr->num_sge > 1)
+		return -EINVAL;
+	wqe->read.rdmaop = T3_READ_REQ;
+	wqe->read.reserved[0] = 0;
+	wqe->read.reserved[1] = 0;
+	wqe->read.reserved[2] = 0;
+	wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+	wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+	wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+	wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+	*flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+	return 0;
+}
+
+/* 
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */
+static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp,
+				   struct ib_sge *sg_list, u32 num_sgle,
+				   u32 * pbl_addr, u8 * page_size)
+{
+	int i;
+	struct iwch_mr *mhp;
+	u32 offset;
+	for (i = 0; i < num_sgle; i++) {
+
+		mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8);
+		if (!mhp) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (!mhp->attr.state) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (mhp->attr.zbva) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+
+		if (sg_list[i].addr < mhp->attr.va_fbo) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) <
+		    sg_list[i].addr) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) >
+		    mhp->attr.va_fbo + ((u64) mhp->attr.len)) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		offset = sg_list[i].addr - mhp->attr.va_fbo;
+		offset += ((u32) mhp->attr.va_fbo) %
+		          (1UL << (12 + mhp->attr.page_size));
+		pbl_addr[i] = ((mhp->attr.pbl_addr - 
+			        rhp->rdev.rnic_info.pbl_base) >> 3) +
+			      (offset >> (12 + mhp->attr.page_size));
+		page_size[i] = mhp->attr.page_size;
+	}
+	return 0;
+}
+
+static inline int iwch_build_rdma_recv(struct iwch_dev *rhp,
+						    union t3_wr *wqe,
+						    struct ib_recv_wr *wr)
+{
+	int i, err = 0;
+	u32 pbl_addr[4];
+	u8 page_size[4] = { 0 };
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr, 
+			       page_size);
+	if (err)
+		return err;
+	wqe->recv.pagesz[0] = page_size[0];
+	wqe->recv.pagesz[1] = page_size[1];
+	wqe->recv.pagesz[2] = page_size[2];
+	wqe->recv.pagesz[3] = page_size[3];
+	wqe->recv.num_sgle = cpu_to_be32(wr->num_sge);
+	for (i = 0; i < wr->num_sge; i++) {
+		wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey);
+		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
+		
+		/* "to" in the WQE is the offset into the page */
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
+				(1UL << (12 + page_size[i])));
+
+		/* pbl_addr is the adapter's address in the PBL */
+		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
+	}
+	for (; i < T3_MAX_SGE; i++) {
+		wqe->recv.sgl[i].stag = 0;
+		wqe->recv.sgl[i].len = 0;
+		wqe->recv.sgl[i].to = 0;
+		wqe->recv.pbl_addr[i] = 0;
+	}
+	return 0;
+}
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr)
+{
+	int err = 0;
+	u8 t3_wr_flit_cnt;
+	enum t3_wr_opcode t3_wr_opcode = 0;
+	enum t3_wr_flags t3_wr_flags;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+		  qhp->wq.sq_size_log2);
+	if (num_wrs == 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	while (wr) {
+		if (num_wrs == 0) {
+			err = -ENOMEM;
+			*bad_wr = wr;
+			break;
+		}
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		t3_wr_flags = 0;
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			t3_wr_flags |= T3_SOLICITED_EVENT_FLAG;
+		if (wr->send_flags & IB_SEND_FENCE)
+			t3_wr_flags |= T3_READ_FENCE_FLAG;
+		if (wr->send_flags & IB_SEND_SIGNALED)
+			t3_wr_flags |= T3_COMPLETION_FLAG;
+		sqp = qhp->wq.sq + 
+		      Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+		switch (wr->opcode) {
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_IMM:
+			t3_wr_opcode = T3_WR_SEND;
+			err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			t3_wr_opcode = T3_WR_WRITE;
+			err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_READ:
+			t3_wr_opcode = T3_WR_READ;
+			t3_wr_flags = 0; /* T3 reads are always signaled */
+			err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt);
+			if (err) 
+				break;
+			sqp->read_len = wqe->read.local_len;
+			if (!qhp->wq.oldest_read)
+				qhp->wq.oldest_read = sqp;
+			break;
+		default:
+			PDBG("%s post of type=%d TBD!\n", __FUNCTION__,
+			     wr->opcode);
+			err = -EINVAL;
+		}
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+		sqp->wr_id = wr->wr_id;
+		sqp->opcode = wr2opcode(t3_wr_opcode);
+		sqp->sq_wptr = qhp->wq.sq_wptr;
+		sqp->complete = 0;
+		sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+
+		build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, t3_wr_flit_cnt);
+		PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n", 
+		     __FUNCTION__, wr->wr_id, idx, 
+		     Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2),
+		     sqp->opcode);
+		wr = wr->next;
+		num_wrs--;
+		++(qhp->wq.wptr);
+		++(qhp->wq.sq_wptr);
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr)
+{
+	int err = 0;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr, 
+			    qhp->wq.rq_size_log2) - 1;
+	if (!wr) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	while (wr) {
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		if (num_wrs)
+			err = iwch_build_rdma_recv(qhp->rhp, wqe, wr);
+		else
+			err = -ENOMEM;
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] = 
+			wr->wr_id;
+		build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, sizeof(struct t3_receive_wr) >> 3);
+		PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x "
+		     "wqe %p\n", __FUNCTION__, wr->wr_id, idx,
+		     qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe);
+		++(qhp->wq.rq_wptr);
+		++(qhp->wq.wptr);
+		wr = wr->next;
+		num_wrs--;
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	struct iwch_qp *qhp;
+	union t3_wr *wqe;
+	u32 pbl_addr;
+	u8 page_size;
+	u32 num_wrs;
+	unsigned long flag;
+	struct ib_sge sgl;
+	int err = 0;
+	enum t3_wr_flags t3_wr_flags;
+	u32 idx;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(qp);
+	mhp = to_iwch_mw(mw);
+	rhp = qhp->rhp;
+
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr, 
+			    qhp->wq.sq_size_log2);
+	if (num_wrs == 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+	PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx, 
+	     mw, mw_bind);
+	wqe = (union t3_wr *) (qhp->wq.queue + idx);
+
+	t3_wr_flags = 0;
+	if (mw_bind->send_flags & IB_SEND_SIGNALED)
+		t3_wr_flags = T3_COMPLETION_FLAG;
+
+	sgl.addr = mw_bind->addr;
+	sgl.lkey = mw_bind->mr->lkey;
+	sgl.length = mw_bind->length;
+	wqe->bind.reserved = 0;
+	wqe->bind.type = T3_VA_BASED_TO;
+
+	/* TBD: check perms */
+	wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags);
+	wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey);
+	wqe->bind.mw_stag = cpu_to_be32(mw->rkey);
+	wqe->bind.mw_len = cpu_to_be32(mw_bind->length);
+	wqe->bind.mw_va = cpu_to_be64(mw_bind->addr);
+	err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size);
+	if (err) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return err;
+	}
+	wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+	sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+	sqp->wr_id = mw_bind->wr_id;
+	sqp->opcode = T3_BIND_MW;
+	sqp->sq_wptr = qhp->wq.sq_wptr;
+	sqp->complete = 0;
+	sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED);
+	wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr);
+	wqe->bind.mr_pagesz = page_size;
+	wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id;
+	build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags,
+		       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0, 
+			        sizeof(struct t3_bind_mw_wr) >> 3);
+	++(qhp->wq.wptr);
+	++(qhp->wq.sq_wptr);
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+
+	return err;
+}
+
+static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode,
+				    int tagged)
+{
+	switch (t3err) {
+	case TPT_ERR_STAG:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_STAG;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_INV_STAG;
+		}
+		break;
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_STAG_NOT_ASSOC;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_STAG_NOT_ASSOC;
+		}
+		break;
+	case TPT_ERR_WRAP:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+		*ecode = RDMAP_TO_WRAP;
+		break;
+	case TPT_ERR_BOUND:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_BASE_BOUNDS;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_BASE_BOUNDS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_MSG_TOOBIG;
+		}
+		break;
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_CANT_INV_STAG;
+		break;
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		*layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_OUT_OF_RQE:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_NOBUF;
+		break;
+	case TPT_ERR_PBL_ADDR_BOUND:
+		*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+		*ecode = DDPT_BASE_BOUNDS;
+		break;
+	case TPT_ERR_CRC:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_CRC_ERR;
+		break;
+	case TPT_ERR_MARKER:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_MARKER_ERR;
+		break;
+	case TPT_ERR_PDU_LEN_ERR:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_MSG_TOOBIG;
+		break;
+	case TPT_ERR_DDP_VERSION:
+		if (tagged) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_VERS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_INV_VERS;
+		}
+		break;
+	case TPT_ERR_RDMA_VERSION:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_VERS;
+		break;
+	case TPT_ERR_OPCODE:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_OPCODE;
+		break;
+	case TPT_ERR_DDP_QUEUE_NUM:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_QN;
+		break;
+	case TPT_ERR_MSN:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_IRD_OVERFLOW:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_RANGE;
+		break;
+	case TPT_ERR_TBIT:
+		*layer_type = LAYER_DDP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_MO:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MO;
+		break;
+	default: 
+		*layer_type = LAYER_RDMAP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	}
+}
+
+/*
+ * Post a TERMINATE whose layer, type, and error code follow the CQE status.
+ */
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
+{
+	union t3_wr *wqe;
+	struct terminate_message *term;
+	int status;
+	int tagged = 0;
+	struct sk_buff *skb;
+
+	PDBG("%s %d\n", __FUNCTION__, __LINE__);
+	skb = alloc_skb(40, GFP_ATOMIC);
+	if (!skb) {
+		printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (union t3_wr *)skb_put(skb, 40);
+	memset(wqe, 0, 40);
+	wqe->send.rdmaop = T3_TERMINATE;
+	
+	/* immediate data length */
+	wqe->send.plen = htonl(4);
+
+	/* immediate data starts here. */
+	term = (struct terminate_message *)wqe->send.sgl;
+	if (rsp_msg) {
+		status = CQE_STATUS(rsp_msg->cqe);
+		if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)
+			tagged = 1;
+		if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) ||
+		    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP))
+			tagged = 2;
+	} else {
+		status = TPT_ERR_INTERNAL_ERR;
+	}
+	build_term_codes(status, &term->layer_etype, &term->ecode, tagged);
+	build_fw_riwrh((void *)wqe, T3_WR_SEND, 
+		       T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1, 
+		       qhp->ep->hwtid, 5);
+	skb->priority = CPL_PRIORITY_DATA;
+	return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb));
+}
+
+/*
+ * Assumes qhp lock is held.
+ */
+static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	struct iwch_cq *rchp, *schp;
+	int count;
+
+	rchp = get_chp(qhp->rhp, qhp->attr.rcq);
+	schp = get_chp(qhp->rhp, qhp->attr.scq);
+	
+	PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp);
+	/* take a ref on the qhp since we must release the lock */
+	atomic_inc(&qhp->refcnt);
+	spin_unlock_irqrestore(&qhp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&rchp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&rchp->cq);
+	cxio_count_rcqes(&rchp->cq, &qhp->wq, &count);
+	cxio_flush_rq(&qhp->wq, &rchp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&rchp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&schp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&schp->cq);
+	cxio_count_scqes(&schp->cq, &qhp->wq, &count);
+	cxio_flush_sq(&qhp->wq, &schp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&schp->lock, *flag);
+
+	/* deref */
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+
+	spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	if (t3b_device(qhp->rhp))
+		cxio_set_wq_in_error(&qhp->wq);
+	else
+		__flush_qp(qhp, flag);
+}
+
+
+/* 
+ * Return nonzero if at least one RECV was pre-posted.
+ */
+static inline int rqes_posted(struct iwch_qp *qhp)
+{ 
+	return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs)
+{
+	struct t3_rdma_init_attr init_attr;
+	int ret;
+
+	init_attr.tid = qhp->ep->hwtid;
+	init_attr.qpid = qhp->wq.qpid;
+	init_attr.pdid = qhp->attr.pd;
+	init_attr.scqid = qhp->attr.scq;
+	init_attr.rcqid = qhp->attr.rcq;
+	init_attr.rq_addr = qhp->wq.rq_addr;
+	init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+	init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE | 
+		qhp->attr.mpa_attr.recv_marker_enabled |
+		(qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+		(qhp->attr.mpa_attr.crc_enabled << 2);
+
+	/* 
+	 * XXX - The IWCM doesn't quite handle getting these
+	 * attrs set before going into RTS.  For now, just turn
+	 * them on always...
+	 */
+#if 0
+	init_attr.qpcaps = qhp->attr.enable_rdma_read |
+		(qhp->attr.enable_rdma_write << 1) |
+		(qhp->attr.enable_bind << 2) |
+		(qhp->attr.enable_stag0_fastreg << 3) |
+		(qhp->attr.enable_stag0_fastreg << 4);
+#else
+	init_attr.qpcaps = 0x1f;
+#endif
+	init_attr.tcp_emss = qhp->ep->emss;
+	init_attr.ord = qhp->attr.max_ord;
+	init_attr.ird = qhp->attr.max_ird;
+	init_attr.qp_dma_addr = qhp->wq.dma_addr;
+	init_attr.qp_dma_size = (1UL << qhp->wq.size_log2);
+	init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0;
+	PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d "
+	     "flags 0x%x qpcaps 0x%x\n", __FUNCTION__, 
+	     init_attr.rq_addr, init_attr.rq_size, 
+	     init_attr.flags, init_attr.qpcaps);
+	ret = cxio_rdma_init(&rhp->rdev, &init_attr);
+	PDBG("%s ret %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal)
+{
+	int ret = 0;
+	struct iwch_qp_attributes newattr = qhp->attr;
+	unsigned long flag;
+	int disconnect = 0;
+	int terminate = 0;
+	int abort = 0;
+	int free = 0;
+	struct iwch_ep *ep = NULL;
+
+	PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__, 
+	     qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state, 
+	     (mask & IWCH_QP_ATTR_NEXT_STATE) ? attrs->next_state : -1);
+
+	spin_lock_irqsave(&qhp->lock, flag);
+
+	/* Process attr changes if in IDLE */
+	if (mask & IWCH_QP_ATTR_VALID_MODIFY) {
+		if (qhp->attr.state != IWCH_QP_STATE_IDLE) {
+			ret = -EIO;
+			goto out;
+		}
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ)
+			newattr.enable_rdma_read = attrs->enable_rdma_read;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE)
+			newattr.enable_rdma_write = attrs->enable_rdma_write;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND)
+			newattr.enable_bind = attrs->enable_bind;
+		if (mask & IWCH_QP_ATTR_MAX_ORD) {
+			if (attrs->max_ord > 
+			    rhp->attr.max_rdma_read_qp_depth) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ord = attrs->max_ord;
+		}
+		if (mask & IWCH_QP_ATTR_MAX_IRD) {
+			if (attrs->max_ird > 
+		  	    rhp->attr.max_rdma_reads_per_qp) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ird = attrs->max_ird;
+		}
+		qhp->attr = newattr;
+	}
+	
+	if (!(mask & IWCH_QP_ATTR_NEXT_STATE)) 
+		goto out;
+	if (qhp->attr.state == attrs->next_state)
+		goto out;
+
+	switch (qhp->attr.state) {
+	case IWCH_QP_STATE_IDLE:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_RTS: 
+			if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			qhp->attr.mpa_attr = attrs->mpa_attr;
+			qhp->attr.llp_stream_handle = attrs->llp_stream_handle;
+			qhp->ep = qhp->attr.llp_stream_handle;
+			qhp->attr.state = IWCH_QP_STATE_RTS;
+
+			/*
+			 * Ref the endpoint here and deref when we
+			 * disassociate the endpoint from the QP.  This
+			 * happens in CLOSING->IDLE transition or *->ERROR
+			 * transition.
+			 */
+			get_ep(&qhp->ep->com);
+			spin_unlock_irqrestore(&qhp->lock, flag);
+			ret = rdma_init(rhp, qhp, mask, attrs);
+			spin_lock_irqsave(&qhp->lock, flag);
+			if (ret)
+				goto err;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			flush_qp(qhp, &flag);
+			break;
+		default:
+			ret = -EINVAL;	
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_RTS:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_CLOSING:
+			BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+			qhp->attr.state = IWCH_QP_STATE_CLOSING;
+			if (!internal) {
+				abort = 0;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			break;
+		case IWCH_QP_STATE_TERMINATE:
+			qhp->attr.state = IWCH_QP_STATE_TERMINATE;
+			if (!internal) 
+				terminate = 1;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			if (!internal) {
+				abort = 1;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			goto err;
+			break;
+		default:
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_CLOSING:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_IDLE:
+			qhp->attr.state = IWCH_QP_STATE_IDLE;
+			qhp->attr.llp_stream_handle = NULL;
+			put_ep(&qhp->ep->com);
+			qhp->ep = NULL;
+			wake_up(&qhp->wait);
+			break;
+		case IWCH_QP_STATE_ERROR:
+			goto err;
+		default:
+			ret = -EINVAL;
+			goto err;
+		}
+		break;
+	case IWCH_QP_STATE_ERROR:
+		if (attrs->next_state != IWCH_QP_STATE_IDLE) {
+			ret = -EINVAL;
+			goto out;
+		}
+		
+		if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) || 
+		    !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		qhp->attr.state = IWCH_QP_STATE_IDLE;
+		memset(&qhp->attr, 0, sizeof(qhp->attr));
+		break;
+	case IWCH_QP_STATE_TERMINATE:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		goto err;
+		break;
+	default:
+		printk(KERN_ERR "%s in a bad state %d\n", 
+		       __FUNCTION__, qhp->attr.state);
+		ret = -EINVAL;
+		goto err;
+		break;
+	}
+	goto out;
+err:
+	PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep, 
+	     qhp->wq.qpid);
+
+	/* disassociate the LLP connection */
+	qhp->attr.llp_stream_handle = NULL;
+	ep = qhp->ep;
+	qhp->ep = NULL;
+	qhp->attr.state = IWCH_QP_STATE_ERROR;
+	free = 1;
+	wake_up(&qhp->wait);
+	BUG_ON(!ep);
+	flush_qp(qhp, &flag);
+out:
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	if (terminate)
+		iwch_post_terminate(qhp, NULL);
+
+	/*
+	 * If disconnect is 1, then we need to initiate a disconnect
+	 * on the EP.  This can be a normal close (RTS->CLOSING) or
+	 * an abnormal close (RTS/CLOSING->ERROR).
+	 */
+	if (disconnect)
+		iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+
+	/* 
+	 * If free is 1, then we've disassociated the EP from the QP 
+	 * and we need to dereference the EP.
+	 */
+	if (free)
+		put_ep(&ep->com);
+
+	PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state);
+	return ret;
+}
+
+static int quiesce_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_quiesce_tid(qhp->ep);
+	qhp->flags |= QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+static int resume_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_resume_tid(qhp->ep);
+	qhp->flags &= ~QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+int iwch_quiesce_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) {
+			quiesce_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp)) 
+			quiesce_qp(qhp);
+	}
+	return 0;
+}
+
+int iwch_resume_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) {
+			resume_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp))
+			resume_qp(qhp);
+	}
+	return 0;
+}
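
For readers following the state handling above: the switch on
qhp->attr.state amounts to a small transition table. The sketch below
restates it in userspace C as a reading aid, not as the driver's code.
It rests on several assumptions: the enum values and the name
qp_transition() are hypothetical, the attribute-mask checks, rdma_init()
and the queue-empty check are all taken to succeed, and ret is assumed
to start at 0.

```c
#include <errno.h>

/* Hypothetical stand-ins for the IWCH_QP_STATE_* values. */
enum qp_state { QP_IDLE, QP_RTS, QP_CLOSING, QP_TERMINATE, QP_ERROR };

/*
 * Return 0 or -EINVAL for a requested transition and store in *result
 * the state the QP is left in.  "internal" mirrors the flag the patch
 * uses to tell driver-initiated changes from consumer requests.
 */
static int qp_transition(enum qp_state cur, enum qp_state next,
			 int internal, enum qp_state *result)
{
	*result = cur;
	if (cur == next)
		return 0;		/* same state: nothing to do */

	switch (cur) {
	case QP_IDLE:
		if (next != QP_RTS && next != QP_ERROR)
			return -EINVAL;
		break;
	case QP_RTS:
		if (next != QP_CLOSING && next != QP_TERMINATE &&
		    next != QP_ERROR)
			return -EINVAL;
		break;
	case QP_CLOSING:
		if (!internal)
			return -EINVAL;
		if (next != QP_IDLE) {
			/* the err path forces ERROR, even for a bad request */
			*result = QP_ERROR;
			return next == QP_ERROR ? 0 : -EINVAL;
		}
		break;
	case QP_ERROR:
		if (next != QP_IDLE)
			return -EINVAL;
		break;
	case QP_TERMINATE:
		if (!internal)
			return -EINVAL;
		/* any internal request ends in ERROR via the err path */
		*result = QP_ERROR;
		return 0;
	default:
		*result = QP_ERROR;
		return -EINVAL;
	}
	*result = next;
	return 0;
}
```

*result can differ from the requested state: whenever the err path is
taken the QP ends up in ERROR, whatever was asked for.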


From swise at opengridcomputing.com  Thu Dec 14 05:55:36 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:55:36 -0600
Subject: [openib-general] [PATCH  v4 06/13] Completion Queues
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135536.21159.74057.stgit@dell3.ogc.int>


Functions to manipulate CQs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cq.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c
new file mode 100644
index 0000000..9d82df4
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+
+/*
+ * Get one cq entry from cxio and map it to openib.
+ *
+ * Returns:
+ *	0			CQ is empty
+ *	1			cqe returned
+ *	-EAGAIN			caller must try again
+ *	any other -errno	fatal error
+ */
+int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp,
+		     struct ib_wc *wc)
+{
+	struct iwch_qp *qhp = NULL;
+	struct t3_cqe cqe, *rd_cqe;
+	struct t3_wq *wq;
+	u32 credit = 0;
+	u8 cqe_flushed;
+	u64 cookie;
+	int ret = 1;
+
+	rd_cqe = cxio_next_cqe(&chp->cq);
+
+	if (!rd_cqe)
+		return 0;
+
+	qhp = get_qhp(rhp, CQE_QPID(*rd_cqe));
+	if (!qhp)
+		wq = NULL;
+	else {
+		spin_lock(&qhp->lock);
+		wq = &(qhp->wq);
+	}
+	ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie,
+				   &credit);
+	if (t3a_device(chp->rhp) && credit) {
+		PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__, 
+		     credit, chp->cq.cqid);
+		cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit);
+	}
+
+	if (ret) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	ret = 1;
+
+	wc->wr_id = cookie;
+	wc->qp_num = qhp->wq.qpid;
+	wc->vendor_err = CQE_STATUS(cqe);
+
+	PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x "
+	     "lo 0x%x cookie 0x%llx\n", __FUNCTION__, 
+	     CQE_QPID(cqe), CQE_TYPE(cqe),
+	     CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe),
+	     CQE_WRID_LOW(cqe), cookie);
+
+	if (CQE_TYPE(cqe) == 0) {
+		if (!CQE_STATUS(cqe))
+			wc->byte_len = CQE_LEN(cqe);
+		else
+			wc->byte_len = 0;
+		wc->opcode = IB_WC_RECV;
+	} else {
+		switch (CQE_OPCODE(cqe)) {
+		case T3_RDMA_WRITE:
+			wc->opcode = IB_WC_RDMA_WRITE;
+			break;
+		case T3_READ_REQ:
+			wc->opcode = IB_WC_RDMA_READ;
+			wc->byte_len = CQE_LEN(cqe);
+			break;
+		case T3_SEND:
+		case T3_SEND_WITH_SE:
+			wc->opcode = IB_WC_SEND;
+			break;
+		case T3_BIND_MW:
+			wc->opcode = IB_WC_BIND_MW;
+			break;
+
+		/* these aren't supported yet */
+		case T3_SEND_WITH_INV:
+		case T3_SEND_WITH_SE_INV:
+		case T3_LOCAL_INV:
+		case T3_FAST_REGISTER:
+		default:
+			printk(KERN_ERR MOD "Unexpected opcode %d "
+			       "in the CQE received for QPID=0x%0x\n", 
+			       CQE_OPCODE(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (cqe_flushed)
+		wc->status = IB_WC_WR_FLUSH_ERR;
+	else {
+		
+		switch (CQE_STATUS(cqe)) {
+		case TPT_ERR_SUCCESS:
+			wc->status = IB_WC_SUCCESS;
+			break;
+		case TPT_ERR_STAG:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_PDID:
+			wc->status = IB_WC_LOC_PROT_ERR;
+			break;
+		case TPT_ERR_QPID:
+		case TPT_ERR_ACCESS:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_WRAP:
+			wc->status = IB_WC_GENERAL_ERR;
+			break;
+		case TPT_ERR_BOUND:
+			wc->status = IB_WC_LOC_LEN_ERR;
+			break;
+		case TPT_ERR_INVALIDATE_SHARED_MR:
+		case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+			wc->status = IB_WC_MW_BIND_ERR;
+			break;
+		case TPT_ERR_CRC:
+		case TPT_ERR_MARKER:
+		case TPT_ERR_PDU_LEN_ERR:
+		case TPT_ERR_OUT_OF_RQE:
+		case TPT_ERR_DDP_VERSION:
+		case TPT_ERR_RDMA_VERSION:
+		case TPT_ERR_DDP_QUEUE_NUM:
+		case TPT_ERR_MSN:
+		case TPT_ERR_TBIT:
+		case TPT_ERR_MO:
+		case TPT_ERR_MSN_RANGE:
+		case TPT_ERR_IRD_OVERFLOW:
+		case TPT_ERR_OPCODE:
+			wc->status = IB_WC_FATAL_ERR;
+			break;
+		case TPT_ERR_SWFLUSH:
+			wc->status = IB_WC_WR_FLUSH_ERR;
+			break;
+		default:
+			printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for "
+			       "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+		}
+	}
+out:
+	if (wq)
+		spin_unlock(&qhp->lock);
+	return ret;
+}
+
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	unsigned long flags;
+	int npolled;
+	int err = 0;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+
+	spin_lock_irqsave(&chp->lock, flags);
+	for (npolled = 0; npolled < num_entries; ++npolled) {
+#ifdef DEBUG
+		int i=0;
+#endif
+
+		/*
+		 * Because T3 can post CQEs that are _not_ associated
+		 * with a WR, we might have to poll again after removing
+		 * one of these.
+		 */
+		do {
+			err = iwch_poll_cq_one(rhp, chp, wc + npolled);
+#ifdef DEBUG
+			BUG_ON(++i > 1000);
+#endif
+		} while (err == -EAGAIN);
+		if (err <= 0)
+			break;
+	}
+	spin_unlock_irqrestore(&chp->lock, flags);
+
+	if (err < 0)
+		return err;
+	else {
+		return npolled;
+	}
+}
+
+int iwch_modify_cq(struct ib_cq *cq, int cqe)
+{
+	PDBG("iwch_modify_cq: TBD\n");
+	return 0;
+}
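
The return-value contract between iwch_poll_cq_one() and iwch_poll_cq()
above (0 = empty, 1 = entry returned, -EAGAIN = retry, any other
negative value = fatal) can be tried out in userspace. The sketch below
is only an illustration: struct fake_cq, fake_poll_one() and fake_poll()
are hypothetical names, and the "CQ" is simply a scripted array of
return codes. Locking and the DEBUG retry limit are left out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a CQ: a scripted list of per-call results. */
struct fake_cq {
	const int *results;	/* each entry: 1, 0, -EAGAIN or another -errno */
	size_t len;
	size_t pos;
};

/*
 * Stand-in for iwch_poll_cq_one(): consume the next scripted result.
 * An exhausted script reads as an empty queue.
 */
static int fake_poll_one(struct fake_cq *cq, int *wc)
{
	int ret;

	if (cq->pos >= cq->len)
		return 0;
	ret = cq->results[cq->pos++];
	if (ret == 1)
		*wc = (int)cq->pos;	/* dummy completion payload */
	return ret;
}

/*
 * Same control flow as iwch_poll_cq(): retry on -EAGAIN, stop on empty
 * or error, and report an error in preference to the count polled.
 */
static int fake_poll(struct fake_cq *cq, int num_entries, int *wc)
{
	int npolled;
	int err = 0;

	for (npolled = 0; npolled < num_entries; ++npolled) {
		do {
			err = fake_poll_one(cq, wc + npolled);
		} while (err == -EAGAIN);
		if (err <= 0)
			break;
	}
	return err < 0 ? err : npolled;
}
```

As in the patch, a fatal error hides the number of entries polled before
it: the caller sees only the negative errno.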


From swise at opengridcomputing.com  Thu Dec 14 05:56:06 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:56:06 -0600
Subject: [openib-general] [PATCH  v4 07/13] Async Event Handler
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135606.21159.29525.stgit@dell3.ogc.int>


Code to handle async events coming from the T3 RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_ev.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c
new file mode 100644
index 0000000..b0bd014
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/slab.h>
+#include <linux/mman.h>
+#include <net/sock.h>
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp,
+			  struct respQ_msg_t *rsp_msg,
+			  enum ib_event_type ib_event, 
+			  int send_term)
+{
+	struct ib_event event;
+	struct iwch_qp_attributes attrs;
+	struct iwch_qp *qhp;
+
+	printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x "
+	       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+	       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+	       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+	       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+
+	spin_lock(&rnicp->lock);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+
+	if (!qhp) {
+		printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n", 
+		       __FUNCTION__, CQE_STATUS(rsp_msg->cqe), 
+		       CQE_QPID(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	if ((qhp->attr.state == IWCH_QP_STATE_ERROR) ||
+	    (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) {
+		PDBG("%s AE received after RTS - "
+		     "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__, 
+		     qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	atomic_inc(&qhp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	event.event = ib_event;
+	event.device = chp->ibcq.device;
+	if (ib_event == IB_EVENT_CQ_ERR)
+		event.element.cq = &chp->ibcq;
+	else 
+		event.element.qp = &qhp->ibqp;
+
+	if (qhp->ibqp.event_handler)
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_TERMINATE;
+		iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, 
+			       &attrs, 1);
+		if (send_term)
+			iwch_post_terminate(qhp, rsp_msg);
+	} 
+
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+}
+
+void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
+{
+	struct iwch_dev *rnicp;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	struct iwch_cq *chp;
+	struct iwch_qp *qhp;
+	u32 cqid = RSPQ_CQID(rsp_msg);
+
+	rnicp = (struct iwch_dev *) rdev_p->ulp;
+	spin_lock(&rnicp->lock);
+	chp = get_chp(rnicp, cqid);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+	if (!chp || !qhp) {
+		printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d "
+		       "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n", 
+		       cqid, CQE_QPID(rsp_msg->cqe), 
+		       CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+		       CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe), 
+		       CQE_WRID_LOW(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		goto out;
+	}
+	iwch_qp_add_ref(&qhp->ibqp);
+	atomic_inc(&chp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	/* 
+	 * 1) completion of our sending a TERMINATE.
+	 * 2) incoming TERMINATE message.  
+	 */
+	if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) && 
+	    (CQE_STATUS(rsp_msg->cqe) == 0)) {
+		if (SQ_TYPE(rsp_msg->cqe)) {
+			PDBG("%s QPID 0x%x ep %p disconnecting\n", 
+			     __FUNCTION__, qhp->wq.qpid, qhp->ep);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		} else {
+			PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__, 
+			     qhp->wq.qpid);
+			post_qp_event(rnicp, chp, rsp_msg, 
+				      IB_EVENT_QP_REQ_ERR, 0);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		}
+		goto done;
+	}
+
+	/* Bad incoming Read request */
+	if (SQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	/* Bad incoming write */
+	if (RQ_TYPE(rsp_msg->cqe) && 
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	switch (CQE_STATUS(rsp_msg->cqe)) {
+
+	/* Completion Events */
+	case TPT_ERR_SUCCESS:
+
+		/* 
+		 * Confirm the destination entry if this is a RECV completion.
+		 */
+		if (qhp->ep && SQ_TYPE(rsp_msg->cqe))
+			dst_confirm(qhp->ep->dst);
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		break;
+
+	case TPT_ERR_STAG:
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+	case TPT_ERR_WRAP:
+	case TPT_ERR_BOUND:
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x "
+		       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__, 
+		       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe), 
+		       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+		       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1);
+		break;
+
+	/* Device Fatal Errors */
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR: 
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1);
+		break;
+	
+	/* QP Fatal Errors */
+	case TPT_ERR_OUT_OF_RQE:
+	case TPT_ERR_PBL_ADDR_BOUND:
+	case TPT_ERR_CRC:
+	case TPT_ERR_MARKER:
+	case TPT_ERR_PDU_LEN_ERR:
+	case TPT_ERR_DDP_VERSION:
+	case TPT_ERR_RDMA_VERSION:
+	case TPT_ERR_OPCODE:
+	case TPT_ERR_DDP_QUEUE_NUM:
+	case TPT_ERR_MSN:
+	case TPT_ERR_TBIT:
+	case TPT_ERR_MO:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_RQE_ADDR_BOUND:
+	case TPT_ERR_IRD_OVERFLOW:
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+
+	default:
+		printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n", 
+		       CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+	}
+done:
+	if (atomic_dec_and_test(&chp->refcnt))
+		wake_up(&chp->wait);
+	iwch_qp_rem_ref(&qhp->ibqp);
+out:
+	dev_kfree_skb_irq(skb);
+}


From swise at opengridcomputing.com  Thu Dec 14 05:56:37 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:56:37 -0600
Subject: [openib-general] [PATCH  v4 08/13] Memory Registration
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135636.21159.34359.stgit@dell3.ogc.int>


Functions to register memory regions.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_mem.c |  170 ++++++++++++++++++++++++++++++++
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c
new file mode 100644
index 0000000..774d11e
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	if (cxio_register_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	/* We could support this... */
+	if (npages > mhp->attr.pbl_size)
+		return -ENOMEM;
+
+	stag = mhp->attr.stag;
+	if (cxio_reregister_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid); 
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list)
+{
+	u64 mask;
+	int i, j, n;
+
+	mask = 0;
+	*total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
+			return -EINVAL;
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return -EINVAL;
+		*total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	if (*total_size > 0xFFFFFFFFULL)
+		return -ENOMEM;
+
+	/* Find largest page shift we can use to cover buffers */
+	for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift))
+		if (num_phys_buf > 1) {
+			if ((1ULL << *shift) & mask)
+				break;
+		} else 
+			if (1ULL << *shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << *shift) - 1)))
+				break;
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1);
+	buffer_list[0].addr &= ~0ull << *shift;
+
+	*npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		*npages += (buffer_list[i].size + 
+			(1ULL << *shift) - 1) >> *shift;
+
+	if (!*npages)
+		return -EINVAL;
+
+	*page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL);
+	if (!*page_list)
+		return -ENOMEM;
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift;
+		     ++j) 
+			(*page_list)[n++] = cpu_to_be64(buffer_list[i].addr +
+			    ((u64) j << *shift));
+
+	PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n",
+	     __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages);
+
+	return 0;
+
+}
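
The page-shift selection in build_phys_page_list() above is easier to
follow with concrete numbers. The userspace sketch below mirrors only
that part. Everything in it is illustrative: struct phys_buf stands in
for struct ib_phys_buf, pick_shift() and count_pages() are hypothetical
helpers, and PAGE_SHIFT is assumed to be 12. It does not validate
alignment or allocate the page list.

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KB base pages */
#define SKETCH_MAX_SHIFT  27	/* loop bound used in the patch */

struct phys_buf {		/* stand-in for struct ib_phys_buf */
	uint64_t addr;
	uint64_t size;
};

/*
 * Pick the page shift as the patch does.  With several buffers the
 * loop stops at the lowest address bit set in any buffer after the
 * first; with one buffer it stops at the first page size that covers
 * the buffer plus its offset within that page.
 */
static int pick_shift(const struct phys_buf *bufs, int nbufs)
{
	uint64_t mask = 0;
	int shift, i;

	for (i = 1; i < nbufs; ++i)
		mask |= bufs[i].addr;

	for (shift = SKETCH_PAGE_SHIFT; shift < SKETCH_MAX_SHIFT; ++shift) {
		if (nbufs > 1) {
			if ((1ULL << shift) & mask)
				break;
		} else if ((1ULL << shift) >= bufs[0].size +
			   (bufs[0].addr & ((1ULL << shift) - 1))) {
			break;
		}
	}
	return shift;
}

/*
 * Count the (1 << shift)-sized pages needed once the first buffer has
 * been extended down to a page boundary, as the patch does in place.
 */
static int count_pages(const struct phys_buf *bufs, int nbufs, int shift)
{
	uint64_t pgsz = 1ULL << shift;
	int npages = 0, i;

	for (i = 0; i < nbufs; ++i) {
		uint64_t size = bufs[i].size;

		if (i == 0)
			size += bufs[0].addr & (pgsz - 1);
		npages += (int)((size + pgsz - 1) >> shift);
	}
	return npages;
}
```

For a single 0x3000-byte buffer at address 0x1000 this yields shift 14
and one page, since a 16 KB page starting at 0 covers the whole range.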


From swise at opengridcomputing.com  Thu Dec 14 05:57:07 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:57:07 -0600
Subject: [openib-general] [PATCH  v4 09/13] Core WQE/CQE Types
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135707.21159.1944.stgit@dell3.ogc.int>


T3 WQE (work queue entry) and CQE (completion queue entry) structures,
opcode enums, and queue-pointer macros.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_wr.h |  685 ++++++++++++++++++++++++++++
 1 files changed, 685 insertions(+), 0 deletions(-)
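
Note on the Q_* macros in this header: they treat rptr and wptr as
free-running counters for a queue of 2^size_log2 entries, so only the
difference of the two and the low size_log2 bits of each are
meaningful. The sketch below restates them as userspace functions to
show the arithmetic. The q_* names are hypothetical, and the counters
are assumed to be 32 bits wide (the header's Q_FREECNT and Q_PTR2IDX
use 1UL, which is wider on 64-bit builds).

```c
#include <assert.h>
#include <stdint.h>

/* Queue holds no entries when the two counters are equal. */
static inline int q_empty(uint32_t rptr, uint32_t wptr)
{
	return rptr == wptr;
}

/* Full once the counters differ by at least 2^size_log2. */
static inline int q_full(uint32_t rptr, uint32_t wptr, unsigned size_log2)
{
	return ((wptr - rptr) >> size_log2) && rptr != wptr;
}

/* Entries currently queued; unsigned wraparound keeps this correct. */
static inline uint32_t q_count(uint32_t rptr, uint32_t wptr)
{
	return wptr - rptr;
}

/* Free slots remaining. */
static inline uint32_t q_freecnt(uint32_t rptr, uint32_t wptr,
				 unsigned size_log2)
{
	return (UINT32_C(1) << size_log2) - (wptr - rptr);
}

/* Slot index for a counter value: its low size_log2 bits. */
static inline uint32_t q_ptr2idx(uint32_t ptr, unsigned size_log2)
{
	assert(size_log2 < 32);
	return ptr & ((UINT32_C(1) << size_log2) - 1);
}

/* Generation bit: flips each time the counter wraps past the queue size. */
static inline int q_genbit(uint32_t ptr, unsigned size_log2)
{
	return !((ptr >> size_log2) & 0x1);
}
```

With size_log2 = 3, rptr = 0xFFFFFFFE and wptr = 2 give a count of 4:
the subtraction wraps modulo 2^32, so no special case is needed when
the counters overflow.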

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
new file mode 100644
index 0000000..45870be
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
@@ -0,0 +1,685 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_WR_H__
+#define __CXIO_WR_H__
+
+#include <asm/io.h>
+#include <linux/pci.h>
+#include <linux/timer.h>
+#include "firmware_exports.h"
+
+#define T3_MAX_SGE      4
+
+#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr))
+#define Q_FULL(rptr,wptr,size_log2)  ( (((wptr)-(rptr))>>(size_log2)) && \
+				       ((rptr)!=(wptr)) )
+#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>size_log2)&0x1))
+#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<<size_log2)-((wptr)-(rptr)))
+#define Q_COUNT(rptr,wptr) ((wptr)-(rptr))
+#define Q_PTR2IDX(ptr,size_log2) (ptr & ((1UL<<size_log2)-1))
+
+static inline void ring_doorbell(void __iomem *doorbell, u32 qpid) 
+{
+	writel(((1U << 31) | qpid), doorbell);
+}
+
+#define SEQ32_GE(x,y) (!( (((u32) (x)) - ((u32) (y))) & 0x80000000 ))
+
+enum t3_wr_flags {
+	T3_COMPLETION_FLAG = 0x01,
+	T3_NOTIFY_FLAG = 0x02,
+	T3_SOLICITED_EVENT_FLAG = 0x04,
+	T3_READ_FENCE_FLAG = 0x08,
+	T3_LOCAL_FENCE_FLAG = 0x10
+} __attribute__ ((packed));
+
+enum t3_wr_opcode {
+	T3_WR_BP = FW_WROPCODE_RI_BYPASS,
+	T3_WR_SEND = FW_WROPCODE_RI_SEND,
+	T3_WR_WRITE = FW_WROPCODE_RI_RDMA_WRITE,
+	T3_WR_READ = FW_WROPCODE_RI_RDMA_READ,
+	T3_WR_INV_STAG = FW_WROPCODE_RI_LOCAL_INV,
+	T3_WR_BIND = FW_WROPCODE_RI_BIND_MW,
+	T3_WR_RCV = FW_WROPCODE_RI_RECEIVE,
+	T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT,
+	T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP
+} __attribute__ ((packed));
+
+enum t3_rdma_opcode {
+	T3_RDMA_WRITE,		/* IETF RDMAP v1.0 ... */
+	T3_READ_REQ,
+	T3_READ_RESP,
+	T3_SEND,
+	T3_SEND_WITH_INV,
+	T3_SEND_WITH_SE,
+	T3_SEND_WITH_SE_INV,
+	T3_TERMINATE,
+	T3_RDMA_INIT,		/* CHELSIO RI specific ... */
+	T3_BIND_MW,
+	T3_FAST_REGISTER,
+	T3_LOCAL_INV,
+	T3_QP_MOD,
+	T3_BYPASS
+} __attribute__ ((packed));
+
+static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop)
+{
+	switch (wrop) {
+		case T3_WR_BP: return T3_BYPASS;
+		case T3_WR_SEND: return T3_SEND;
+		case T3_WR_WRITE: return T3_RDMA_WRITE;
+		case T3_WR_READ: return T3_READ_REQ;
+		case T3_WR_INV_STAG: return T3_LOCAL_INV;
+		case T3_WR_BIND: return T3_BIND_MW;
+		case T3_WR_INIT: return T3_RDMA_INIT;
+		case T3_WR_QP_MOD: return T3_QP_MOD;
+		default: break;
+	}
+	return -1;
+}
+
+
+/* Work request id */
+union t3_wrid {
+	struct {
+		u32 hi;
+		u32 low;
+	} id0;
+	u64 id1;
+};
+
+#define WRID(wrid)      	(wrid.id1)
+#define WRID_GEN(wrid)		(wrid.id0.wr_gen)
+#define WRID_IDX(wrid)		(wrid.id0.wr_idx)
+#define WRID_LO(wrid)		(wrid.id0.wr_lo)
+
+struct fw_riwrh {
+	__be32 op_seop_flags;
+	__be32 gen_tid_len;
+};
+
+#define S_FW_RIWR_OP		24
+#define M_FW_RIWR_OP		0xff
+#define V_FW_RIWR_OP(x)		((x) << S_FW_RIWR_OP)
+#define G_FW_RIWR_OP(x)   	((((x) >> S_FW_RIWR_OP)) & M_FW_RIWR_OP)
+
+#define S_FW_RIWR_SOPEOP	22
+#define M_FW_RIWR_SOPEOP	0x3
+#define V_FW_RIWR_SOPEOP(x)	((x) << S_FW_RIWR_SOPEOP)
+
+#define S_FW_RIWR_FLAGS		8
+#define M_FW_RIWR_FLAGS		0x3fffff
+#define V_FW_RIWR_FLAGS(x)	((x) << S_FW_RIWR_FLAGS)
+#define G_FW_RIWR_FLAGS(x)   	((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS)
+
+#define S_FW_RIWR_TID		8
+#define V_FW_RIWR_TID(x)	((x) << S_FW_RIWR_TID)
+
+#define S_FW_RIWR_LEN		0
+#define V_FW_RIWR_LEN(x)	((x) << S_FW_RIWR_LEN)
+
+#define S_FW_RIWR_GEN           31
+#define V_FW_RIWR_GEN(x)        ((x)  << S_FW_RIWR_GEN)
+
+struct t3_sge {
+	__be32 stag;
+	__be32 len;
+	__be64 to;
+};
+
+/* If num_sgle is zero, flit 5+ contains immediate data.*/
+struct t3_send_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;	
+	__be32 plen;		/* 3 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 4+ */
+};
+
+struct t3_local_inv_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 stag;		/* 2 */
+	__be32 reserved3;
+};
+
+struct t3_rdma_write_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 stag_sink;
+	__be64 to_sink;		/* 3 */
+	__be32 plen;		/* 4 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 5+ */
+};
+
+struct t3_rdma_read_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;
+	__be64 rem_to;		/* 3 */
+	__be32 local_stag;	/* 4 */
+	__be32 local_len;
+	__be64 local_to;	/* 5 */
+};
+
+enum t3_addr_type {
+	T3_VA_BASED_TO = 0x0,
+	T3_ZERO_BASED_TO = 0x1
+} __attribute__ ((packed));
+
+enum t3_mem_perms {
+	T3_MEM_ACCESS_LOCAL_READ = 0x1,
+	T3_MEM_ACCESS_LOCAL_WRITE = 0x2,
+	T3_MEM_ACCESS_REM_READ = 0x4,
+	T3_MEM_ACCESS_REM_WRITE = 0x8
+} __attribute__ ((packed));
+
+struct t3_bind_mw_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u16 reserved;		/* 2 */
+	u8 type;
+	u8 perms;
+	__be32 mr_stag;
+	__be32 mw_stag;		/* 3 */
+	__be32 mw_len;
+	__be64 mw_va;		/* 4 */
+	__be32 mr_pbl_addr;	/* 5 */
+	u8 reserved2[3];
+	u8 mr_pagesz;
+};
+
+struct t3_receive_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 pagesz[T3_MAX_SGE];
+	__be32 num_sgle;		/* 2 */
+	struct t3_sge sgl[T3_MAX_SGE];	/* 3+ */
+	__be32 pbl_addr[T3_MAX_SGE];
+};
+
+struct t3_bypass_wr {
+	struct fw_riwrh wrh;
+	union t3_wrid wrid;	/* 1 */
+};
+
+struct t3_modify_qp_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 flags;		/* 2 */
+	__be32 quiesce;		/* 2 */
+	__be32 max_ird;		/* 3 */
+	__be32 max_ord;		/* 3 */
+	__be64 sge_cmd;		/* 4 */
+	__be64 ctx1;		/* 5 */
+	__be64 ctx0;		/* 6 */
+};
+
+enum t3_modify_qp_flags {
+	MODQP_QUIESCE  = 0x01,
+	MODQP_MAX_IRD  = 0x02,
+	MODQP_MAX_ORD  = 0x04,
+	MODQP_WRITE_EC = 0x08,
+	MODQP_READ_EC  = 0x10,
+};
+	
+
+enum t3_mpa_attrs {
+	uP_RI_MPA_RX_MARKER_ENABLE = 0x1,
+	uP_RI_MPA_TX_MARKER_ENABLE = 0x2,
+	uP_RI_MPA_CRC_ENABLE = 0x4,
+	uP_RI_MPA_IETF_ENABLE = 0x8
+} __attribute__ ((packed));
+
+enum t3_qp_caps {
+	uP_RI_QP_RDMA_READ_ENABLE = 0x01,
+	uP_RI_QP_RDMA_WRITE_ENABLE = 0x02,
+	uP_RI_QP_BIND_ENABLE = 0x04,
+	uP_RI_QP_FAST_REGISTER_ENABLE = 0x08,
+	uP_RI_QP_STAG0_ENABLE = 0x10
+} __attribute__ ((packed));
+
+struct t3_rdma_init_attr {
+	u32 tid;
+	u32 qpid;
+	u32 pdid;
+	u32 scqid;
+	u32 rcqid;
+	u32 rq_addr;
+	u32 rq_size;
+	enum t3_mpa_attrs mpaattrs;
+	enum t3_qp_caps qpcaps;
+	u16 tcp_emss;
+	u32 ord;
+	u32 ird;
+	u64 qp_dma_addr;
+	u32 qp_dma_size;
+	u32 flags;
+};
+
+struct t3_rdma_init_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 qpid;		/* 2 */
+	__be32 pdid;
+	__be32 scqid;		/* 3 */
+	__be32 rcqid;
+	__be32 rq_addr;		/* 4 */
+	__be32 rq_size;
+	u8 mpaattrs;		/* 5 */
+	u8 qpcaps;
+	__be16 ulpdu_size;
+	__be32 flags;		/* bits 31-1 - reserved */
+				/* bit     0 - set if RECV posted */
+	__be32 ord;		/* 6 */
+	__be32 ird;
+	__be64 qp_dma_addr;	/* 7 */
+	__be32 qp_dma_size;	/* 8 */
+	u32 rsvd;
+};
+
+struct t3_genbit {
+	u64 flit[15];
+	__be64 genbit;
+};
+
+enum rdma_init_wr_flags {
+	RECVS_POSTED = 1,
+};
+
+union t3_wr {
+	struct t3_send_wr send;
+	struct t3_rdma_write_wr write;
+	struct t3_rdma_read_wr read;
+	struct t3_receive_wr recv;
+	struct t3_local_inv_wr local_inv;
+	struct t3_bind_mw_wr bind;
+	struct t3_bypass_wr bypass;
+	struct t3_rdma_init_wr init;
+	struct t3_modify_qp_wr qp_mod;
+	struct t3_genbit genbit;
+	u64 flit[16];
+};
+
+#define T3_SQ_CQE_FLIT 	  13
+#define T3_SQ_COOKIE_FLIT 14
+
+#define T3_RQ_COOKIE_FLIT 13
+#define T3_RQ_CQE_FLIT 	  14
+
+static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe)
+{
+	return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags));
+}
+
+static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op,
+				  enum t3_wr_flags flags, u8 genbit, u32 tid,
+				  u8 len)
+{
+	wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) |
+					 V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) |
+					 V_FW_RIWR_FLAGS(flags));
+	wmb();
+	wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) |
+				       V_FW_RIWR_TID(tid) |
+				       V_FW_RIWR_LEN(len));
+	/* 2nd gen bit... */
+	((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit);
+}
+
+/*
+ * T3 ULP2_TX commands
+ */
+enum t3_utx_mem_op {
+	T3_UTX_MEM_READ = 2,
+	T3_UTX_MEM_WRITE = 3
+};
+
+/* T3 MC7 RDMA TPT entry format */
+
+enum tpt_mem_type {
+	TPT_NON_SHARED_MR = 0x0,
+	TPT_SHARED_MR = 0x1,
+	TPT_MW = 0x2,
+	TPT_MW_RELAXED_PROTECTION = 0x3
+};
+
+enum tpt_addr_type {
+	TPT_ZBTO = 0,
+	TPT_VATO = 1
+};
+
+enum tpt_mem_perm {
+	TPT_LOCAL_READ = 0x8,
+	TPT_LOCAL_WRITE = 0x4,
+	TPT_REMOTE_READ = 0x2,
+	TPT_REMOTE_WRITE = 0x1
+};
+
+struct tpt_entry {
+	__be32 valid_stag_pdid;
+	__be32 flags_pagesize_qpid;
+
+	__be32 rsvd_pbl_addr;
+	__be32 len;
+	__be32 va_hi;
+	__be32 va_low_or_fbo;
+
+	__be32 rsvd_bind_cnt_or_pstag;
+	__be32 rsvd_pbl_size;
+};
+
+#define S_TPT_VALID		31
+#define V_TPT_VALID(x)		((x) << S_TPT_VALID)
+#define F_TPT_VALID		V_TPT_VALID(1U)
+
+#define S_TPT_STAG_KEY		23
+#define M_TPT_STAG_KEY		0xFF
+#define V_TPT_STAG_KEY(x)	((x) << S_TPT_STAG_KEY)
+#define G_TPT_STAG_KEY(x)	(((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY)
+
+#define S_TPT_STAG_STATE	22
+#define V_TPT_STAG_STATE(x)	((x) << S_TPT_STAG_STATE)
+#define F_TPT_STAG_STATE	V_TPT_STAG_STATE(1U)
+
+#define S_TPT_STAG_TYPE		20
+#define M_TPT_STAG_TYPE		0x3
+#define V_TPT_STAG_TYPE(x)	((x) << S_TPT_STAG_TYPE)
+#define G_TPT_STAG_TYPE(x)	(((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE)
+
+#define S_TPT_PDID		0
+#define M_TPT_PDID		0xFFFFF
+#define V_TPT_PDID(x)		((x) << S_TPT_PDID)
+#define G_TPT_PDID(x)		(((x) >> S_TPT_PDID) & M_TPT_PDID)
+
+#define S_TPT_PERM		28
+#define M_TPT_PERM		0xF
+#define V_TPT_PERM(x)		((x) << S_TPT_PERM)
+#define G_TPT_PERM(x)		(((x) >> S_TPT_PERM) & M_TPT_PERM)
+
+#define S_TPT_REM_INV_DIS	27
+#define V_TPT_REM_INV_DIS(x)	((x) << S_TPT_REM_INV_DIS)
+#define F_TPT_REM_INV_DIS	V_TPT_REM_INV_DIS(1U)
+
+#define S_TPT_ADDR_TYPE		26
+#define V_TPT_ADDR_TYPE(x)	((x) << S_TPT_ADDR_TYPE)
+#define F_TPT_ADDR_TYPE		V_TPT_ADDR_TYPE(1U)
+
+#define S_TPT_MW_BIND_ENABLE	25
+#define V_TPT_MW_BIND_ENABLE(x)	((x) << S_TPT_MW_BIND_ENABLE)
+#define F_TPT_MW_BIND_ENABLE    V_TPT_MW_BIND_ENABLE(1U)
+
+#define S_TPT_PAGE_SIZE		20
+#define M_TPT_PAGE_SIZE		0x1F
+#define V_TPT_PAGE_SIZE(x)	((x) << S_TPT_PAGE_SIZE)
+#define G_TPT_PAGE_SIZE(x)	(((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE)
+
+#define S_TPT_PBL_ADDR		0
+#define M_TPT_PBL_ADDR		0x1FFFFFFF
+#define V_TPT_PBL_ADDR(x)	((x) << S_TPT_PBL_ADDR)
+#define G_TPT_PBL_ADDR(x)       (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR)
+
+#define S_TPT_QPID		0
+#define M_TPT_QPID		0xFFFFF
+#define V_TPT_QPID(x)		((x) << S_TPT_QPID)
+#define G_TPT_QPID(x)		(((x) >> S_TPT_QPID) & M_TPT_QPID)
+
+#define S_TPT_PSTAG		0
+#define M_TPT_PSTAG		0xFFFFFF
+#define V_TPT_PSTAG(x)		((x) << S_TPT_PSTAG)
+#define G_TPT_PSTAG(x)		(((x) >> S_TPT_PSTAG) & M_TPT_PSTAG)
+
+#define S_TPT_PBL_SIZE		0
+#define M_TPT_PBL_SIZE		0xFFFFF
+#define V_TPT_PBL_SIZE(x)	((x) << S_TPT_PBL_SIZE)
+#define G_TPT_PBL_SIZE(x)	(((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE)
+
+/*
+ * CQE defs
+ */
+struct t3_cqe {
+	__be32 header;
+	__be32 len;
+	union {
+		struct {
+			__be32 stag;
+			__be32 msn;
+		} rcqe;
+		struct {
+			u32 wrid_hi;	
+			u32 wrid_low;
+		} scqe;
+	} u;
+};
+
+#define S_CQE_OOO	  31
+#define M_CQE_OOO	  0x1
+#define G_CQE_OOO(x)	  ((((x) >> S_CQE_OOO)) & M_CQE_OOO)
+#define V_CQE_OOO(x)	  ((x)<<S_CQE_OOO)
+
+#define S_CQE_QPID        12
+#define M_CQE_QPID        0x7FFFF
+#define G_CQE_QPID(x)     ((((x) >> S_CQE_QPID)) & M_CQE_QPID)
+#define V_CQE_QPID(x) 	  ((x)<<S_CQE_QPID)
+
+#define S_CQE_SWCQE       11
+#define M_CQE_SWCQE       0x1
+#define G_CQE_SWCQE(x)    ((((x) >> S_CQE_SWCQE)) & M_CQE_SWCQE)
+#define V_CQE_SWCQE(x) 	  ((x)<<S_CQE_SWCQE)
+
+#define S_CQE_GENBIT      10
+#define M_CQE_GENBIT      0x1
+#define G_CQE_GENBIT(x)   (((x) >> S_CQE_GENBIT) & M_CQE_GENBIT)
+#define V_CQE_GENBIT(x)	  ((x)<<S_CQE_GENBIT)
+
+#define S_CQE_STATUS      5
+#define M_CQE_STATUS      0x1F
+#define G_CQE_STATUS(x)   ((((x) >> S_CQE_STATUS)) & M_CQE_STATUS)
+#define V_CQE_STATUS(x)   ((x)<<S_CQE_STATUS)
+
+#define S_CQE_TYPE        4
+#define M_CQE_TYPE        0x1
+#define G_CQE_TYPE(x)     ((((x) >> S_CQE_TYPE)) & M_CQE_TYPE)
+#define V_CQE_TYPE(x)     ((x)<<S_CQE_TYPE)
+
+#define S_CQE_OPCODE      0
+#define M_CQE_OPCODE      0xF
+#define G_CQE_OPCODE(x)   ((((x) >> S_CQE_OPCODE)) & M_CQE_OPCODE)
+#define V_CQE_OPCODE(x)   ((x)<<S_CQE_OPCODE)
+
+#define SW_CQE(x)         (G_CQE_SWCQE(be32_to_cpu((x).header)))
+#define CQE_OOO(x)        (G_CQE_OOO(be32_to_cpu((x).header)))
+#define CQE_QPID(x)       (G_CQE_QPID(be32_to_cpu((x).header)))
+#define CQE_GENBIT(x)     (G_CQE_GENBIT(be32_to_cpu((x).header)))
+#define CQE_TYPE(x)       (G_CQE_TYPE(be32_to_cpu((x).header)))
+#define SQ_TYPE(x)	  (CQE_TYPE((x)))
+#define RQ_TYPE(x)	  (!CQE_TYPE((x)))
+#define CQE_STATUS(x)     (G_CQE_STATUS(be32_to_cpu((x).header)))
+#define CQE_OPCODE(x)     (G_CQE_OPCODE(be32_to_cpu((x).header)))
+
+#define CQE_LEN(x)        (be32_to_cpu((x).len))
+
+/* used for RQ completion processing */
+#define CQE_WRID_STAG(x)  (be32_to_cpu((x).u.rcqe.stag))
+#define CQE_WRID_MSN(x)   (be32_to_cpu((x).u.rcqe.msn))
+
+/* used for SQ completion processing */
+#define CQE_WRID_SQ_WPTR(x)	((x).u.scqe.wrid_hi)
+#define CQE_WRID_WPTR(x)   	((x).u.scqe.wrid_low)
+
+/* generic accessor macros */
+#define CQE_WRID_HI(x)		((x).u.scqe.wrid_hi)
+#define CQE_WRID_LOW(x) 	((x).u.scqe.wrid_low)
+
+#define TPT_ERR_SUCCESS                     0x0
+#define TPT_ERR_STAG                        0x1	 /* STAG invalid: either the */
+						 /* STAG is off-limits, being 0, */
+						 /* or STAG_key mismatch */
+#define TPT_ERR_PDID                        0x2	 /* PDID mismatch */
+#define TPT_ERR_QPID                        0x3	 /* QPID mismatch */
+#define TPT_ERR_ACCESS                      0x4	 /* Invalid access right */
+#define TPT_ERR_WRAP                        0x5	 /* Wrap error */
+#define TPT_ERR_BOUND                       0x6	 /* base and bounds violation */
+#define TPT_ERR_INVALIDATE_SHARED_MR        0x7	 /* attempt to invalidate a  */
+						 /* shared memory region */
+#define TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND 0x8	 /* attempt to invalidate a  */
+						 /* MR that has a MW bound */
+#define TPT_ERR_ECC                         0x9	 /* ECC error detected */
+#define TPT_ERR_ECC_PSTAG                   0xA	 /* ECC error detected when  */
+						 /* reading PSTAG for a MW  */
+						 /* Invalidate */
+#define TPT_ERR_PBL_ADDR_BOUND              0xB	 /* pbl addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_SWFLUSH			    0xC	 /* SW FLUSHED */
+#define TPT_ERR_CRC                         0x10 /* CRC error */
+#define TPT_ERR_MARKER                      0x11 /* Marker error */
+#define TPT_ERR_PDU_LEN_ERR                 0x12 /* invalid PDU length */
+#define TPT_ERR_OUT_OF_RQE                  0x13 /* out of RQE */
+#define TPT_ERR_DDP_VERSION                 0x14 /* wrong DDP version */
+#define TPT_ERR_RDMA_VERSION                0x15 /* wrong RDMA version */
+#define TPT_ERR_OPCODE                      0x16 /* invalid rdma opcode */
+#define TPT_ERR_DDP_QUEUE_NUM               0x17 /* invalid ddp queue number */
+#define TPT_ERR_MSN                         0x18 /* MSN error */
+#define TPT_ERR_TBIT                        0x19 /* tag bit not set correctly */
+#define TPT_ERR_MO                          0x1A /* MO not 0 for TERMINATE  */
+						 /* or READ_REQ */
+#define TPT_ERR_MSN_GAP                     0x1B
+#define TPT_ERR_MSN_RANGE                   0x1C
+#define TPT_ERR_IRD_OVERFLOW                0x1D
+#define TPT_ERR_RQE_ADDR_BOUND              0x1E /* RQE addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_INTERNAL_ERR                0x1F /* internal error (opcode  */
+						 /* mismatch) */
+
+struct t3_swsq {
+	__u64 			wr_id;
+	struct t3_cqe 		cqe;
+	__u32			sq_wptr;
+	__be32			read_len;
+	int 			opcode;
+	int			complete;
+	int			signaled;	
+};
+
+/*
+ * A T3 WQ implements both the SQ and RQ.
+ */
+struct t3_wq {
+	union t3_wr *queue;		/* DMA accessible memory */
+	dma_addr_t dma_addr;		/* DMA address for HW */
+	DECLARE_PCI_UNMAP_ADDR(mapping)	/* unmap cruft */
+	u32 error;			/* 1 once we go to ERROR */
+	u32 qpid;
+	u32 wptr;			/* idx to next available WR slot */
+	u32 size_log2;			/* total wq size */
+	struct t3_swsq *sq;		/* SW SQ */
+	struct t3_swsq *oldest_read;	/* tracks oldest pending read */
+	u32 sq_wptr;			/* sq_wptr - sq_rptr == count of */
+	u32 sq_rptr;			/* pending wrs */
+	u32 sq_size_log2;		/* sq size */
+	u64 *rq;			/* SW RQ (holds consumer wr_ids) */
+	u32 rq_wptr;			/* rq_wptr - rq_rptr == count of */
+	u32 rq_rptr;			/* pending wrs */
+	u64 *rq_oldest_wr;		/* oldest wr on the SW RQ */
+	u32 rq_size_log2;		/* rq size */
+	u32 rq_addr;			/* rq adapter address */
+	void __iomem *doorbell;		/* kernel db */
+	u64 udb;			/* user db if any */
+};
+
+struct t3_cq {
+	u32 cqid;
+	u32 rptr;
+	u32 wptr;
+	u32 size_log2;
+	dma_addr_t dma_addr;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct t3_cqe *queue;
+	struct t3_cqe *sw_queue;
+	u32 sw_rptr;
+	u32 sw_wptr;
+};
+
+#define CQ_VLD_ENTRY(ptr,size_log2,cqe) (Q_GENBIT(ptr,size_log2) == \
+					 CQE_GENBIT(*cqe))
+
+static inline void cxio_set_wq_in_error(struct t3_wq *wq)
+{
+	wq->queue->flit[13] = 1;
+}
+
+static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+#endif
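
A note on the field macros in the header above: each header field gets a
shift (S_), a mask (M_), a setter (V_) and a getter (G_). Below is a minimal
standalone sketch of that pattern, using the CQE shift and mask values from
the patch. FIELD_V, FIELD_G and demo_cqe_header are hypothetical names made
up for this illustration. Unlike the V_ macros in the patch, FIELD_V masks
its argument before shifting, and the byte-order conversion (cpu_to_be32)
is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Shift/mask values copied from the CQE definitions in the patch. */
#define S_CQE_QPID    12
#define M_CQE_QPID    0x7FFFF
#define S_CQE_SWCQE   11
#define M_CQE_SWCQE   0x1
#define S_CQE_GENBIT  10
#define M_CQE_GENBIT  0x1
#define S_CQE_STATUS  5
#define M_CQE_STATUS  0x1F
#define S_CQE_TYPE    4
#define M_CQE_TYPE    0x1
#define S_CQE_OPCODE  0
#define M_CQE_OPCODE  0xF

/* Hypothetical helpers: FIELD_V places a value, FIELD_G extracts it. */
#define FIELD_V(name, x) (((uint32_t)(x) & M_##name) << S_##name)
#define FIELD_G(name, x) (((uint32_t)(x) >> S_##name) & M_##name)

/* Hypothetical helper: build a host-order software CQE header word. */
static uint32_t demo_cqe_header(uint32_t qpid, uint32_t status,
				uint32_t type, uint32_t opcode,
				uint32_t genbit)
{
	return FIELD_V(CQE_QPID, qpid) |
	       FIELD_V(CQE_SWCQE, 1) |
	       FIELD_V(CQE_GENBIT, genbit) |
	       FIELD_V(CQE_STATUS, status) |
	       FIELD_V(CQE_TYPE, type) |
	       FIELD_V(CQE_OPCODE, opcode);
}
```

For example, demo_cqe_header(0x123, 0xC, 1, 2, 1) gives 0x123D92, and
FIELD_G(CQE_STATUS, ...) on that word returns 0xC again.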


From swise at opengridcomputing.com  Thu Dec 14 05:57:37 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:57:37 -0600
Subject: [openib-general] [PATCH  v4 10/13] Core HAL
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135737.21159.98294.stgit@dell3.ogc.int>


The RDMA Core interfaces with the T3 HW and the ULLD, providing a
low-level RDMA interface.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
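
Reviewer note on the generation-bit check that cxio_hal_cq_op() and the
cxio_next_*_cqe() helpers rely on: an entry counts as valid when the
generation bit stored in it matches the one derived from the free-running
read pointer, so a zeroed ring starts out empty and stale entries from the
previous lap are never re-read. Q_PTR2IDX and Q_GENBIT are defined outside
this excerpt, so demo_ptr2idx and demo_genbit below are assumptions about
their behaviour, and every demo_* name is hypothetical. A rough standalone
sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SIZE_LOG2 2	/* hypothetical 4-entry ring */

struct demo_cqe {
	uint32_t genbit;	/* stands in for the GENBIT header field */
	uint32_t payload;
};

struct demo_cq {
	struct demo_cqe queue[1u << DEMO_SIZE_LOG2];
	uint32_t rptr;		/* free-running consumer pointer */
	uint32_t wptr;		/* free-running producer pointer */
};

/* Assumed behaviour: low bits index the ring. */
static uint32_t demo_ptr2idx(uint32_t ptr)
{
	return ptr & ((1u << DEMO_SIZE_LOG2) - 1);
}

/* Assumed behaviour: the bit above the index flips on every lap. */
static uint32_t demo_genbit(uint32_t ptr)
{
	return !((ptr >> DEMO_SIZE_LOG2) & 1u);
}

/* A zeroed ring holds genbit 0, which never matches the first lap. */
static void demo_cq_init(struct demo_cq *cq)
{
	memset(cq, 0, sizeof(*cq));
}

/* Producer side (the hardware, in the real driver). */
static void demo_produce(struct demo_cq *cq, uint32_t payload)
{
	struct demo_cqe *cqe = &cq->queue[demo_ptr2idx(cq->wptr)];

	cqe->payload = payload;
	cqe->genbit = demo_genbit(cq->wptr);
	cq->wptr++;
}

/* Consumer side: return the entry at rptr only if its genbit matches. */
static struct demo_cqe *demo_next_cqe(struct demo_cq *cq)
{
	struct demo_cqe *cqe = &cq->queue[demo_ptr2idx(cq->rptr)];

	if (cqe->genbit == demo_genbit(cq->rptr))
		return cqe;
	return NULL;
}
```

The consumer advances rptr itself after handling an entry, as the driver
does with cq->rptr++ in cxio_flush_hw_cq().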

 drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_hal.h |  201 ++++
 2 files changed, 1503 insertions(+), 0 deletions(-)
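
Reviewer note on the sizing arithmetic in cxio_hal_ctrl_qp_write_mem(): the
payload is split into WQEs of at most 96 bytes, the UTX length is counted in
32-byte units, and the WR length is counted in 8-byte flits. The sketch
below restates that arithmetic in standalone form. It assumes the remaining
length shrinks by 96 bytes per WQE (the tail of the loop lies outside this
excerpt) and that the bypass header occupies two flits, as the flit comments
on wrh and wrid suggest. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_WQE_PAYLOAD 96u	/* max payload bytes per WQE, from the patch */
#define DEMO_UTX_UNIT    32u	/* UTX length unit in bytes, from the patch */
#define DEMO_BYPASS_FLITS 2u	/* assumed: wrh (flit 0) + wrid (flit 1) */

/* Number of WQEs needed for len bytes (rounds up). */
static uint32_t demo_nr_wqe(uint32_t len)
{
	return len / DEMO_WQE_PAYLOAD + (len % DEMO_WQE_PAYLOAD ? 1 : 0);
}

/* UTX length, in 32-byte units, of WQE number i (0-based, i < nr_wqe). */
static uint32_t demo_utx_len(uint32_t len, uint32_t i)
{
	uint32_t remaining = len - i * DEMO_WQE_PAYLOAD;
	uint32_t chunk = remaining > DEMO_WQE_PAYLOAD ?
			 DEMO_WQE_PAYLOAD : remaining;

	return chunk / DEMO_UTX_UNIT + (chunk % DEMO_UTX_UNIT ? 1 : 0);
}

/* WR length in flits: header + one UTX command flit + 4 flits per unit. */
static uint32_t demo_wr_len_flits(uint32_t utx_len)
{
	return DEMO_BYPASS_FLITS + 1 + (utx_len << 2);
}

/* Target address of WQE i; addr is in 32-byte units, 3 units per WQE. */
static uint32_t demo_wqe_addr(uint32_t addr, uint32_t i)
{
	return addr + i * 3;
}
```

With a full 96-byte payload this gives 15 flits, which leaves the 16th flit
of the work request for the second generation bit written by
build_fw_riwrh().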

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
new file mode 100644
index 0000000..ffc4ec0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -0,0 +1,1302 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/semaphore.h>
+#include <asm/delay.h>
+
+#include <linux/netdevice.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+#include "sge_defs.h"
+
+static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC];
+static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL;
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (!strcmp(rdev_tbl[i]->dev_name, dev_name))
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev
+							     *tdev)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (rdev_tbl[i]->t3cdev_p == tdev)
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (!rdev_tbl[i]) {
+			rdev_tbl[i] = rdev_p;
+			break;
+		}
+	return (i == T3_MAX_NUM_RNIC);
+}
+
+static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i] == rdev_p) {
+			rdev_tbl[i] = NULL;
+			break;
+		}
+}
+
+int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq, 
+		   enum t3_cq_opcode op, u32 credit)
+{
+	int ret;
+	struct t3_cqe *cqe;
+	u32 rptr;
+
+	struct rdma_cq_op setup;
+	setup.id = cq->cqid;
+	setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0;
+	setup.op = op;
+	ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup);
+
+	if ((ret < 0) || (op == CQ_CREDIT_UPDATE)) 
+		return ret;
+
+	/*
+	 * If the rearm returned an index other than our current index,
+	 * then there might be CQE's in flight (being DMA'd).  We must wait
+	 * here for them to complete or the consumer can miss a notification.
+	 */
+	if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) {
+		int i=0;
+
+		rptr = cq->rptr;
+
+		/* 
+		 * Keep the generation correct by bumping rptr until it
+		 * matches the index returned by the rearm - 1.
+	 	 */
+		while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret)
+			rptr++;
+
+		/* 
+		 * Now rptr is the index for the (last) cqe that was 
+	 	 * in-flight at the time the HW rearmed the CQ.  We 
+		 * spin until that CQE is valid.
+	 	 */
+		cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2);
+		while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) {
+			udelay(1);
+			if (i++ > 1000000) {
+				printk(KERN_ERR "%s: stalled rnic\n",
+				       rdev_p->dev_name);
+				BUG_ON(1);
+				return -EIO;
+			}
+		}
+	}
+	return 0;
+}
+
+static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cqid;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 0;		/* disable the CQ */
+	setup.credits = 0;
+	setup.credit_thres = 0;
+	setup.ovfl_mode = 0;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
+{
+	u64 sge_cmd;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = qpid << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe);
+
+	cq->cqid = cxio_hal_get_cqid(rdev_p->rscp);
+	if (!cq->cqid)
+		return -ENOMEM;
+	cq->sw_queue = kzalloc(size, GFP_KERNEL);
+	if (!cq->sw_queue)
+		return -ENOMEM;
+	cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     (1UL << (cq->size_log2)) *
+					     sizeof(struct t3_cqe),
+					     &(cq->dma_addr), GFP_KERNEL);
+	if (!cq->queue) {
+		kfree(cq->sw_queue);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(cq, mapping, cq->dma_addr);
+	memset(cq->queue, 0, size);
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = 65535;
+	setup.credit_thres = 1;
+	if (rdev_p->t3cdev_p->type == T3B)
+		setup.ovfl_mode = 0;
+	else
+		setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = setup.size;
+	setup.credit_thres = setup.size;	/* TBD: overflow recovery */
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	u32 qpid;
+	int i;
+
+	mutex_lock(&uctx->lock);
+	if (!list_empty(&uctx->qpids)) {
+		entry = list_entry(uctx->qpids.next, struct cxio_qpid_list, 
+				   entry);
+		list_del(&entry->entry);
+		qpid = entry->qpid;
+		kfree(entry);
+	} else {
+		qpid = cxio_hal_get_qpid(rdev_p->rscp);
+		if (!qpid) 
+			goto out;
+		for (i = qpid+1; i & rdev_p->qpmask; i++) {
+			entry = kmalloc(sizeof *entry, GFP_KERNEL);
+			if (!entry)
+				break;
+			entry->qpid = i;
+			list_add_tail(&entry->entry, &uctx->qpids);
+		}
+	}
+out:
+	mutex_unlock(&uctx->lock);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid, 
+		     struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	
+	entry = kmalloc(sizeof *entry, GFP_KERNEL);
+	if (!entry) 
+		return;
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	entry->qpid = qpid;
+	mutex_lock(&uctx->lock);
+	list_add_tail(&entry->entry, &uctx->qpids);
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct list_head *pos, *nxt;
+	struct cxio_qpid_list *entry;
+
+	mutex_lock(&uctx->lock);
+	list_for_each_safe(pos, nxt, &uctx->qpids) {
+		entry = list_entry(pos, struct cxio_qpid_list, entry);
+		list_del_init(&entry->entry);
+		if (!(entry->qpid & rdev_p->qpmask))
+			cxio_hal_put_qpid(rdev_p->rscp, entry->qpid);
+		kfree(entry);
+	}
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	INIT_LIST_HEAD(&uctx->qpids);
+	mutex_init(&uctx->lock);
+}
+
+int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain,
+		   struct t3_wq *wq, struct cxio_ucontext *uctx)
+{
+	int depth = 1UL << wq->size_log2;
+	int rqsize = 1UL << wq->rq_size_log2;
+
+	wq->qpid = get_qpid(rdev_p, uctx);
+	if (!wq->qpid)
+		return -ENOMEM;
+
+	wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL);
+	if (!wq->rq)
+		goto err1;
+
+	wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize);
+	if (!wq->rq_addr)
+		goto err2;
+
+	wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL);
+	if (!wq->sq)
+		goto err3;
+	
+	wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     depth * sizeof(union t3_wr),
+					     &(wq->dma_addr), GFP_KERNEL);
+	if (!wq->queue)
+		goto err4;
+
+	memset(wq->queue, 0, depth * sizeof(union t3_wr));
+	pci_unmap_addr_set(wq, mapping, wq->dma_addr);
+	wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	if (!kernel_domain)
+		wq->udb = (u64)rdev_p->rnic_info.udbell_physbase + 
+					(wq->qpid << rdev_p->qpshift);
+	PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__, 
+	     wq->qpid, wq->doorbell, wq->udb);
+	return 0;
+err4:
+	kfree(wq->sq);
+err3:
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize);
+err2:
+	kfree(wq->rq);
+err1:
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return -ENOMEM;
+}
+
+int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	int err;
+	err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid);
+	kfree(cq->sw_queue);
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (cq->size_log2))
+			  * sizeof(struct t3_cqe), cq->queue, 
+			  pci_unmap_addr(cq, mapping));
+	cxio_hal_put_cqid(rdev_p->rscp, cq->cqid);
+	return err;
+}
+
+int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (wq->size_log2))
+			  * sizeof(union t3_wr), wq->queue, 
+			  pci_unmap_addr(wq, mapping));
+	kfree(wq->sq);
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2));
+	kfree(wq->rq);
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return 0;
+}
+
+static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(T3_SEND) | 
+		         	 V_CQE_TYPE(0) |
+		         	 V_CQE_SWCQE(1) |
+		         	 V_CQE_QPID(wq->qpid) | 
+		         	 V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	u32 ptr;
+
+	PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq);
+
+	/* flush RQ */
+	PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__, 
+	    wq->rq_rptr, wq->rq_wptr, count);
+	ptr = wq->rq_rptr + count;
+	while (ptr++ != wq->rq_wptr)
+		insert_recv_cqe(wq, cq);
+}
+
+static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq, 
+		          struct t3_swsq *sqp)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__, 
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) | 
+			         V_CQE_OPCODE(sqp->opcode) |
+			         V_CQE_TYPE(1) |
+			         V_CQE_SWCQE(1) |
+			         V_CQE_QPID(wq->qpid) | 
+			         V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr, 
+						       cq->size_log2)));
+	cqe.u.scqe.wrid_hi = sqp->sq_wptr;
+
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	__u32 ptr;
+	struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2);
+
+	ptr = wq->sq_rptr + count;
+	sqp += count;
+	while (ptr != wq->sq_wptr) {
+		insert_sq_cqe(wq, cq, sqp);
+		sqp++;
+		ptr++;
+	}
+}
+
+/* 
+ * Move all CQEs from the HWCQ into the SWCQ.
+ */
+void cxio_flush_hw_cq(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe, *swcqe;
+
+	PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid);
+	cqe = cxio_next_hw_cqe(cq);
+	while (cqe) {
+		PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n", 
+		     __FUNCTION__, cq->rptr, cq->sw_wptr);
+		swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2);
+		*swcqe = *cqe;
+		swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1));
+		cq->sw_wptr++;
+		cq->rptr++;
+		cqe = cxio_next_hw_cqe(cq);
+	}
+}
+
+static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq)
+{
+	if (CQE_OPCODE(*cqe) == T3_TERMINATE) 
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) &&
+	    Q_EMPTY(wq->rq_rptr, wq->rq_wptr))
+		return 0;
+
+	return 1;
+}
+
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) && 
+		    (CQE_QPID(*cqe) == wq->qpid))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	PDBG("%s count zero %d\n", __FUNCTION__, *count);
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) && 
+		    (CQE_QPID(*cqe) == wq->qpid) && cqe_completes_wr(cqe, wq))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p)
+{
+	struct rdma_cq_setup setup;
+	setup.id = 0;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 1;		/* enable the CQ */
+	setup.credits = 0;
+
+	/* force SGE to redirect to RspQ and interrupt */
+	setup.credit_thres = 0;	
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	int err;
+	u64 sge_cmd, ctx0, ctx1;
+	u64 base_addr;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+
+
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	err = cxio_hal_init_ctrl_cq(rdev_p);
+	if (err) {
+		PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err);
+		return err;
+	}
+	rdev_p->ctrl_qp.workq = dma_alloc_coherent(
+					&(rdev_p->rnic_info.pdev->dev),
+					(1 << T3_CTRL_QP_SIZE_LOG2) *
+					sizeof(union t3_wr),
+					&(rdev_p->ctrl_qp.dma_addr), 
+					GFP_KERNEL);
+	if (!rdev_p->ctrl_qp.workq) {
+		PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping, 
+			   rdev_p->ctrl_qp.dma_addr);
+	rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	memset(rdev_p->ctrl_qp.workq, 0,
+	       (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr));
+
+	init_MUTEX(&rdev_p->ctrl_qp.sem);
+	init_waitqueue_head(&rdev_p->ctrl_qp.waitq);
+
+	/* update HW Ctrl QP context */
+	base_addr = rdev_p->ctrl_qp.dma_addr;
+	base_addr >>= 12;
+	ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) |
+		V_EC_BASE_LO((u32) base_addr & 0xffff));
+	ctx0 <<= 32;
+	ctx0 |= V_EC_CREDITS(FW_WR_NUM);
+	base_addr >>= 16;
+	ctx1 = (u32) base_addr;
+	base_addr >>= 32;
+	ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) |
+			V_EC_TYPE(0) | V_EC_GEN(1) |
+			V_EC_UP_TOKEN(T3_CTL_QP_TID) | F_EC_VALID)) << 32;
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1,
+		       T3_CTL_QP_TID, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	wqe->ctx1 = cpu_to_be64(ctx1);
+	wqe->ctx0 = cpu_to_be64(ctx0);
+	PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n",
+	     (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq,
+	     1 << T3_CTRL_QP_SIZE_LOG2);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << T3_CTRL_QP_SIZE_LOG2)
+			  * sizeof(union t3_wr), rdev_p->ctrl_qp.workq,
+			  pci_unmap_addr(&rdev_p->ctrl_qp, mapping));
+	return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID);
+}
+
+/* Write len bytes of data to addr (a 32B-aligned address).
+ * If data is NULL, clear len bytes of memory to zero.
+ * The caller acquires the sem before the call.
+ */
+static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr,
+				      u32 len, void *data, int completion)
+{
+	u32 i, nr_wqe, copy_len;
+	u8 *copy_data;
+	u8 wr_len, utx_len;	/* length in 8-byte flits */
+	enum t3_wr_flags flag;
+	__be64 *wqe;
+	u64 utx_cmd;
+	addr &= 0x7FFFFFF;
+	nr_wqe = len % 96 ? len / 96 + 1 : len / 96;	/* 96B max per WQE */
+	PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n",
+	     __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len, 
+	     nr_wqe, data, addr);
+	utx_len = 3;		/* in 32B unit */
+	for (i = 0; i < nr_wqe; i++) {
+		if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr,
+		           T3_CTRL_QP_SIZE_LOG2)) {
+			PDBG("%s ctrl_qp full wptr 0x%0x rptr 0x%0x, "
+			     "wait for more space i %d\n", __FUNCTION__, 
+			     rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i);
+			if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     !Q_FULL(rdev_p->ctrl_qp.rptr,
+						     rdev_p->ctrl_qp.wptr,
+						     T3_CTRL_QP_SIZE_LOG2))) {
+				PDBG("%s ctrl_qp workq interrupted\n",
+				     __FUNCTION__);
+				return -ERESTARTSYS;
+			}
+			PDBG("%s ctrl_qp wakeup, continue posting work request "
+			     "i %d\n", __FUNCTION__, i);
+		}
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+						(1 << T3_CTRL_QP_SIZE_LOG2)));
+		flag = 0;
+		if (i == (nr_wqe - 1)) {
+			/* last WQE */
+			flag = completion ? T3_COMPLETION_FLAG : 0;
+			if (len % 32)
+				utx_len = len / 32 + 1;
+			else
+				utx_len = len / 32;
+		}
+
+		/* 
+		 * Force a CQE to return the credit to the workq in case 
+		 * we posted more than half the max QP size of WRs 
+		 */
+		if ((i != 0) && 
+		    (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) {
+			flag = T3_COMPLETION_FLAG;
+			PDBG("%s force completion at i %d\n", __FUNCTION__, i);
+		}
+
+		/* build the utx mem command */
+		wqe += (sizeof(struct t3_bypass_wr) >> 3);
+		utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3);
+		utx_cmd <<= 32;
+		utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1);
+		*wqe = cpu_to_be64(utx_cmd);
+		wqe++;
+		copy_data = (u8 *) data + i * 96;
+		copy_len = len > 96 ? 96 : len;
+
+		/* clear memory content if data is NULL */
+		if (data)
+			memcpy(wqe, copy_data, copy_len);
+		else
+			memset(wqe, 0, copy_len);
+		if (copy_len % 32)
+			memset(((u8 *) wqe) + copy_len, 0,
+			       32 - (copy_len % 32));
+		wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 + 
+			 (utx_len << 2);
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+			      (1 << T3_CTRL_QP_SIZE_LOG2)));
+
+		/* wptr in the WRID[31:0] */
+		((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr;
+
+		/* 
+		 * This must be the last write with a memory barrier 
+		 * for the genbit 
+		 */
+		build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag,
+			       Q_GENBIT(rdev_p->ctrl_qp.wptr,
+					T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID,
+			       wr_len);
+		if (flag == T3_COMPLETION_FLAG)
+			ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID);
+		len -= 96;
+		rdev_p->ctrl_qp.wptr++;
+	}
+	return 0;
+}
+
+/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size
+ * OUT: stag index, actual pbl_size, pbl_addr allocated.
+ * TBD: shared memory region support
+ */
+static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
+			 u32 *stag, u8 stag_state, u32 pdid,
+			 enum tpt_mem_type type, enum tpt_mem_perm perm,
+			 u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl,
+			 u32 *pbl_size, u32 *pbl_addr)
+{
+	int err;
+	struct tpt_entry tpt;
+	u32 stag_idx;
+	u32 wptr;
+	int rereg = (*stag != T3_STAG_UNSET);
+
+	stag_state = stag_state > 0;
+	stag_idx = (*stag) >> 8;
+
+	if ((!reset_tpt_entry) && !(*stag != T3_STAG_UNSET)) {
+		stag_idx = cxio_hal_get_stag(rdev_p->rscp);
+		if (!stag_idx)
+			return -ENOMEM;
+		*stag = (stag_idx << 8) | ((*stag) & 0xFF);
+	}
+	PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n", 
+	     __FUNCTION__, stag_state, type, pdid, stag_idx);
+	
+	if (reset_tpt_entry) 
+		cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3);
+	else if (!rereg) {
+		*pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3);
+		if (!*pbl_addr) {
+			return -ENOMEM;
+		}
+	}
+
+	down_interruptible(&rdev_p->ctrl_qp.sem);
+
+	/* write PBL first if any - update pbl only if pbl list exist */
+	if (pbl) {
+
+		PDBG("%s *pbl_addr 0x%x, pbl_base 0x%x, pbl_size %d\n",
+		     __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base, 
+		     *pbl_size);
+		err = cxio_hal_ctrl_qp_write_mem(rdev_p, 
+				(*pbl_addr >> 5),
+				(*pbl_size << 3), pbl, 0);
+		if (err)
+			goto ret;
+	}
+
+	/* write TPT entry */
+	if (reset_tpt_entry)
+		memset(&tpt, 0, sizeof(tpt));
+	else {
+		tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID |
+				V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) |
+				V_TPT_STAG_STATE(stag_state) |
+				V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid));
+		BUG_ON(page_size >= 28);
+		tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) | 
+			    	F_TPT_MW_BIND_ENABLE |
+				V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) |
+				V_TPT_PAGE_SIZE(page_size));
+		tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 : 
+				    cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3));
+		tpt.len = cpu_to_be32(len);
+		tpt.va_hi = cpu_to_be32((u32) (to >> 32));
+		tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL));
+		tpt.rsvd_bind_cnt_or_pstag = 0;
+		tpt.rsvd_pbl_size = reset_tpt_entry ? 0 : 
+				  cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2));
+	}
+	err = cxio_hal_ctrl_qp_write_mem(rdev_p,
+				       stag_idx +
+				       (rdev_p->rnic_info.tpt_base >> 5),
+				       sizeof(tpt), &tpt, 1);
+
+	/* release the stag index to free pool */
+	if (reset_tpt_entry)
+		cxio_hal_put_stag(rdev_p->rscp, stag_idx);
+ret:	
+	wptr = rdev_p->ctrl_qp.wptr;
+	up(&rdev_p->ctrl_qp.sem);
+	if (!err)
+		if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     SEQ32_GE(rdev_p->ctrl_qp.rptr,
+						      wptr)))
+			return -ERESTARTSYS;
+	return err;
+}
+
+/* IN : stag key, pdid, pbl_size
+ * Out: stag index, actual pbl_size, and pbl_addr allocated.
+ */
+int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR, 
+			      perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr));
+}
+
+int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     &pbl_size, &pbl_addr);
+}
+
+int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid)
+{
+	u32 pbl_size = 0;
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0,
+			     NULL, &pbl_size, NULL);
+}
+
+int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     NULL, NULL);
+}
+
+int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
+{
+	struct t3_rdma_init_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+	PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p);
+	wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe));
+	wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT));
+	wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) |
+					   V_FW_RIWR_LEN(sizeof(*wqe) >> 3));
+	wqe->wrid.id1 = 0;
+	wqe->qpid = cpu_to_be32(attr->qpid);
+	wqe->pdid = cpu_to_be32(attr->pdid);
+	wqe->scqid = cpu_to_be32(attr->scqid);
+	wqe->rcqid = cpu_to_be32(attr->rcqid);
+	wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base);
+	wqe->rq_size = cpu_to_be32(attr->rq_size);
+	wqe->mpaattrs = attr->mpaattrs;
+	wqe->qpcaps = attr->qpcaps;
+	wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss);
+	wqe->flags = cpu_to_be32(attr->flags);
+	wqe->ord = cpu_to_be32(attr->ord);
+	wqe->ird = cpu_to_be32(attr->ird);
+	wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr);
+	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
+	wqe->rsvd = 0;
+	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = ev_cb;
+}
+
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = NULL;
+}
+
+static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb)
+{
+	static int cnt;
+	struct cxio_rdev *rdev_p = NULL;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x"
+	     " se %0x notify %0x cqbranch %0x creditth %0x\n",
+	     cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg),
+	     RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg),
+	     RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg),
+	     RSPQ_CREDIT_THRESH(rsp_msg));
+	PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d "
+	     "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe), 
+	     CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe), 
+	     CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe), 
+	     CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+	rdev_p = (struct cxio_rdev *)t3cdev_p->ulp;
+	if (!rdev_p) {
+		PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__,
+		     t3cdev_p);
+		return 0;
+	}
+	if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) {
+		rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1;
+		wake_up_interruptible(&rdev_p->ctrl_qp.waitq);
+		dev_kfree_skb_irq(skb);
+	} else if (CQE_QPID(rsp_msg->cqe) == 0xfff8)
+		dev_kfree_skb_irq(skb);
+	else if (cxio_ev_cb)
+		(*cxio_ev_cb) (rdev_p, skb);
+	else
+		dev_kfree_skb_irq(skb);
+	cnt++;
+	return 0;
+}
+
+/* Caller takes care of locking if needed */
+int cxio_rdev_open(struct cxio_rdev *rdev_p)
+{
+	struct net_device *netdev_p = NULL;
+	int err = 0;
+	if (strlen(rdev_p->dev_name)) {
+		if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) {
+			return -EBUSY;
+		}
+		netdev_p = dev_get_by_name(rdev_p->dev_name);
+		if (!netdev_p) {
+			return -EINVAL;
+		}
+		dev_put(netdev_p);
+	} else if (rdev_p->t3cdev_p) {
+		if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) {
+			return -EBUSY;
+		}
+		netdev_p = rdev_p->t3cdev_p->lldev;
+		strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name,
+			T3_MAX_DEV_NAME_LEN);
+	} else {
+		PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	if (cxio_hal_add_rdev(rdev_p))
+		return -ENOMEM;
+
+	PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name);
+	memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp));
+	if (!rdev_p->t3cdev_p)
+		rdev_p->t3cdev_p = T3CDEV(netdev_p);
+	rdev_p->t3cdev_p->ulp = (void *) rdev_p;
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS,
+					 &(rdev_p->rnic_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS,
+				    &(rdev_p->port_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+
+	/* 
+	 * qpshift is the number of bits to shift the qpid left in order
+	 * to get the correct address of the doorbell for that qp.
+	 */
+	cxio_init_ucontext(rdev_p, &rdev_p->uctx);
+	rdev_p->qpshift = PAGE_SHIFT - 
+			  ilog2(65536 >> 
+			            ilog2(rdev_p->rnic_info.udbell_len >> 
+					      PAGE_SHIFT));
+	rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT;
+	rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1;
+	PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d "
+	     "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n", 
+	     __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base, 
+  	     rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p), 
+  	     rdev_p->rnic_info.pbl_base, 
+  	     rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base,
+  	     rdev_p->rnic_info.rqt_top);
+	PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu "
+	     "qpnr %d qpmask 0x%x\n", 
+	     rdev_p->rnic_info.udbell_len, 
+	     rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr,
+	     rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask);
+
+	err = cxio_hal_init_ctrl_qp(rdev_p);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing ctrl_qp.\n", 
+		       __FUNCTION__, err);
+		goto err1;
+	}
+ 	err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0,
+				     0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ,
+				     T3_MAX_NUM_PD);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing hal resources.\n", 
+		       __FUNCTION__, err);
+		goto err2;
+	}
+ 	err = cxio_hal_pblpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing pbl mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err3;
+ 	}
+ 	err = cxio_hal_rqtpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing rqt mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err4;
+ 	}
+  	return 0;
+err4:
+ 	cxio_hal_pblpool_destroy(rdev_p);
+err3:
+ 	cxio_hal_destroy_resource(rdev_p->rscp);
+err2:
+	cxio_hal_destroy_ctrl_qp(rdev_p);
+err1:
+	cxio_hal_delete_rdev(rdev_p);
+	return err;
+}
+
+void cxio_rdev_close(struct cxio_rdev *rdev_p)
+{
+	if (rdev_p) {
+		cxio_hal_pblpool_destroy(rdev_p);
+		cxio_hal_rqtpool_destroy(rdev_p);
+		cxio_hal_delete_rdev(rdev_p);
+		rdev_p->t3cdev_p->ulp = NULL;
+		cxio_hal_destroy_ctrl_qp(rdev_p);
+		cxio_hal_destroy_resource(rdev_p->rscp);
+	}
+}
+
+int __init cxio_hal_init(void)
+{
+	if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI))
+		return -ENOMEM;
+	memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *));
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler);
+	return 0;
+}
+
+void __exit cxio_hal_exit(void)
+{
+	int i;
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL);
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		cxio_rdev_close(rdev_tbl[i]);
+	cxio_hal_destroy_rhdl_resource();
+}
+
+static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_swsq *sqp;
+	__u32 ptr = wq->sq_rptr;
+	int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr);
+	
+	sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
+	while (count--)
+		if (!sqp->signaled) {
+			ptr++;
+			sqp = wq->sq + Q_PTR2IDX(ptr,  wq->sq_size_log2);
+		} else if (sqp->complete) {
+
+			/* 
+			 * Insert this completed cqe into the swcq.
+			 */
+			PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n",
+			     __FUNCTION__, Q_PTR2IDX(ptr,  wq->sq_size_log2),
+			     Q_PTR2IDX(cq->sw_wptr, cq->size_log2));
+			sqp->cqe.header |= htonl(V_CQE_SWCQE(1));
+			*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) 
+				= sqp->cqe;
+			cq->sw_wptr++;
+			sqp->signaled = 0;
+			break;
+		} else
+			break;
+}
+
+static inline void create_read_req_cqe(struct t3_wq *wq,
+				       struct t3_cqe *hw_cqe,
+				       struct t3_cqe *read_cqe)
+{
+	read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr;
+	read_cqe->len = wq->oldest_read->read_len;
+	read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) |
+				 V_CQE_SWCQE(SW_CQE(*hw_cqe)) |
+				 V_CQE_OPCODE(T3_READ_REQ) |
+				 V_CQE_TYPE(1));
+}
+
+/*
+ * Advance wq->oldest_read to the next read wr in the SWSQ, or NULL if none.
+ */
+static inline void advance_oldest_read(struct t3_wq *wq)
+{
+
+	u32 rptr = wq->oldest_read - wq->sq + 1;
+	u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2);
+
+	while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) {
+		wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2);
+
+		if (wq->oldest_read->opcode == T3_READ_REQ)
+			return;
+		rptr++;
+	}
+	wq->oldest_read = NULL;
+}
+
+/*
+ * cxio_poll_cq
+ *
+ * Caller must:
+ *     check the validity of the first CQE,
+ *     supply the wq associated with the qpid.
+ *
+ * credit: cq credit to return to sge.
+ * cqe_flushed: 1 iff the CQE is flushed.
+ * cqe: copy of the polled CQE.
+ *
+ * return value:
+ *     0       CQE returned,
+ *    -1       CQE skipped, try again.
+ */
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit)
+{
+	int ret = 0;
+	struct t3_cqe *hw_cqe, read_cqe;
+
+	*cqe_flushed = 0;
+	*credit = 0;
+	hw_cqe = cxio_next_cqe(cq);
+
+	PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x"
+	     " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n", 
+	     __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe), 
+	     CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe), 
+	     CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe), 
+	     CQE_WRID_LOW(*hw_cqe));
+
+	/* 
+	 * skip cqe's not affiliated with a QP.
+	 */
+	if (wq == NULL) {
+		ret = -1;
+		goto skip_cqe;
+	}
+
+	/*
+	 * Gotta tweak READ completions:
+	 * 	1) the cqe doesn't contain the sq_wptr from the wr.
+	 *	2) opcode not reflected from the wr.
+	 *	3) read_len not reflected from the wr.
+	 *	4) cq_type is RQ_TYPE not SQ_TYPE.
+	 */
+	if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) {
+		
+		/* 
+	 	 * Don't write to the HWCQ, so create a new read req CQE 
+		 * in local memory.
+		 */
+		create_read_req_cqe(wq, hw_cqe, &read_cqe);
+		hw_cqe = &read_cqe;
+		advance_oldest_read(wq);
+	}
+
+	/*
+ 	 * T3A: Discard TERMINATE CQEs.
+	 */
+	if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) {
+		ret = -1;
+		wq->error = 1;
+		goto skip_cqe;
+	}
+
+	if (CQE_STATUS(*hw_cqe) || wq->error) {
+		*cqe_flushed = wq->error;
+		wq->error = 1;
+	
+		/* 
+		 * T3A inserts errors into the CQE.  We cannot return 
+	 	 * these as work completions.
+	 	 */
+		/* incoming write failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE) 
+		     && RQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		/* incoming read request failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+
+		/* incoming SEND with no receive posted failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) &&
+		    Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/*
+	 * RECV completion.
+	 */
+	if (RQ_TYPE(*hw_cqe)) {
+
+		/* 
+		 * HW only validates 4 bits of MSN.  So we must validate that
+		 * the MSN in the SEND is the next expected MSN.  If it's not,
+		 * then we complete this with TPT_ERR_MSN and mark the wq in 
+		 * error.
+		 */
+		if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) {
+			wq->error = 1;
+			hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN));
+			goto proc_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/* 
+ 	 * If we get here it's a send completion.
+	 *
+	 * Handle out of order completion. These get stuffed
+	 * in the SW SQ. Then the SW SQ is walked to move any
+	 * now in-order completions into the SW CQ.  This handles
+	 * 2 cases:
+	 * 	1) reaping unsignaled WRs when the first subsequent
+	 *	   signaled WR is completed.
+	 *	2) out of order read completions.
+	 */
+	if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) {
+		struct t3_swsq *sqp;
+
+		PDBG("%s out of order completion going in swsq at idx %ld\n",
+		     __FUNCTION__, 
+		     Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2));
+		sqp = wq->sq + 
+		      Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2);
+		sqp->cqe = *hw_cqe;
+		sqp->complete = 1;
+		ret = -1;
+		goto flush_wq;
+	}
+	
+proc_cqe:
+	*cqe = *hw_cqe;
+
+	/*
+	 * Reap the associated WR(s) that are freed up with this
+	 * completion.
+	 */
+	if (SQ_TYPE(*hw_cqe)) {
+		wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe);
+		PDBG("%s completing sq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2));
+		*cookie = (wq->sq + 
+			   Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id;
+		wq->sq_rptr++;
+	} else {
+		PDBG("%s completing rq idx %ld\n", __FUNCTION__, 
+		     Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		*cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		wq->rq_rptr++;
+	}
+
+flush_wq:
+	/*
+	 * Flush any completed cqes that are now in-order.
+	 */
+	flush_completed_wrs(wq, cq);
+
+skip_cqe:
+	if (SW_CQE(*hw_cqe)) {
+		PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->sw_rptr);
+		++cq->sw_rptr;
+	} else {
+		PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n", 
+		     __FUNCTION__, cq, cq->cqid, cq->rptr);
+		++cq->rptr;
+
+		/*
+		 * T3A: compute credits.
+		 */
+		if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1)))
+		    || ((cq->rptr - cq->wptr) >= 128)) {
+			*credit = cq->rptr - cq->wptr;
+			cq->wptr = cq->rptr;
+		}
+	}
+	return ret;
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
new file mode 100644
index 0000000..bde5cfb
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef  __CXIO_HAL_H__
+#define  __CXIO_HAL_H__
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#include "t3_cpl.h"
+#include "t3cdev.h"
+#include "cxgb3_ctl_defs.h"
+#include "cxio_wr.h"
+
+#define T3_CTRL_QP_ID    FW_RI_SGEEC_START
+#define T3_CTL_QP_TID	 FW_RI_TID_START
+#define T3_CTRL_QP_SIZE_LOG2  8
+#define T3_CTRL_CQ_ID    0
+
+/* TBD */
+#define T3_MAX_NUM_RNIC  8
+#define T3_MAX_NUM_RI (1<<15)
+#define T3_MAX_NUM_QP (1<<15)
+#define T3_MAX_NUM_CQ (1<<15)
+#define T3_MAX_NUM_PD (1<<15)
+#define T3_MAX_PBL_SIZE 256
+#define T3_MAX_RQ_SIZE 1024
+#define T3_MAX_NUM_STAG (1<<15)
+
+#define T3_STAG_UNSET 0xffffffff
+
+#define T3_MAX_DEV_NAME_LEN 32
+
+struct cxio_hal_ctrl_qp {
+	u32 wptr;
+	u32 rptr;
+	struct semaphore sem;	/* for the wptr, can sleep */
+	wait_queue_head_t waitq;	/* wait for RspQ/CQE msg */
+	union t3_wr *workq;	/* the work request queue */
+	dma_addr_t dma_addr;	/* pci bus address of the workq */
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	void __iomem *doorbell;
+};
+
+struct cxio_hal_resource {
+	struct kfifo *tpt_fifo;
+	spinlock_t tpt_fifo_lock;
+	struct kfifo *qpid_fifo;
+	spinlock_t qpid_fifo_lock;
+	struct kfifo *cqid_fifo;
+	spinlock_t cqid_fifo_lock;
+	struct kfifo *pdid_fifo;
+	spinlock_t pdid_fifo_lock;
+};
+
+struct cxio_qpid_list {
+	struct list_head entry;
+	u32 qpid;
+};
+
+struct cxio_ucontext {
+	struct list_head qpids;
+	struct mutex lock;
+};
+
+struct cxio_rdev {
+	char dev_name[T3_MAX_DEV_NAME_LEN];
+	struct t3cdev *t3cdev_p;
+	struct rdma_info rnic_info;
+	struct adap_ports port_info;
+	struct cxio_hal_resource *rscp;
+	struct cxio_hal_ctrl_qp ctrl_qp;
+	void *ulp;
+	unsigned long qpshift;
+	u32 qpnr;
+	u32 qpmask;
+	struct cxio_ucontext uctx;
+	struct gen_pool *pbl_pool;
+	struct gen_pool *rqt_pool;
+};
+
+static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
+{
+	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
+}
+
+typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p,
+					     struct sk_buff * skb);
+
+#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff)
+#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff)
+#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1)
+#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1)
+#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1)
+#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1)
+#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1)
+#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1)
+#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1)
+
+struct respQ_msg_t {
+	__be32 flags;		/* flit 0 */
+	__be32 cq_ptrid;
+	__be64 rsvd;		/* flit 1 */
+	struct t3_cqe cqe;	/* flits 2-3 */
+};
+
+enum t3_cq_opcode {
+	CQ_ARM_AN = 0x2,
+	CQ_ARM_SE = 0x6,
+	CQ_FORCE_AN = 0x3,
+	CQ_CREDIT_UPDATE = 0x7
+};
+
+int cxio_rdev_open(struct cxio_rdev *rdev);
+void cxio_rdev_close(struct cxio_rdev *rdev);
+int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq, 
+	 	   enum t3_cq_opcode op, u32 credit);
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid);
+int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq,
+		   struct cxio_ucontext *uctx);
+int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq, 
+		    struct cxio_ucontext *uctx);
+int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode);
+int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr);
+int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size, 
+		   u32 pbl_addr);
+int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
+int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag);
+int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr);
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+u32 cxio_hal_get_rhdl(void);
+void cxio_hal_put_rhdl(u32 rhdl);
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp);
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid);
+int __init cxio_hal_init(void);
+void __exit cxio_hal_exit(void);
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_flush_hw_cq(struct t3_cq *cq);
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe, 
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+
+#define MOD "iw_cxgb3: "
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
+
+#ifdef DEBUG
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag);
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift);
+void cxio_dump_wqe(union t3_wr *wqe);
+void cxio_dump_wce(struct t3_cqe *wce);
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents);
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid);
+#endif
+
+#endif
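
Note on the chunking arithmetic in cxio_hal_ctrl_qp_write_mem(): the
function splits a write of len bytes into work requests carrying at most
96 bytes each, and counts each payload in 32-byte units (3 units for a
full request, fewer for a short final one). The following is a minimal
userspace sketch of just that arithmetic, not driver code. The names
sketch_nr_wqe() and sketch_utx_len() are made up for illustration and do
not appear in the patch.

```c
#include <stdint.h>

#define WQE_PAYLOAD_MAX 96u	/* payload bytes per work request, as in the patch */
#define UTX_UNIT        32u	/* payload is counted in 32-byte units */

/* Number of work requests needed for len bytes: ceil(len / 96). */
static uint32_t sketch_nr_wqe(uint32_t len)
{
	return len / WQE_PAYLOAD_MAX + (len % WQE_PAYLOAD_MAX ? 1 : 0);
}

/*
 * Payload size, in 32-byte units, of work request i (0-based).
 * The caller must pass i < sketch_nr_wqe(len).
 */
static uint32_t sketch_utx_len(uint32_t len, uint32_t i)
{
	uint32_t remaining = len - i * WQE_PAYLOAD_MAX;
	uint32_t chunk = remaining > WQE_PAYLOAD_MAX ? WQE_PAYLOAD_MAX
						     : remaining;

	/* Round the chunk up to whole 32-byte units. */
	return chunk / UTX_UNIT + (chunk % UTX_UNIT ? 1 : 0);
}
```

For a 200-byte write this gives three work requests: two full ones of
3 units each and a final one of 1 unit holding the last 8 bytes, which
the driver pads with zeroes up to the 32-byte boundary.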


From swise at opengridcomputing.com  Thu Dec 14 05:58:07 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:58:07 -0600
Subject: [openib-general] [PATCH  v4 11/13] Core Resource Allocation
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135807.21159.36678.stgit@dell3.ogc.int>


Core functions to carve up adapter memory, stag, qp, and cq IDs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
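
Note on the randomized fill in __cxio_init_resource_fifo(): IDs are fed
through a 16-entry window. Each step emits a randomly chosen window slot
and refills it with the next unseen ID, and the window is drained at the
end, so every ID is emitted exactly once but in a locally shuffled
order. The sketch below is a userspace illustration under assumptions:
a plain array stands in for the kfifo, rand() stands in for random32(),
and one random draw is made per slot instead of slicing bits out of a
shared 32-bit word as the patch does. The name sketch_fill_ids() is
hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

#define WINDOW 16u	/* mirrors RANDOM_SIZE in the patch */

/*
 * Fill out[] with every ID in [lo, hi) exactly once, in a locally
 * shuffled order. Requires hi - lo >= WINDOW and room for hi - lo
 * entries in out[]. Returns the number of IDs written.
 */
static uint32_t sketch_fill_ids(uint32_t *out, uint32_t lo, uint32_t hi)
{
	uint32_t window[WINDOW];
	uint32_t i, n = 0;

	/* Prime the window with the first WINDOW IDs. */
	for (i = 0; i < WINDOW; i++)
		window[i] = lo + i;

	for (i = lo + WINDOW; i < hi; i++) {
		uint32_t idx = (uint32_t)rand() % WINDOW;

		out[n++] = window[idx];	/* emit a randomly chosen slot */
		window[idx] = i;	/* refill it with the next unseen ID */
	}

	/* Drain whatever is still sitting in the window. */
	for (i = 0; i < WINDOW; i++)
		out[n++] = window[i];
	return n;
}
```

Because an ID can only leave the window once it has entered it, the
output is a permutation of the input range, which is what keeps the
allocator from handing out duplicate IDs.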

 drivers/infiniband/hw/cxgb3/core/cxio_resource.c |  331 ++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_resource.h |   70 +++++
 2 files changed, 401 insertions(+), 0 deletions(-)
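
Note on cxio_init_qpid_fifo(): it only queues QP IDs i, starting at 16,
for which (i & qpmask) is zero, where qpmask is the value computed in
cxio_rdev_open(). With a mask of the form 2^k - 1 that means only
multiples of 2^k are handed out. A small userspace sketch of the same
filter follows. The function name, the output array and the cap
parameter are assumptions made for illustration, and the array replaces
the kfifo used by the driver.

```c
#include <stdint.h>

/*
 * Collect the IDs in [first, max) whose bits under qpmask are all
 * zero, the same test cxio_init_qpid_fifo() applies. Stores at most
 * cap IDs in out[] and returns how many were stored.
 */
static uint32_t sketch_collect_qpids(uint32_t *out, uint32_t cap,
				     uint32_t first, uint32_t max,
				     uint32_t qpmask)
{
	uint32_t i, n = 0;

	for (i = first; i < max && n < cap; i++)
		if (!(i & qpmask))
			out[n++] = i;
	return n;
}
```

With qpmask 0xF, first 16 and max 64 this yields 16, 32 and 48. With
qpmask 0 every ID in the range is kept.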

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
new file mode 100644
index 0000000..444df15
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
@@ -0,0 +1,331 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+/* Crude resource management */
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+
+static struct kfifo *rhdl_fifo;
+static spinlock_t rhdl_fifo_lock;
+
+#define RANDOM_SIZE 16
+
+static int __cxio_init_resource_fifo(struct kfifo **fifo,
+				   spinlock_t *fifo_lock,
+				   u32 nr, u32 skip_low,
+				   u32 skip_high,
+				   int random)
+{
+	u32 i, j, entry = 0, idx;
+	u32 random_bytes;
+	u32 rarray[16];
+	spin_lock_init(fifo_lock);
+
+	*fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock);
+	if (IS_ERR(*fifo))
+		return -ENOMEM;
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		__kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32));
+	if (random) {
+		j = 0;
+		random_bytes = random32();
+		for (i = 0; i < RANDOM_SIZE; i++)
+			rarray[i] = i + skip_low;
+		for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
+			if (j >= RANDOM_SIZE) {
+				j = 0;
+				random_bytes = random32();
+			}
+			idx = (random_bytes >> (j * 2)) & 0xF;
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[idx],
+				sizeof(u32));
+			rarray[idx] = i;
+			j++;	
+		}
+		for (i = 0; i < RANDOM_SIZE; i++)
+			__kfifo_put(*fifo, 
+				(unsigned char *) &rarray[i],
+				sizeof(u32));
+	} else
+		for (i = skip_low; i < nr - skip_high; i++)
+			__kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32));
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32));
+	return 0;
+}
+
+static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 0));
+}
+
+static int cxio_init_resource_fifo_random(struct kfifo **fifo,
+				   spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low, 
+					  skip_high, 1));
+}
+
+static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p)
+{
+	u32 i;
+
+	spin_lock_init(&rdev_p->rscp->qpid_fifo_lock);
+
+	rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32), 
+					      GFP_KERNEL, 
+					      &rdev_p->rscp->qpid_fifo_lock);
+	if (IS_ERR(rdev_p->rscp->qpid_fifo))
+		return -ENOMEM;
+
+	for (i = 16; i < T3_MAX_NUM_QP; i++)
+		if (!(i & rdev_p->qpmask))
+			__kfifo_put(rdev_p->rscp->qpid_fifo, 
+				    (unsigned char *) &i, sizeof(u32));
+	return 0;
+}
+
+int cxio_hal_init_rhdl_resource(u32 nr_rhdl)
+{
+	return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1,
+				       0);
+}
+
+void cxio_hal_destroy_rhdl_resource(void)
+{
+	kfifo_free(rhdl_fifo);
+}
+
+/* nr_* must be power of 2 */
+int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+			   u32 nr_tpt, u32 nr_pbl,
+			   u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid)
+{
+	int err = 0;
+	struct cxio_hal_resource *rscp;
+
+	rscp = kmalloc(sizeof(*rscp), GFP_KERNEL);
+	if (!rscp)
+		return -ENOMEM;
+	rdev_p->rscp = rscp;
+	err = cxio_init_resource_fifo_random(&rscp->tpt_fifo,
+				      &rscp->tpt_fifo_lock, 
+				      nr_tpt, 1, 0);
+	if (err)
+		goto tpt_err;
+	err = cxio_init_qpid_fifo(rdev_p);
+	if (err)
+		goto qpid_err;
+	err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock, 
+				      nr_cqid, 1, 0);
+	if (err)
+		goto cqid_err;
+	err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock, 
+				      nr_pdid, 1, 0);
+	if (err)
+		goto pdid_err;
+	return 0;
+pdid_err:
+	kfifo_free(rscp->cqid_fifo);
+cqid_err:
+	kfifo_free(rscp->qpid_fifo);
+qpid_err:
+	kfifo_free(rscp->tpt_fifo);
+tpt_err:
+	return -ENOMEM;
+}
+
+/*
+ * returns 0 if no resource available
+ */
+static inline u32 cxio_hal_get_resource(struct kfifo *fifo)
+{
+	u32 entry;
+	if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32)))
+		return entry;
+	else
+		return 0;	/* fifo empty */
+}
+
+static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry)
+{
+	BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0);
+}
+
+u32 cxio_hal_get_rhdl(void)
+{
+	return cxio_hal_get_resource(rhdl_fifo);
+}
+
+void cxio_hal_put_rhdl(u32 rhdl)
+{
+	cxio_hal_put_resource(rhdl_fifo, rhdl);
+}
+
+u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->tpt_fifo);
+}
+
+void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag)
+{
+	cxio_hal_put_resource(rscp->tpt_fifo, stag);
+}
+
+u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp)
+{
+	u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid)
+{
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	cxio_hal_put_resource(rscp->qpid_fifo, qpid);
+}
+
+u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->cqid_fifo);
+}
+
+void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid)
+{
+	cxio_hal_put_resource(rscp->cqid_fifo, cqid);
+}
+
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->pdid_fifo);
+}
+
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid)
+{
+	cxio_hal_put_resource(rscp->pdid_fifo, pdid);
+}
+
+void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp)
+{
+	kfifo_free(rscp->tpt_fifo);
+	kfifo_free(rscp->cqid_fifo);
+	kfifo_free(rscp->qpid_fifo);
+	kfifo_free(rscp->pdid_fifo);
+	kfree(rscp);
+}
+
+/*
+ * PBL Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_PBL_SHIFT 8			/* 256B == min PBL size (32 entries) */
+#define PBL_CHUNK (2*1024*1024)
+
+u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size);
+	return (u32)addr;
+}
+
+void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size);
+	gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size);
+}
+
+int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1);
+	if (rdev_p->pbl_pool)
+		for (i = rdev_p->rnic_info.pbl_base; 
+		     i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1; 
+		     i += PBL_CHUNK)
+			gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1);
+	return rdev_p->pbl_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->pbl_pool);
+}
+
+/*
+ * RQT Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_RQT_SHIFT 10	/* 1KB == min RQT size (16 entries) */
+#define RQT_CHUNK (2*1024*1024)
+
+u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6);
+	return (u32)addr;
+}
+
+void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6);
+	gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6);
+}
+
+int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1);
+	if (rdev_p->rqt_pool)
+		for (i = rdev_p->rnic_info.rqt_base; 
+		     i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1; 
+		     i += RQT_CHUNK)
+			gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1);
+	return rdev_p->rqt_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->rqt_pool);
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
new file mode 100644
index 0000000..a6bbe83
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_RESOURCE_H__
+#define __CXIO_RESOURCE_H__
+
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/genalloc.h>
+#include "cxio_hal.h"
+
+extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl);
+extern void cxio_hal_destroy_rhdl_resource(void);
+extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+				  u32 nr_tpt, u32 nr_pbl,
+				  u32 nr_rqt, u32 nr_qpid, u32 nr_cqid,
+				  u32 nr_pdid);
+extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag);
+extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid);
+extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid);
+extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp);
+
+#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base )
+extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+
+#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base )
+extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+#endif


From swise at opengridcomputing.com  Thu Dec 14 05:58:38 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:58:38 -0600
Subject: [openib-general] [PATCH  v4 12/13] Core Debug functions
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135837.21159.29330.stgit@dell3.ogc.int>


Debug code to dump various data structs, some of which are in 
adapter memory.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_dbg.c |  205 +++++++++++++++++++++++++++
 1 files changed, 205 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
new file mode 100644
index 0000000..22f4f75
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifdef DEBUG
+#include <linux/types.h>
+#include "common.h"
+#include "cxgb3_ioctl.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag) 
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size = 32;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base;
+	m->len = size;
+	PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift)
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size, npages;
+
+	shift += 12;
+	npages = (len + (1ULL << shift) - 1) >> shift;
+	size = npages * sizeof(u64);
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = pbl_addr;
+	m->len = size;
+	PDBG("%s PBL addr 0x%x len %d depth %d\n", 
+		__FUNCTION__, m->addr, m->len, npages);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_wqe(union t3_wr *wqe)
+{
+	__be64 *data = (__be64 *)wqe;
+	uint size = (uint)(be64_to_cpu(*data) & 0xff);
+
+	if (size == 0) 
+		size = 8;
+	while (size > 0) {
+		PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data));
+		size--;
+		data++;
+	}
+}
+
+void cxio_dump_wce(struct t3_cqe *wce)
+{
+	__be64 *data = (__be64 *)wce;
+	int size = sizeof(*wce);
+
+	while (size > 0) {
+		PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data));
+		size -= 8;
+		data++;
+	}
+}
+
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents)
+{
+	struct ch_mem_range *m;
+	int size = nents * 64;
+	u64 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base;
+	m->len = size;
+	PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid)
+{
+	struct ch_mem_range *m;
+	int size = TCB_SIZE;
+	u32 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_CM;
+	m->addr = hwtid * size; 
+	m->len = size;
+	PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u32 *)m->buf;
+	while (size > 0) {
+		printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n", 
+			m->addr, 
+			*(data+2), *(data+3), *(data),*(data+1),
+			*(data+6), *(data+7), *(data+4), *(data+5));
+		size -= 32;
+		data += 8;
+		m->addr += 32;
+	}
+	kfree(m);
+}
+#endif


From swise at opengridcomputing.com  Thu Dec 14 05:59:08 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 07:59:08 -0600
Subject: [openib-general] [PATCH  v4 13/13] Kconfig/Makefile
In-Reply-To: <20061214135233.21159.78613.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
Message-ID: <20061214135908.21159.80049.stgit@dell3.ogc.int>


Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/Kconfig           |    1 +
 drivers/infiniband/Makefile          |    1 +
 drivers/infiniband/hw/cxgb3/Kconfig  |   27 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/Makefile |   12 ++++++++++++
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)		+= hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH)		+= hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA)		+= hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100)	+= hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3)		+= hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 0000000..d3db264
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+	tristate "Chelsio RDMA Driver"
+	depends on CHELSIO_T3 && INFINIBAND
+	select GENERIC_ALLOCATOR
+	---help---
+	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+	  10GbE adapters.
+
+	  For general information about Chelsio and our products, visit
+	  our website at <http://www.chelsio.com>.
+
+	  For customer support, please visit our customer support page at
+	  <http://www.chelsio.com/support.htm>.
+
+	  Please send feedback to <linux-bugs at chelsio.com>.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_CXGB3
+	default n
+	---help---
+	  This option causes the Chelsio RDMA driver to produce copious
+	  amounts of debug messages.  Select this if you are developing
+	  the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..7a89f6d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+		-I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core 
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y :=  iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+	       iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -g 
+iw_cxgb3-y += core/cxio_dbg.o
+endif


From halr at voltaire.com  Thu Dec 14 05:57:11 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 08:57:11 -0500
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <2875.47466.qm@web8317.mail.in.yahoo.com>
References: <2875.47466.qm@web8317.mail.in.yahoo.com>
Message-ID: <1166104604.28709.126501.camel@hal.voltaire.com>

On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote:
> thanks for your reply,
> 
> >The driver is needed to obtain the information for the IB node to
> fill
> >in the MADs for response to the SMA query. It may also issue some
> traps.
> >Similarly for PMA as well.
> 
> Do u mean to say that HCA driver is needed to pass the HCA related
> information (like GID, GUID, port_info etc..) to the SMA so that it
> can reply to query(or GET ) MADs.

Yes.

>  Isn't SMA capable of doing the same by using "query_(gid, pkey,
> port)" verbs.

One reason I can think of is that not all the needed information is
available via verbs. I think there are some others as well.

> And final  questions  if it is really required to implement
> 'process_mad' in HCA driver then why it is not specified in the IB
> specifications.

The IB spec defines architecture, not implementation.

> Whose duty is this (replying to query MADs) according to the IB
> psec.s(its duty of SMA right?)

Depends on the MAD but if you are referring to the SMA queries, then yes
it is the SMA's responsibility.

> I have observed that process_mad is not implemented in the IBM's eHCA
> driver. what is the case with it?

With eHCA, QP0 is not exposed to the host (at least currently) and the
SMA is totally implemented in firmware.

> PS: I am considering only SMA in the host s/w here.

This is a design choice.

-- Hal

> regards,
> K.Mahesh.
> 
> 
> 
> 
> Hal Rosenstock <halr at voltaire.com> wrote:
>         On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
>         > Hello all,
>         > 
>         > I want to know from u people that isi it necessary to
>         implement the
>         > process_mad for a HCA.
>         > 
>         > After looking into the implementations of process_mad in
>         ipath and
>         > mthca drivers i have fount that they are used to reply the
>         MADs with
>         > port_info,gid_info,sm_info etc..
>         > 
>         > But isn't it handled by SMA in the host......
>         
>         The SMA can either be in the host or in firmware (as is
>         typical with the
>         Mellanox silicon).
>         
>         > i am little bit confused now .
>         > please just whether it is required to implement process_mad
>         (suppose)
>         > for new HCA driver....
>         
>         It is. For an example of a host (software SMA), see
>         drivers/infiniband/hw/ipath/ipath_mad.c
>         
>         > if it is required why?
>         
>         The driver is needed to obtain the information for the IB node
>         to fill
>         in the MADs for response to the SMA query. It may also issue
>         some traps.
>         Similarly for PMA as well.
>         
>         -- Hal
>         
>         > Please CC your replies to me.
>         > 
>         > regards,
>         > K.Mahesh.


From rdreier at cisco.com  Thu Dec 14 06:31:09 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 06:31:09 -0800
Subject: [openib-general] (no subject)
References: <ada8xhctztu.fsf@cisco.com> <457FB82B.4090902@voltaire.com>
	<adavekfqvhd.fsf@cisco.com> <45810901.3090209@voltaire.com>
Message-ID: <adaslfio8gi.fsf@cisco.com>

 > mmm, I understand all the comments raised during the review were fixed
 > in the V3 post below, and now you say its both wrong and ugly... for
 > example what's wrong here?

I take back the "wrong" statement; I misread the patch just now.  But if
you don't think the patch is ugly then I don't think we're looking at
the same thing.

For example

 > +static int __devinit mthca_check_profile_value(int* pval, int pval_default){

and so on...


From philippe_bernadat at hp.com  Thu Dec 14 06:39:32 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 15:39:32 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net>

So I did try tune_pci=1. It didn't make any difference.

I used the same nodes to compare the lspci output.
I could see:

[root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed
30c30
<       Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32
---
>       Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
38,39c38,39
< 40: 11 50 1f 80 00 20 08 00 00 22 08 00 00 00 00 00
< 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
---
> 40: 11 50 1f 00 00 20 08 00 00 22 08 00 00 00 00 00
> 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00

So I added the ib_mthca msi_x=1 option.
It didn't help.

So the only remaining difference now is:

[root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 
39c39
< 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
---
> 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00

No idea what this is.

Philippe
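
For reference, the two config-space lines in the diffs above can be decoded by hand. The sketch below is only an illustration: it assumes the bytes at 0x40 and 0x50 follow the standard PCI capability layout (ID 0x11 = MSI-X, ID 0x03 = Vital Product Data), and every helper name in it is made up for this example. Under that assumption, the byte that differs at 0x53 holds bit 15 of the VPD address register, which is a completion flag for VPD reads and writes rather than a tuning setting. The earlier diff quoted below shows the same bit the other way round, which would fit that reading, but none of this has been confirmed against the HCA documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Config-space bytes 0x40..0x5f, copied from the lspci diffs above. */
#define DUMP_BASE 0x40
#define DUMP_LEN  32

#define CAP_ID_VPD  0x03	/* Vital Product Data capability */
#define CAP_ID_MSIX 0x11	/* MSI-X capability */

static const uint8_t cfg_vib[DUMP_LEN] = {
	0x11, 0x50, 0x1f, 0x80, 0x00, 0x20, 0x08, 0x00,
	0x00, 0x22, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x03, 0x60, 0xff, 0xff, 0x11, 0x11, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

static const uint8_t cfg_ofed[DUMP_LEN] = {
	0x11, 0x50, 0x1f, 0x00, 0x00, 0x20, 0x08, 0x00,
	0x00, 0x22, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x03, 0x60, 0xff, 0x7f, 0x11, 0x11, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};

/* Pointer to the capability structure at config offset 'off'. */
static const uint8_t *cap_at(const uint8_t *dump, unsigned int off)
{
	assert(off >= DUMP_BASE && off + 4 <= DUMP_BASE + DUMP_LEN);
	return dump + (off - DUMP_BASE);
}

/* Config space is little-endian. */
static uint16_t le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* MSI-X message control (cap + 2): bit 15 = enable, bits 10:0 = size - 1. */
static int msix_enabled(const uint8_t *cap)
{
	return (le16(cap + 2) >> 15) & 1;
}

static unsigned int msix_table_size(const uint8_t *cap)
{
	return (le16(cap + 2) & 0x7ffu) + 1;
}

/* VPD address register (cap + 2): bit 15 = flag, bits 14:0 = address. */
static int vpd_flag(const uint8_t *cap)
{
	return (le16(cap + 2) >> 15) & 1;
}

static unsigned int vpd_address(const uint8_t *cap)
{
	return le16(cap + 2) & 0x7fffu;
}
```

With these helpers, msix_table_size(cap_at(cfg_vib, 0x40)) gives 32, matching the "TabSize=32" that lspci prints, and vpd_flag() is the only thing that differs between the two 0x50 lines.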

> -----Original Message-----
> From: Bernadat, Philippe 
> Sent: Thursday, December 14, 2006 1:24 PM
> To: Tziporet Koren
> Cc: Eric Barton; Roland Dreier; Matt Leininger; 
> openib-general at openib.org
> Subject: RE: [openib-general] Performance Degradation with 
> OFED v. Voltaire
> 
> 
> > Have you tried running with
> > 
> > options ib_mthca tune_pci =1
> > 
> 
> My understanding is that this is no longer required with
> OFED-1.1. It used to make a significant difference with
> OFED-1.0, but I didn't observe it with OFED-1.1.
> 
> And again, the user-mode performance is comparable between
> VIB and OFED.
> 
> Philippe
> 
> > -----Original Message-----
> > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] 
> > Sent: Thursday, December 14, 2006 12:30 PM
> > To: Bernadat, Philippe
> > Cc: Eric Barton; Roland Dreier; Matt Leininger; 
> > openib-general at openib.org; Bernadat, Philippe
> > Subject: Re: [openib-general] Performance Degradation with 
> > OFED v. Voltaire
> > 
> > Philippe Bernadat wrote:
> > > Roland,
> > >
> > > Attached are the two lspci outputs.
> > >
> > > The only differences I see are:
> > >
> > > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
> > > 1d0
> > > < pcilib: Resource 5 in 
> > /sys/bus/pci/devices/0000:00:1f.1/resource has 
> > > a 64-bit address, ignoring
> > > 40c39
> > > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> > > ---
> > > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> > > [philippe at hamish o2ib]$
> > >
> > Have you tried running with
> > 
> > options ib_mthca tune_pci =1
> > 
> > Tziporet
> > 
> > 


From mst at mellanox.co.il  Thu Dec 14 06:40:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 16:40:15 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <ada4przsa6v.fsf@cisco.com>
References: <ada4przsa6v.fsf@cisco.com>
Message-ID: <20061214144015.GC27620@mellanox.co.il>

> I was going to apply this, but then I realized that mthca is screwed
> up on non-cache-coherent CPUs with memfree HCAs, and this patch makes
> things much worse.  The problem is that we allocate the MTT table with
> alloc_pages() and then do pci_map_sg().  But there's no
> pci_dma_sync_sg calls when the CPU tries to write directly to the MTT
> table, and in fact not even that would work: since a
> non-cache-coherent CPU can only work on cacheline-sized chunks there's
> no safe way to touch the MTT table.
> 
> What all that means is that FMRs are currently broken for memfree on
> non-coherent CPUs.  And this patch would break all memory
> registration.  I think the fix has to be to use dma_alloc_coherent()
> to allocate the pages for the MTT table (and any other table allocated
> in lowmem -- but I don't think there are any others).
> 
> Unfortunately my PowerPC 440 system is being reworked right now so I
> can't test this for a few days.
> 
> I think this still can go into 2.6.20 after -rc1 if we can get this
> fixed up.

Just to clarify - do you plan to fix this, or are you waiting for me to do it?


-- 
MST


From ogerlitz at voltaire.com  Thu Dec 14 06:44:37 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 14 Dec 2006 16:44:37 +0200
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
Message-ID: <45816355.4010801@voltaire.com>

Sean Hefty wrote:
> Export the rdma cm interfaces to userspace.
> +static ssize_t (*ucma_cmd_table[])(struct ucma_file *file,
> +				   const char __user *inbuf,
> +				   int in_len, int out_len) = {
> +	[RDMA_USER_CM_CMD_CREATE_ID]	= ucma_create_id,
> +	[RDMA_USER_CM_CMD_DESTROY_ID]	= ucma_destroy_id,
> +	[RDMA_USER_CM_CMD_BIND_ADDR]	= ucma_bind_addr,
> +	[RDMA_USER_CM_CMD_RESOLVE_ADDR]	= ucma_resolve_addr,
> +	[RDMA_USER_CM_CMD_RESOLVE_ROUTE]= ucma_resolve_route,
> +	[RDMA_USER_CM_CMD_QUERY_ROUTE]	= ucma_query_route,
> +	[RDMA_USER_CM_CMD_CONNECT]	= ucma_connect,
> +	[RDMA_USER_CM_CMD_LISTEN]	= ucma_listen,
> +	[RDMA_USER_CM_CMD_ACCEPT]	= ucma_accept,
> +	[RDMA_USER_CM_CMD_REJECT]	= ucma_reject,
> +	[RDMA_USER_CM_CMD_DISCONNECT]	= ucma_disconnect,
> +	[RDMA_USER_CM_CMD_INIT_QP_ATTR]	= ucma_init_qp_attr,
> +	[RDMA_USER_CM_CMD_GET_EVENT]	= ucma_get_event,
> +	[RDMA_USER_CM_CMD_GET_OPTION]	= NULL,
> +	[RDMA_USER_CM_CMD_SET_OPTION]	= NULL,
> +	[RDMA_USER_CM_CMD_NOTIFY]	= ucma_notify,
> +};

What about the rdma_cm_get_option() and rdma_cm_set_option() calls exposed by
librdmacm? Are they on their way out?

Or.
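
To make the question concrete: with a designated-initializer table like the one quoted above, a NULL slot only stays safe if the dispatcher checks for it before calling through the pointer. The following is a minimal userspace sketch of that pattern, not the actual ucma code. All names here (cmd_table, dispatch, the CMD_* values, the stub handlers) are hypothetical, and returning -ENOSYS for an unimplemented command is an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical command numbers, standing in for RDMA_USER_CM_CMD_*. */
enum {
	CMD_CREATE_ID,
	CMD_GET_OPTION,
	CMD_SET_OPTION,
	CMD_NOTIFY,
	CMD_MAX
};

typedef ssize_t (*cmd_handler)(const char *inbuf, int in_len, int out_len);

/* Stub handlers: report the whole input buffer as consumed. */
static ssize_t do_create_id(const char *inbuf, int in_len, int out_len)
{
	(void)inbuf;
	(void)out_len;
	return in_len;
}

static ssize_t do_notify(const char *inbuf, int in_len, int out_len)
{
	(void)inbuf;
	(void)out_len;
	return in_len;
}

static cmd_handler cmd_table[] = {
	[CMD_CREATE_ID]  = do_create_id,
	[CMD_GET_OPTION] = NULL,	/* reserved slot, no handler yet */
	[CMD_SET_OPTION] = NULL,	/* reserved slot, no handler yet */
	[CMD_NOTIFY]     = do_notify,
};

static ssize_t dispatch(unsigned int cmd, const char *inbuf,
			int in_len, int out_len)
{
	/* Reject command numbers beyond the end of the table. */
	if (cmd >= sizeof(cmd_table) / sizeof(cmd_table[0]))
		return -EINVAL;
	/* A NULL entry means the command is known but not implemented. */
	if (!cmd_table[cmd])
		return -ENOSYS;
	return cmd_table[cmd](inbuf, in_len, out_len);
}
```

Keeping the NULL entries reserves the command numbers, so the userspace ABI would not need to shift if the option calls are added later.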


From philippe_bernadat at hp.com  Thu Dec 14 07:09:10 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 16:09:10 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <458168A1.3090009@dev.mellanox.co.il>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538003@idaexc03.emea.cpqcorp.net>

> It's not related to OFED 1.1 or OFED 1.0, but to the difference
> between OFED and VAPI.
> In VAPI this setting was always applied. In OFED we do not apply it
> by default, so you need this parameter.
> 
> Can you please try it?
> 
> Tziporet
> 

Did. I guess you are still processing your email :-), see the next emails.


Philippe


From tziporet at dev.mellanox.co.il  Thu Dec 14 07:07:13 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 14 Dec 2006 17:07:13 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05537DAF@idaexc03.emea.cpqcorp.net>
Message-ID: <458168A1.3090009@dev.mellanox.co.il>

Bernadat, Philippe wrote:
>> Have you tried running with
>>
>> options ib_mthca tune_pci =1
>>
>>     
>
> My understanding is that this is no longer required with OFED-1.1. It
> used to make a significant difference with OFED-1.0, but I didn't
> observe it with OFED-1.1.
>
> And again, the user-mode performance is comparable between VIB and OFED.
>
> Philippe
>
>   
It's not related to OFED 1.1 or OFED 1.0, but to the difference between OFED 
and VAPI.
In VAPI this setting was always done. In OFED we do not do it by default 
and you need this parameter.

See this note in the mthca release notes:

4. Performance degradation due to wrong BIOS configuration:
   The PCI Express spec. requires BIOS to set the MaxReadReq register
   for each card for maximum performance and stability. 

   If you are seeing bandwidth performance degradation, you can try forcing
   the card to behave out of PCI Express spec. by setting the tune_pci=1 module
   parameter.  This tune_pci=1 option was the default setting in OFED
   1.0, which might have masked performance degradation on some systems.

   If tune_pci=1 improves bandwidth, please report the issue to your 
   BIOS vendor. Please note that Mellanox Technologies does not recommend using
   tune_pci=1 in production systems: working with tune_pci=1 option set is
   untested and is known to trigger stability issues on some platforms.
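
For reference, here is a minimal sketch (not part of the release notes) of
how the MaxReadReq setting mentioned above can be decoded from a raw PCI
Express Device Control register value, for example one read from an
lspci -xxx dump. The field position (bits 14:12) and the 128 << n encoding
come from the PCI Express spec; the helper name max_read_req_bytes is made
up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* PCI Express Device Control register: Max_Read_Request_Size is bits 14:12. */
#define PCIE_DEVCTL_READRQ_SHIFT 12
#define PCIE_DEVCTL_READRQ_MASK  0x7u

/* Hypothetical helper: decode the maximum read request size in bytes. */
static unsigned int max_read_req_bytes(uint16_t devctl)
{
	unsigned int code = (devctl >> PCIE_DEVCTL_READRQ_SHIFT) &
			    PCIE_DEVCTL_READRQ_MASK;

	/* Encodings 0..5 mean 128..4096 bytes; 6 and 7 are reserved. */
	assert(code <= 5);
	return 128u << code;
}
```

Comparing the decoded value on a slow node and a fast node is one way to
check whether the BIOS programmed the register differently.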


Can you please try it.

Tziporet


From halr at voltaire.com  Thu Dec 14 07:51:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 10:51:03 -0500
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05537F70@idaexc03.emea.cpqcorp.net>
Message-ID: <1166111433.28709.131124.camel@hal.voltaire.com>

On Thu, 2006-12-14 at 09:39, Bernadat, Philippe wrote:
> So I did try tune_pci=1. Didn't make any difference.
> 
> I used the same nodes to compare the lspci output.
> I could see:
> 
> [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed
> 30c30
> <       Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32
> ---
> >       Capabilities: [40] MSI-X: Enable- Mask- TabSize=32

Might the MSI-X difference explain it ?

-- Hal

> 38,39c38,39
> < 40: 11 50 1f 80 00 20 08 00 00 22 08 00 00 00 00 00
> < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> ---
> > 40: 11 50 1f 00 00 20 08 00 00 22 08 00 00 00 00 00
> > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> 
> So I added the ib_mthca msi_x=1 option.
> It didn't help.
> 
> So the only remaining difference now is:
> 
> [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 
> 39c39
> < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> ---
> > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> 
> No idea what this is.
> 
> Philippe
> 
> > -----Original Message-----
> > From: Bernadat, Philippe 
> > Sent: Thursday, December 14, 2006 1:24 PM
> > To: Tziporet Koren
> > Cc: Eric Barton; Roland Dreier; Matt Leininger; 
> > openib-general at openib.org
> > Subject: RE: [openib-general] Performance Degradation with 
> > OFED v. Voltaire
> > 
> > 
> > > Have you tried running with
> > > 
> > > options ib_mthca tune_pci=1
> > > 
> > 
> > My understanding is that this is not required anymore with 
> > OFED-1.1 - It used to make a significant difference with 
> > OFED-1.0, but I didn't observe it with OFED-1.1
> > 
> > And again, the user mode performance is comparable between 
> > VIB and OFED.
> > 
> > Philippe
> > 
> > > -----Original Message-----
> > > From: Tziporet Koren [mailto:tziporet at dev.mellanox.co.il] 
> > > Sent: Thursday, December 14, 2006 12:30 PM
> > > To: Bernadat, Philippe
> > > Cc: Eric Barton; Roland Dreier; Matt Leininger; 
> > > openib-general at openib.org; Bernadat, Philippe
> > > Subject: Re: [openib-general] Performance Degradation with 
> > > OFED v. Voltaire
> > > 
> > > Philippe Bernadat wrote:
> > > > Roland,
> > > >
> > > > Attached are the two lspci outputs.
> > > >
> > > > The only differences I see are:
> > > >
> > > > [philippe at hamish o2ib]$ diff lspci.vib lspci.ofed
> > > > 1d0
> > > > < pcilib: Resource 5 in 
> > > /sys/bus/pci/devices/0000:00:1f.1/resource has 
> > > > a 64-bit address, ignoring
> > > > 40c39
> > > > < 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> > > > ---
> > > > > 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> > > > [philippe at hamish o2ib]$
> > > >
> > > Have you tried running with
> > > 
> > > options ib_mthca tune_pci=1
> > > 
> > > Tziporet
> > > 
> > > 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 


From philippe_bernadat at hp.com  Thu Dec 14 07:56:10 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 16:56:10 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <1166111433.28709.131124.camel@hal.voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net>

> > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed
> > 30c30
> > <       Capabilities: [40] MSI-X: Enable+ Mask- TabSize=32
> > ---
> > >       Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
> 
> Might the MSI-X difference explain it ?

Yes it went away when I added the option (see lines below)

> > 
> > So I added the ib_mthca msi_x=1 option.
> > It didn't help.
> > 
> > So the only remaining difference now is:
> > 
> > [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 
> > 39c39
> > < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> > ---
> > > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00
> > 
> > No idea what this is.
> > 


From chas at cmf.nrl.navy.mil  Thu Dec 14 07:49:08 2006
From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR)
Date: Thu, 14 Dec 2006 10:49:08 -0500
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <ada64cfqkpg.fsf@cisco.com>
Message-ID: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil>

In message <ada64cfqkpg.fsf at cisco.com>,"Roland Dreier" writes:
>I'm not sure who declared it "unsupported" and I would really like to
>know what issue(s) led to that declaration.  Your report is the first
>I've heard of anything like this, and I have to say that it seems
>pretty implausible that running a 32-bit kernel on 64-bit-capable
>hardware would be the source of problems -- if there is an issue then
>I would expect it to be something to do with the 32-bit kernel.

we saw this "problem" last week actually.  a new dual core
duo machine was installed with a 32-bit version of suse 10.  srp ran,
but sometimes the scsi data buffers had minor single byte errors (they
didnt appear to be at page boundaries but i am not certain about 
that).  perhaps a kmap issue?  64-bit machines running 32-bit/PAE
with more than 4GB of memory?

this is anecdotal evidence of course.  we were (are) seeing symbol
errors on the cable but i should think these errors get as far as
the srp layer.


From rdreier at cisco.com  Thu Dec 14 07:58:06 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 07:58:06 -0800
Subject: [openib-general] [PATCH] mthca: save low memory used for
	reserved objects
In-Reply-To: <20061214124629.GB24840@mellanox.co.il> (Michael S.
	Tsirkin's message of "Thu, 14 Dec 2006 14:46:29 +0200")
References: <20061214124629.GB24840@mellanox.co.il>
Message-ID: <adaodq6o4fl.fsf@cisco.com>

 > We never need to allocate memory for reserved objects in low memory.

True, but...

 >  		table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT,
 > -						(use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
 > -						__GFP_NOWARN);
 > +						 GFP_HIGHUSER | __GFP_NOWARN);

...it's not quite so simple, is it?  the chunk being allocated here
might not contain exclusively reserved objects -- it might have some
real objects too.
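
To make the concern concrete, here is a rough sketch of the check that
would be needed per chunk. It is not mthca code: the helper name and its
parameters are hypothetical, and it only assumes that reserved objects
occupy the first reserved * obj_size bytes of the table.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if chunk number `chunk` holds only
 * reserved objects, 0 if it holds at least some non-reserved ones.
 */
static int chunk_is_all_reserved(size_t chunk, size_t chunk_size,
				 size_t obj_size, size_t reserved)
{
	size_t reserved_bytes = reserved * obj_size;

	assert(chunk_size > 0 && obj_size > 0);
	/* The chunk covers bytes [chunk * chunk_size, (chunk + 1) * chunk_size). */
	return (chunk + 1) * chunk_size <= reserved_bytes;
}
```

With 1024-byte chunks and 20 reserved 64-byte objects, chunk 0 is fully
reserved but chunk 1 is mixed, so only chunk 0 could safely skip low
memory.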


From rdreier at cisco.com  Thu Dec 14 07:59:12 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 07:59:12 -0800
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <20061214144015.GC27620@mellanox.co.il> (Michael S.
	Tsirkin's message of "Thu, 14 Dec 2006 16:40:15 +0200")
References: <ada4przsa6v.fsf@cisco.com> <20061214144015.GC27620@mellanox.co.il>
Message-ID: <adak60uo4dr.fsf@cisco.com>

 > Just to clarify - do you plan to fix this, or are waiting for me to do it?

I am planning to work on it but I am going on vacation from Dec 17th
until Jan 3rd so it might not be for a while...


From rdreier at cisco.com  Thu Dec 14 08:02:18 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 08:02:18 -0800
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil> (chas
	williams's message of "Thu, 14 Dec 2006 10:49:08 -0500")
References: <200612141549.kBEFn8mZ032667@cmf.nrl.navy.mil>
Message-ID: <adafybio48l.fsf@cisco.com>

 > we saw this "problem" last week actually.  a new dual core
 > duo machine was installed with a 32-bit version of suse 10.  srp ran,
 > but sometimes the scsi data buffers had minor single byte errors (they
 > didnt appear to be at page boundaries but i am not certain about 
 > that).  perhaps a kmap issue?  64-bit machines running 32-bit/PAE
 > with more than 4GB of memory?

Core duo (not Core2) isn't 64-bit capable is it?  Did you mean core2
and if so did your problems go away by running a 64-bit kernel?

 - R.


From mst at mellanox.co.il  Thu Dec 14 08:03:05 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 18:03:05 +0200
Subject: [openib-general] [PATCH] mthca: save low memory used for
	reserved objects
In-Reply-To: <adaodq6o4fl.fsf@cisco.com>
References: <20061214124629.GB24840@mellanox.co.il> <adaodq6o4fl.fsf@cisco.com>
Message-ID: <20061214160305.GD27620@mellanox.co.il>

>  > We never need to allocate memory for reserved objects in low memory.
> 
> True, but...
> 
>  >  		table->icm[i] = mthca_alloc_icm(dev, chunk_size >> PAGE_SHIFT,
>  > -						(use_lowmem ? GFP_KERNEL : GFP_HIGHUSER) |
>  > -						__GFP_NOWARN);
>  > +						 GFP_HIGHUSER | __GFP_NOWARN);
> 
> ...it's not quite so simple, is it?  the chunk being allocated here
> might not contain exclusively reserved objects -- it might have some
> real objects too.

Correct. Missed this.

-- 
MST


From rdreier at cisco.com  Thu Dec 14 08:06:04 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 08:06:04 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net>
	(Philippe Bernadat's message of "Thu, 14 Dec 2006 16:56:10 +0100")
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380B6@idaexc03.emea.cpqcorp.net>
Message-ID: <adabqm6o42b.fsf@cisco.com>

OK, it looks like the PCI config is OK.

I guess the difference must be in the Lustre NAL, since you say other
userspace code gets comparable performance.  Is there any difference
in the architecture of the NAL for the Voltaire stack and the standard
Linux stack?

You may have to rely on Voltaire and/or the Lustre people to fix this,
since they're the only ones with the complete picture about the
Voltaire stack.

 - R.


From philippe_bernadat at hp.com  Thu Dec 14 08:11:37 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 17:11:37 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <adabqm6o42b.fsf@cisco.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>

> I guess the difference must be in the Lustre NAL, since you say other
> userspace code gets comparable performance.  Is there any difference
> in the architecture of the NAL for the Voltaire stack and the standard
> Linux stack?

I think Eric described the major differences earlier on, here it is, see
second half:

On Tue, 2006-12-05 at 12:22 +0000, Eric Barton wrote:
> Hi,
> 
> We'd dearly like some help to understand why we seem to be having
> performance issues with OFED.  When we run a lustre network bandwidth
> benchmark, we find significant performance degradation on OFED versus
> Voltaire...
> 
>              Premap (256 RDMA frags)     Map on demand (1 RDMA frag)
>              Voltaire  OFED  Ratio       Voltaire  OFED  Ratio 
> Writes MB/s  682       567   83 %        577       436   75 %
> Reads MB/s   658       554   84 %        555       432   77 %
> 
> These tests measure the bandwidth of 1MByte transfers pipelined 8 deep.
> All hardware/software was the same, apart from the IB stack and the lustre
> network driver.
> 
> The architecture of the lustre network drivers for OFED and Voltaire are
> almost identical.  Both use RC QPs with the same control message protocol
> to set up bulk data transfers using RDMA WRITE.  Control messages use a
> credit flow protocol to ensure that they are only sent when buffers are
> posted to receive them.  Concurrent transfers over the same QP are
> supported so that lustre can pipeline bulk I/O.
> 
> The only difference between the lustre network drivers is that the Voltaire
> driver has a single global CQ and the OFED driver has 1 CQ per QP.  However
> the measurement above are for a single pair of nodes - in this case both
> implementations use a single CQ.
> 
> By default, the drivers pre-map all of physical memory so each RDMA
> consists of page fragments.  However, we can also compile both drivers to
> map on demand using FMR so that RDMA is not fragmented.  The results above
> compare both methods and although both drivers perform worse when mapping,
> the OFED driver takes the bigger hit.
> 
> We'd be delighted if anyone can shed any light or can suggest any steps we
> should take to discover the reason.  We're also very willing to provide
> assistance if any of the OpenFabrics developers wants to duplicate the
> setup.
> 
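
As a sanity check on the table in Eric's message, the Ratio column appears
to be OFED bandwidth divided by Voltaire bandwidth, truncated to a whole
percent. The sketch below reproduces that arithmetic; the function name is
made up, and the truncation is inferred from the numbers rather than
stated in the original.

```c
#include <assert.h>

/* OFED bandwidth as a whole percentage of Voltaire bandwidth, truncated. */
static unsigned int ratio_percent(unsigned int ofed_mbps,
				  unsigned int voltaire_mbps)
{
	assert(voltaire_mbps > 0);
	return ofed_mbps * 100u / voltaire_mbps;
}
```

Feeding in the four rows (567/682, 554/658, 436/577, 432/555) gives the
reported 83, 84, 75 and 77 percent.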

 
> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> Sent: Thursday, December 14, 2006 5:06 PM
> To: Bernadat, Philippe
> Cc: Hal Rosenstock; Tziporet Koren; openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire
> 
> OK, it looks like the PCI config is OK.
> 
> I guess the difference must be in the Lustre NAL, since you say other
> userspace code gets comparable performance.  Is there any difference
> in the architecture of the NAL for the Voltaire stack and the standard
> Linux stack?
> 
> You may have to rely on Voltaire and/or the Lustre people to fix this,
> since they're the only ones with the complete picture about the
> Voltaire stack.
> 
>  - R.
> 


From philippe_bernadat at hp.com  Thu Dec 14 08:16:12 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Thu, 14 Dec 2006 17:16:12 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <adabqm6o42b.fsf@cisco.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net>

So Roland, what is this subtle difference that remains ?

> [root at axis21_EL4_u3 o2ib]$ diff lspci_axis19_vib lspci_axis21_ofed 
> 39c39
> < 50: 03 60 ff ff 11 11 00 00 00 00 00 00 00 00 00 00
> ---
> > 50: 03 60 ff 7f 11 11 00 00 00 00 00 00 00 00 00 00

Philippe

> -----Original Message-----
> From: Roland Dreier [mailto:rdreier at cisco.com] 
> Sent: Thursday, December 14, 2006 5:06 PM
> To: Bernadat, Philippe
> Cc: Hal Rosenstock; Tziporet Koren; openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire
> 
> OK, it looks like the PCI config is OK.
> 
> I guess the difference must be in the Lustre NAL, since you say other
> userspace code gets comparable performance.  Is there any difference
> in the architecture of the NAL for the Voltaire stack and the standard
> Linux stack?
> 
> You may have to rely on Voltaire and/or the Lustre people to fix this,
> since they're the only ones with the complete picture about the
> Voltaire stack.
> 
>  - R.
> 


From chas at cmf.nrl.navy.mil  Thu Dec 14 08:22:06 2006
From: chas at cmf.nrl.navy.mil (chas williams - CONTRACTOR)
Date: Thu, 14 Dec 2006 11:22:06 -0500
Subject: [openib-general] [PATCH] install.sh: Cause less pain to SRP
 users who didn't RTFM
In-Reply-To: <adafybio48l.fsf@cisco.com>
Message-ID: <200612141622.kBEGM6Lj000670@cmf.nrl.navy.mil>

In message <adafybio48l.fsf at cisco.com>,Roland Dreier writes:
>Core duo (not Core2) isn't 64-bit capable is it?  Did you mean core2
>and if so did your problems go away by running a 64-bit kernel?

sorry, yes i meant core2 duo.  specifically,

	model name      : Intel(R) Xeon(R) CPU            5160  @ 3.00GHz

we havent had a chance to reinstall to a 64-bit version of suse for
this machine.


From rdreier at cisco.com  Thu Dec 14 08:29:57 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 08:29:57 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net>
	(Philippe Bernadat's message of "Thu, 14 Dec 2006 17:16:12 +0100")
References: <3F3894AC7A13B04E83CEBC95CFD3047E05538102@idaexc03.emea.cpqcorp.net>
Message-ID: <ada7iwuo2yi.fsf@cisco.com>

 > So Roland, what is this subtle difference that remains ?

I'm not sure ... something in the VPD capability.  Doesn't seem
significant.

 - R.


From rdreier at cisco.com  Thu Dec 14 08:31:24 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 08:31:24 -0800
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
	(Philippe Bernadat's message of "Thu, 14 Dec 2006 17:11:37 +0100")
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
Message-ID: <ada3b7io2w3.fsf@cisco.com>

 > I think Eric described the major differences earlier on, here it is, see
 > second half:

OK, I forgot about that.

I guess one last thing to check would be the MTU being used for the RC
connections.  Since this is PCI-X HW then the MTU should be 1024 for
best throughput (instead of the max MTU of 2048).
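
A small sketch of what checking this could look like. The numeric MTU
encodings (1 = 256 bytes up to 5 = 4096 bytes) are the standard InfiniBand
ones; the enum and helper names are hypothetical, and pick_rc_mtu only
illustrates the "cap at 1024 on PCI-X (Tavor) HCAs" policy, not how any
existing driver or SM implements it.

```c
#include <assert.h>

/* InfiniBand MTU encodings as carried in PortInfo and path records. */
enum ib_mtu_code {
	MTU_256  = 1,
	MTU_512  = 2,
	MTU_1024 = 3,
	MTU_2048 = 4,
	MTU_4096 = 5
};

/* Convert an MTU encoding to a size in bytes. */
static unsigned int mtu_code_to_bytes(enum ib_mtu_code code)
{
	assert(code >= MTU_256 && code <= MTU_4096);
	return 128u << code;
}

/* Hypothetical policy helper: cap the RC path MTU at 1024 on Tavor HCAs. */
static enum ib_mtu_code pick_rc_mtu(enum ib_mtu_code port_max, int is_tavor)
{
	if (is_tavor && port_max > MTU_1024)
		return MTU_1024;
	return port_max;
}
```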

 - R.


From mst at mellanox.co.il  Thu Dec 14 09:04:55 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 19:04:55 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <ada4przsa6v.fsf@cisco.com>
References: <ada4przsa6v.fsf@cisco.com>
Message-ID: <20061214170455.GA12781@mellanox.co.il>

> and in fact not even that would work: since a non-cache-coherent CPU
> can only work on cacheline-sized chunks there's no safe way to touch the MTT
> table.

Roland, could you please clarify what you meant by this statement?

With current code firmware might be doing WRITE_MTT while CPU is writing to the
same cache line, and I expect this might confuse things, but it seems that with
my fmr/mr merge patch, we never have both CPU and firmware write to the same
MTTs entries.

So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code
sufficient?

-- 
MST


From mst at mellanox.co.il  Thu Dec 14 09:31:45 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 19:31:45 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <ada3b7io2w3.fsf@cisco.com>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
	<ada3b7io2w3.fsf@cisco.com>
Message-ID: <20061214173145.GC12781@mellanox.co.il>

>  > I think Eric described the major differences earlier on, here it is, see
>  > second half:
> 
> OK, I forgot about that.
> 
> I guess one last thing to check would be the MTU being used for the RC
> connections.  Since this is PCI-X HW then the MTU should be 1024 for
> best throughput (instead of the max MTU of 2048).

The MTU issue is described in the OFED release notes.
You must turn on the Tavor work-around for it in opensm.
This was introduced late in the release cycle, so it was deemed safer
to make it off by default.

By the way, Eitan, Hal, can we turn this on by default now?
This way we'll get more feedback from people, and we'll still have
time to turn it off before release if this unexpectedly creates issues.

-- 
MST


From mshefty at ichips.intel.com  Thu Dec 14 09:57:39 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 09:57:39 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <45816355.4010801@voltaire.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com>
Message-ID: <45819093.3090405@ichips.intel.com>

> What about the rdma_cm_get_option() and rdma_cm_set_option() exposed by 
> librdmacm? is it something which is on its way out?

I did not expose those to userspace at this time.  I believe what was there 
needed to be reworked.  For example, the timeout could be generic, rather than 
IB specific, and the option to get a list of path records should be eliminated.
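
To illustrate what leaving those two slots NULL means for callers, here is
a userspace sketch of a command table with unimplemented entries. The
command names, handlers and error codes are placeholders chosen for this
example; I am assuming, not quoting, how the real dispatcher treats a NULL
slot.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { CMD_CONNECT, CMD_GET_OPTION, CMD_SET_OPTION, CMD_NOTIFY, CMD_MAX };

typedef int (*cmd_handler)(void);

static int do_connect(void) { return 0; }
static int do_notify(void)  { return 0; }

/* Slots left NULL keep their command numbers reserved without a handler. */
static cmd_handler cmd_table[CMD_MAX] = {
	[CMD_CONNECT]    = do_connect,
	[CMD_GET_OPTION] = NULL,
	[CMD_SET_OPTION] = NULL,
	[CMD_NOTIFY]     = do_notify,
};

/* Reject out-of-range commands and commands with no handler. */
static int dispatch(int cmd)
{
	if (cmd < 0 || cmd >= CMD_MAX)
		return -EINVAL;
	if (!cmd_table[cmd])
		return -ENOSYS;
	return cmd_table[cmd]();
}
```

Keeping the numbers reserved means the ABI does not have to be renumbered
if reworked options are exposed later.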

- Sean


From sashak at voltaire.com  Thu Dec 14 10:12:59 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 20:12:59 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214061951.GH1689@mellanox.co.il>
References: <20061213232638.GC14186@sashak.voltaire.com>
	<20061214061951.GH1689@mellanox.co.il>
Message-ID: <20061214181259.GE28849@sashak.voltaire.com>

On 08:19 Thu 14 Dec     , Michael S. Tsirkin wrote:
> > > > For me it is unclear yet how long we may need this - 1.1 still be in
> > > > SVN yet, and 1.1 git branch is updated there.
> > > 
> > > By the way, one can't actually build OFED 1.1 userspace from git
> > > because OFED also applies some patches after checking things out
> > > from svn. They are here:
> > > https://openib.org/svn/gen2/branches/1.1/ofed/patches/user_fixes
> > 
> > I guess those patches should be committed in 1.1 svn branch (and imported
> > to git's 1.1).
> 
> This could be done, but why invest the time?

To do commits? The SVN commit was done anyway, just in a different place
and in the form of diffs.

> And once we do touch the branch, who will test that the thing you
> pull from there even works?

How is this different? Who will test the branch + ofed_fixes diffs?
Use a tag to mark the tested version (or date).

> I would say that if you really want to mirror the OFED branch,
> and make it buildable to some extent, the way to do this
> would be to have a single git tree with all of OFED - patches,
> scripts and all.

I'm able to build OpenSM for OFED 1.1 from the git tree just fine. And a
synced 1.1 branch in git lets me do some useful stuff - I can log, diff,
rebase and cherry-pick fixes, etc. - everything is in-tree (I said
that I like branches :)).

> Oh, by the way, some tools in OFED tried to read an svn version
> in their code, this wouldn't work on git.
> And I don't see git trees for a lot of OFED bits - look at
> https://openib.org/svn/gen2/branches/1.1/ofed/

IMHO it is not too hard to switch OFED 1.1.x to git. But that is not
really my point - I just think that a synced 1.1 branch in the git tree can be
useful for developers and for 1.1 support work.

> What I am trying to say is, let's just keep SVN around and
> do OFED 1.1 maintenance there. You can't fix the history.
> 
> > Any reason why it is not committed?
> 
> This was discussed before OFED 1.1 and seems to have worked well so far.
> 
> We tried to keep our modifications to upstream as separate as possible -
> this made transition to upstream in OFED 1.2 very easy as it was trivial
> to check what was applied and what wasn't.

I cannot understand how not committing changes helps.

Sasha


From eitan at mellanox.co.il  Thu Dec 14 10:13:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Dec 2006 20:13:28 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061214173145.GC12781@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
	<ada3b7io2w3.fsf@cisco.com> <20061214173145.GC12781@mellanox.co.il>
Message-ID: <45819448.8060005@mellanox.co.il>

Michael S. Tsirkin wrote:
>>  > I think Eric described the major differences earlier on, here it is, see
>>  > second half:
>>
>> OK, I forgot about that.
>>
>> I guess one last thing to check would be the MTU being used for the RC
>> connections.  Since this is PCI-X HW then the MTU should be 1024 for
>> best throughput (instead of the max MTU of 2048).
>>     
>
> The MTU issue is described in the OFED release notes.
> You must turn on the Tavor work-around for it in opensm.
> This was introduced late in the release cycle, so it was deemed safer
> to make it off by default.
>
> By the way, Eitan, Hal, can we turn this on by default now?
> This way we'll get more feedback from people, and we'll still have
> time to turn it off before release if this unexpectedly creates issues.
>
>   
I agree that we should enable this feature by default now.

EZ


From mst at mellanox.co.il  Thu Dec 14 10:40:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 20:40:15 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214181259.GE28849@sashak.voltaire.com>
References: <20061214181259.GE28849@sashak.voltaire.com>
Message-ID: <20061214184015.GE12781@mellanox.co.il>

> > > Any reason why it is not committed?
> > 
> > This was discussed before OFED 1.1 and seems to have worked well so far.
> > 
> > We tried to keep our modifications to upstream as separate as possible -
> > this made transition to upstream in OFED 1.2 very easy as it was trivial
> > to check what was applied and what wasn't.
> 
> I cannot understand how not committing changes helps.

OFED is tracking trunk and using quilt to manage changes against trunk.
That's why they are in the form of patches.
Now that everything is in git, we can look at using stgit for this, but
I'm not sure how well publishing an stgit-managed tree would work.

-- 
MST


From mst at mellanox.co.il  Thu Dec 14 10:46:31 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 20:46:31 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214181259.GE28849@sashak.voltaire.com>
References: <20061213232638.GC14186@sashak.voltaire.com>
	<20061214061951.GH1689@mellanox.co.il>
	<20061214181259.GE28849@sashak.voltaire.com>
Message-ID: <20061214184631.GF12781@mellanox.co.il>

> > > I guess those patches should be committed in 1.1 svn branch (and imported
> > > to git's 1.1).
> > 
> > This could be done, but why invest the time?
> 
> To do commits? The SVN commit was done anyway, just in a different place
> and in the form of diffs.

But it was already done, is my point.

> > And once we do touch the branch, who will test that the thing you
> > pull from there even works?
> 
> How is this different? Who will test the branch + ofed_fixes diffs?

No one until we do a bugfix release :).

> > I would say that if you really want to mirror the OFED branch,
> > and make it buildable to some extent, the way to do this
> > would be to have a single git tree with all of OFED - patches,
> > scripts and all.
> 
> I'm able to build OpenSM for OFED 1.1 from the git tree just fine. And a
> synced 1.1 branch in git lets me do some useful stuff - I can log, diff,
> rebase and cherry-pick fixes, etc. - everything is in-tree (I said
> that I like branches :)).

I sure don't have a problem with that. But it would be better to avoid
touching 1.1 svn branch any more than absolutely necessary.

Do you only want this for opensm? opensm happens not to have any patches,
so it's easy.

-- 
MST


From mst at mellanox.co.il  Thu Dec 14 10:52:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 20:52:10 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <20061214170455.GA12781@mellanox.co.il>
References: <ada4przsa6v.fsf@cisco.com> <20061214170455.GA12781@mellanox.co.il>
Message-ID: <20061214185210.GH12781@mellanox.co.il>

> > and in fact not even that would work: since a non-cache-coherent CPU
> > can only work on cacheline-sized chunks there's no safe way to touch the MTT
> > table.
> 
> Roland, could you please clarify what you meant by this statement?
> 
> With current code firmware might be doing WRITE_MTT while CPU is writing to the
> same cache line, and I expect this might confuse things, but it seems that with
> my fmr/mr merge patch, we never have both CPU and firmware write to the same
> MTTs entries.
> 
> So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code
> sufficient?

Documentation/DMA-mapping.txt actually says:

> Without that, you'd see cacheline
> sharing problems (data corruption) on CPUs with DMA-incoherent caches.
> (The CPU could write to one word, DMA would write to a different one
>  in the same cache line, and one of them could be overwritten.)

So with my patch, since we never have both HW and CPU DMA into the buffer,
we should be OK.
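
For illustration only, the cacheline-sharing condition from
DMA-mapping.txt boils down to a check like the one below. The helper name
is made up; the 8-byte entry size and 64-byte line size used in the note
after the code are assumptions for the example, not values taken from the
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if byte offsets a and b fall in the same cache line. */
static int same_cacheline(size_t a, size_t b, size_t line_size)
{
	assert(line_size > 0);
	return a / line_size == b / line_size;
}
```

With 8-byte entries and 64-byte lines, entries 0 and 7 (offsets 0 and 56)
share a line while entries 7 and 8 (offsets 56 and 64) do not, so the
hazard only exists if the CPU and the device write entries within the same
group of eight.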

-- 
MST



From sean.hefty at intel.com  Thu Dec 14 11:22:19 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 11:22:19 -0800
Subject: [openib-general] [PATCH] 2.6.20 rdma_ucm: fix struct ucma_event leak
Message-ID: <000001c71fb5$2cf517b0$8698070a@amr.corp.intel.com>

We discard new connection requests while the listen backlog is full,
but leak a struct ucma_event in the process.  Free the structure.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index f51b755..ace2cad 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -209,6 +209,7 @@ static int ucma_event_handler(struct rdm
 	if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) {
 		if (!ctx->backlog) {
 			ret = -EDQUOT;
+			kfree(uevent);
 			goto out;
 		}
 		ctx->backlog--;
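
For readers unfamiliar with the code, the pattern being fixed can be
sketched in plain userspace C as follows. The struct and function names
are invented for this example, and live_events is just a counter added to
make the leak observable; none of this is the ucma implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct event { int type; };
struct ctx   { int backlog; struct event *queued; };

static int live_events;	/* allocation counter, for illustration only */

static int handle_connect_request(struct ctx *ctx)
{
	struct event *ev = malloc(sizeof(*ev));

	if (!ev)
		return -ENOMEM;
	live_events++;

	if (!ctx->backlog) {
		/* Dropping the request: release the event before bailing out. */
		free(ev);
		live_events--;
		return -EDQUOT;
	}
	ctx->backlog--;
	ctx->queued = ev;	/* ownership passes to the context */
	return 0;
}
```

Without the free() on the early-exit path, every rejected request would
leave one allocation behind.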


From sashak at voltaire.com  Thu Dec 14 11:50:34 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 21:50:34 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214184015.GE12781@mellanox.co.il>
References: <20061214181259.GE28849@sashak.voltaire.com>
	<20061214184015.GE12781@mellanox.co.il>
Message-ID: <20061214195034.GA7838@sashak.voltaire.com>

On 20:40 Thu 14 Dec     , Michael S. Tsirkin wrote:
> 
> Now that everything is in git, we can look at using stgit for this, but
> I'm not sure how well publishing an stgit-managed tree would work.

I used stgit a couple of months ago, but switched to core git; today it
does everything that stgit did. I don't think, however, that this is better
for publishing - git-rebase and git-reset produce non-linear history.

Sasha


From eitan at mellanox.co.il  Thu Dec 14 11:53:52 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Dec 2006 21:53:52 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
 completion
In-Reply-To: <4581525C.9060104@mellanox.co.il>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
	<1166098306.28709.122104.camel@hal.voltaire.com>
	<4581525C.9060104@mellanox.co.il>
Message-ID: <4581ABD0.7050509@mellanox.co.il>

Update on analysis of failures:

Eitan Zahavi wrote:
> Hal Rosenstock wrote:
>   
>> Hi Eitan,
>>
>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>>   
>>     
>>> OSM Simulation Regression Summary
>>> OpenSM rev = ____  
>>> ibutils rev = ____  
>>> Total=264 Pass=261 Fail=3
>>>
>>> Pass:
>>> 36 Stability IS1-16.topo
>>> 36 Pkey IS1-16.topo
>>> 36 Multicast IS1-16.topo
>>> 36 LidMgr IS1-16.topo
>>> 35 OsmStress IS1-16.topo
>>> 12 Stability IS3-loop.topo
>>> 12 Stability IS3-128.topo
>>> 12 Pkey IS3-128.topo
>>> 12 OsmStress IS3-128.topo
>>> 12 Multicast IS3-loop.topo
>>> 11 Multicast IS3-128.topo
>>> 11 LidMgr IS3-128.topo
>>>
>>> Failures:
>>> 1 OsmStress IS1-16.topo
>>>       
Job was killed in the middle. Just an accident.
>>> 1 Multicast IS3-128.topo
>>>       
A single packet was dropped on the way to the SM. It is still not clear
where. However, I have seen a perfectly good link reported as missing by
the drop manager.
I will rerun some tests with valgrind, as I think this might be a memory
corruption issue.
>>> 1 LidMgr IS3-128.topo
>>>       
It seems the last sweep started before the last LID change was made,
so it missed one of the nodes.
An additional sweep was added at the end of the test, just to make sure
all changes are handled.
>>>     
>>>       
>> There are now 2 more failures. You had previously explained OsmStress
>> failure as needing more investigation. Now there is a Multicast and
>> LidMgr failure yet nothing really changed since the previous run the
>> night before. Are these new tests ? What were the failures ?
>>   
>>     
> The tests use random seeds and thus can catch other bugs in each run.
> I am investigating these failures. Some might be due to bugs in the 
> checker code too.
>
> Please pay attention the failure rate is low (LidMgr pass 36+11 runs 
> failed 1 test).
> This to imply the bug is a hard to find one.
>   
>> The repetitions have also been reduced from previous reports. Are these
>> the same or different tests ?
>>   
>>     
> Number of repetitions depends on runtime. The regression started later 
> thus run less iterations.
> I run the "same" tests ("same" means same code not same random sequence).
>   
>> -- Hal
>>
>>
>> _______________________________________________
>> openib-general mailing list
>> openib-general at openib.org
>> http://openib.org/mailman/listinfo/openib-general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


From sashak at voltaire.com  Thu Dec 14 12:04:02 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 22:04:02 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214184631.GF12781@mellanox.co.il>
References: <20061213232638.GC14186@sashak.voltaire.com>
	<20061214061951.GH1689@mellanox.co.il>
	<20061214181259.GE28849@sashak.voltaire.com>
	<20061214184631.GF12781@mellanox.co.il>
Message-ID: <20061214200402.GB7838@sashak.voltaire.com>

On 20:46 Thu 14 Dec     , Michael S. Tsirkin wrote:
> 
> > > I would say that if you really want to mirror the OFED branch,
> > > and make it buildable to some extent, the way to do this
> > > would be to have a single git tree with all of OFED - patches,
> > > scripts and all.
> > 
> > I'm able to build OpenSM for OFED 1.1 from git tree just fine. And
> > synced 1.1 branch in git let me some useful stuff - I can log, diff,
> > rebase and cherry-pick fixes, etc.. - everything is in-tree (I said
> > that I like branches :)).
> 
> I sure don't have a problem with that. But it would be better to avoid
> touching 1.1 svn branch any more than absolutely necessary.

Yes, only critical fixes should be committed - that is, the patches from
ofed_fixes/.

> Do you only want this for opensm? opensm happens not to have any patches,
> so it's easy.

OpenSM has 1.1 fixes too - it is all committed.

Sasha


From kliteyn at dev.mellanox.co.il  Thu Dec 14 11:58:29 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Thu, 14 Dec 2006 21:58:29 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
	'hang'
Message-ID: <4581ACE5.9000109@dev.mellanox.co.il>

Hi Hal

This patch fixes a bug that caused the ucast manager to return
OSM_SIGNAL_DONE_PENDING even when there are no pending transactions.
It adds a boolean flag that records whether any change was made, that is,
whether any MAD was sent. If none was, OSM_SIGNAL_DONE is returned instead.

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/opensm/osm_ucast_mgr.h |    6 ++++++
 osm/opensm/osm_ucast_mgr.c         |   13 ++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/osm/include/opensm/osm_ucast_mgr.h b/osm/include/opensm/osm_ucast_mgr.h
index 8237963..39bf45a 100644
--- a/osm/include/opensm/osm_ucast_mgr.h
+++ b/osm/include/opensm/osm_ucast_mgr.h
@@ -104,6 +104,7 @@ typedef struct _osm_ucast_mgr
 	osm_req_t	*p_req;
 	osm_log_t	*p_log;
 	cl_plock_t	*p_lock;
+	boolean_t	 any_change;
 	uint8_t		*lft_buf;
 } osm_ucast_mgr_t;
 /*
@@ -120,6 +121,11 @@ typedef struct _osm_ucast_mgr
 *	p_lock
 *		Pointer to the serializing lock.
 *
+*	any_change
+*		Initialized to FALSE at the beginning of the algorithm,
+*		set to TRUE by osm_ucast_mgr_set_fwd_table() if any mad 
+*		was sent.
+*
 *	lft_buf
 *		LFT buffer - used during LFT calculation/setup.
 *
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index 3341eea..e977253 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -984,6 +984,7 @@ osm_ucast_mgr_set_fwd_table(
     }
     else
     {
+      p_mgr->any_change = TRUE;
       /*
         HACK: for now we will assume we succeeded to send
         and set the local DB based on it. This should allow
@@ -1220,6 +1221,7 @@ osm_ucast_mgr_process(
   if (cl_qmap_count( p_sw_guid_tbl ) == 0)
     goto Exit;
 
+  p_mgr->any_change = FALSE;
   cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
 
   if (!p_routing_eng->build_lid_matrices ||
@@ -1246,13 +1248,10 @@ osm_ucast_mgr_process(
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
     __osm_ucast_mgr_dump_tables( p_mgr );
 
-  /*
-    For now don't bother checking if the switch forwarding tables
-    actually needed updating.  The current code will always update
-    them, and thus leave transactions pending on the wire.
-    Therefore, return OSM_SIGNAL_DONE_PENDING.
-  */
-  signal = OSM_SIGNAL_DONE_PENDING;
+  if (p_mgr->any_change)
+     signal = OSM_SIGNAL_DONE_PENDING;
+  else
+     signal = OSM_SIGNAL_DONE;
 
   osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
           "osm_ucast_mgr_process: "
-- 
1.4.4.1.GIT
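
The logic of the patch above can be sketched in isolation as follows.
This is a simplified model, not OpenSM code: every demo_* name is
hypothetical, and the "skip the update when the table already matches"
check is an assumption made for illustration. The flag is cleared at
the start of a pass, set whenever an update is sent, and used to pick
the returned signal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real OpenSM types and signal values differ. */
enum demo_signal { DEMO_SIGNAL_DONE, DEMO_SIGNAL_DONE_PENDING };

struct demo_switch {
	int wanted_port;	/* entry the routing engine computed */
	int current_port;	/* entry believed to be programmed */
};

struct demo_mgr {
	bool any_change;	/* true if any update was sent in this pass */
};

/* Sends an update only if the table differs, and records that it did. */
static void demo_set_fwd_table(struct demo_mgr *mgr, struct demo_switch *sw)
{
	if (sw->wanted_port == sw->current_port)
		return;			/* nothing sent, nothing pending */

	/* Assume the send succeeded and update the local copy. */
	sw->current_port = sw->wanted_port;
	mgr->any_change = true;
}

static enum demo_signal demo_process(struct demo_mgr *mgr,
				     struct demo_switch *sw, size_t count)
{
	size_t i;

	assert(mgr != NULL);

	mgr->any_change = false;	/* reset at the start of each pass */
	for (i = 0; i < count; i++)
		demo_set_fwd_table(mgr, &sw[i]);

	/* Report pending transactions only if something was sent. */
	return mgr->any_change ? DEMO_SIGNAL_DONE_PENDING : DEMO_SIGNAL_DONE;
}
```

Running demo_process() twice over the same switches yields
DEMO_SIGNAL_DONE_PENDING on the first pass, if any table differed, and
DEMO_SIGNAL_DONE on the second, so a caller waiting for outstanding
transactions is not left waiting for replies that will never arrive.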


From raleigh at systemfabricworks.com  Thu Dec 14 12:07:02 2006
From: raleigh at systemfabricworks.com (Raleigh F Rinehart)
Date: Thu, 14 Dec 2006 14:07:02 -0600
Subject: [openib-general] SA MADs and Cisco SM
Message-ID: <4581AEE6.8060905@systemfabricworks.com>

Hi All,
I am developing an API that uses SA MADs to create and get
ServiceRecords.  I am using the OFED 1.1 mad and umad libraries.
Everything works flawlessly with OpenSM.  However, if I run in a fabric
that uses the SM embedded in a Cisco/Topspin switch, queries
(SubnAdmGetTable) for ServiceRecords fail with status 110 (ETIMEDOUT).
I don't have direct access to the switch or its logs, so I can't tell
what is happening on the switch side; I am working on getting those logs
ASAP.  In the meantime, are there any known interoperability or
functionality issues between OpenIB and the Cisco SM?

Any ideas or pointers in the right direction would be greatly appreciated.

thanks in advance,
-raleigh


From sashak at voltaire.com  Thu Dec 14 12:16:32 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 22:16:32 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il>
References: <4581ACE5.9000109@dev.mellanox.co.il>
Message-ID: <20061214201632.GD7838@sashak.voltaire.com>

On 21:58 Thu 14 Dec     , Yevgeny Kliteynik wrote:
> Hi Hal
> 
> This patch fixes a bug that caused ucast manager to return
> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> Added a boolean flag that marks whether there was some change or not
> (in which case OSM_SIGNAL_DONE should be returned).
> 
> --
> Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Good find.

Sasha


From halr at voltaire.com  Thu Dec 14 12:12:29 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 15:12:29 -0500
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
	'hang'
In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il>
References: <4581ACE5.9000109@dev.mellanox.co.il>
Message-ID: <1166127103.28709.140656.camel@hal.voltaire.com>

Hi Yevgeny,

On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> This patch fixes a bug that caused ucast manager to return
> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> Added a boolean flag that marks whether there was some change or not
> (in which case OSM_SIGNAL_DONE should be returned).

Just wondering, what is the test case for this?

-- Hal


From halr at voltaire.com  Thu Dec 14 12:18:07 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 15:18:07 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
 completion
In-Reply-To: <4581ABD0.7050509@mellanox.co.il>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
	<1166098306.28709.122104.camel@hal.voltaire.com>
	<4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il>
Message-ID: <1166127430.28709.140858.camel@hal.voltaire.com>

On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
> Update on analysis of failures:
> 
> Eitan Zahavi wrote:
> > Hal Rosenstock wrote:
> >   
> >> Hi Eitan,
> >>
> >> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
> >>   
> >>     
> >>> OSM Simulation Regression Summary
> >>> OpenSM rev = ____  
> >>> ibutils rev = ____  
> >>> Total=264 Pass=261 Fail=3
> >>>
> >>> Pass:
> >>> 36 Stability IS1-16.topo
> >>> 36 Pkey IS1-16.topo
> >>> 36 Multicast IS1-16.topo
> >>> 36 LidMgr IS1-16.topo
> >>> 35 OsmStress IS1-16.topo
> >>> 12 Stability IS3-loop.topo
> >>> 12 Stability IS3-128.topo
> >>> 12 Pkey IS3-128.topo
> >>> 12 OsmStress IS3-128.topo
> >>> 12 Multicast IS3-loop.topo
> >>> 11 Multicast IS3-128.topo
> >>> 11 LidMgr IS3-128.topo
> >>>
> >>> Failures:
> >>> 1 OsmStress IS1-16.topo
> >>>       
> Job was killed in the middle. Just an accident.

Is that always the case ? This one has been consistently failing.
I think you had written something about this failure back in July. I can
dig it out if you want.

> >>> 1 Multicast IS3-128.topo
> >>>       
> A single packet was dropped on the way to the SM. Still not clear where.
> However, I have seen a perfectly good link reported by the drop manager 
> as missing.

I think I may have seen this as well on some rare occasions. I could
never figure out why this happened.

> I will rerun some tests with valgrind as  I think this might be a memory 
> corruption issue.

OK.

> >>> 1 LidMgr IS3-128.topo
> >>>       
> Seems like the last sweep started before the last change in LID was 
> made. So it missed one of the nodes.
> Additional sweep was enforced at the end of the test - just to make sure 
> all changes are handled.

So is this being reported as a failure improperly then ?

-- Hal

> >>>     
> >>>       
> >> There are now 2 more failures. You had previously explained OsmStress
> >> failure as needing more investigation. Now there is a Multicast and
> >> LidMgr failure yet nothing really changed since the previous run the
> >> night before. Are these new tests ? What were the failures ?
> >>   
> >>     
> > The tests use random seeds and thus can catch other bugs in each run.
> > I am investigating these failures. Some might be due to bugs in the 
> > checker code too.
> >
> > Please pay attention the failure rate is low (LidMgr pass 36+11 runs 
> > failed 1 test).
> > This to imply the bug is a hard to find one.
> >   
> >> The repetitions have also been reduced from previous reports. Are these
> >> the same or different tests ?
> >>   
> >>     
> > Number of repetitions depends on runtime. The regression started later 
> > thus run less iterations.
> > I run the "same" tests ("same" means same code not same random sequence).
> >   
> >> -- Hal


From eitan at mellanox.co.il  Thu Dec 14 12:24:26 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 14 Dec 2006 22:24:26 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
 completion
In-Reply-To: <1166127430.28709.140858.camel@hal.voltaire.com>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
	<1166098306.28709.122104.camel@hal.voltaire.com>
	<4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il>
	<1166127430.28709.140858.camel@hal.voltaire.com>
Message-ID: <4581B2FA.7090602@mellanox.co.il>

Hal Rosenstock wrote:
> On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
>   
>> Update on analysis of failures:
>>
>> Eitan Zahavi wrote:
>>     
>>> Hal Rosenstock wrote:
>>>   
>>>       
>>>> Hi Eitan,
>>>>
>>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
>>>>   
>>>>     
>>>>         
>>>>> OSM Simulation Regression Summary
>>>>> OpenSM rev = ____  
>>>>> ibutils rev = ____  
>>>>> Total=264 Pass=261 Fail=3
>>>>>
>>>>> Pass:
>>>>> 36 Stability IS1-16.topo
>>>>> 36 Pkey IS1-16.topo
>>>>> 36 Multicast IS1-16.topo
>>>>> 36 LidMgr IS1-16.topo
>>>>> 35 OsmStress IS1-16.topo
>>>>> 12 Stability IS3-loop.topo
>>>>> 12 Stability IS3-128.topo
>>>>> 12 Pkey IS3-128.topo
>>>>> 12 OsmStress IS3-128.topo
>>>>> 12 Multicast IS3-loop.topo
>>>>> 11 Multicast IS3-128.topo
>>>>> 11 LidMgr IS3-128.topo
>>>>>
>>>>> Failures:
>>>>> 1 OsmStress IS1-16.topo
>>>>>       
>>>>>           
>> Job was killed in the middle. Just an accident.
>>     
>
> Is that always the case ? This one has been consistently failing.
> I think you had written something about this failure back in July. I can
> dig it out if you want.
>
>   
>>>>> 1 Multicast IS3-128.topo
>>>>>       
>>>>>           
>> A single packet was dropped on the way to the SM. Still not clear where.
>> However, I have seen a perfectly good link reported by the drop manager 
>> as missing.
>>     
>
> I think I may have seen this as well on some rare occasions. I could
> never figure out why this happened.
>
>   
>> I will rerun some tests with valgrind as  I think this might be a memory 
>> corruption issue.
>>     
>
> OK.
>
>   
>>>>> 1 LidMgr IS3-128.topo
>>>>>       
>>>>>           
>> Seems like the last sweep started before the last change in LID was 
>> made. So it missed one of the nodes.
>> Additional sweep was enforced at the end of the test - just to make sure 
>> all changes are handled.
>>     
>
> So is this being reported as a failure improperly then ?
>   
Well, the test failed. The fix was committed. We will see in the next few
days whether it really was a false alarm.
> -- Hal
>
>   
>>>>>     
>>>>>       
>>>>>           
>>>> There are now 2 more failures. You had previously explained OsmStress
>>>> failure as needing more investigation. Now there is a Multicast and
>>>> LidMgr failure yet nothing really changed since the previous run the
>>>> night before. Are these new tests ? What were the failures ?
>>>>   
>>>>     
>>>>         
>>> The tests use random seeds and thus can catch other bugs in each run.
>>> I am investigating these failures. Some might be due to bugs in the 
>>> checker code too.
>>>
>>> Please pay attention the failure rate is low (LidMgr pass 36+11 runs 
>>> failed 1 test).
>>> This to imply the bug is a hard to find one.
>>>   
>>>       
>>>> The repetitions have also been reduced from previous reports. Are these
>>>> the same or different tests ?
>>>>   
>>>>     
>>>>         
>>> Number of repetitions depends on runtime. The regression started later 
>>> thus run less iterations.
>>> I run the "same" tests ("same" means same code not same random sequence).
>>>   
>>>       
>>>> -- Hal


From rdreier at cisco.com  Thu Dec 14 12:40:52 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 12:40:52 -0800
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214195034.GA7838@sashak.voltaire.com> (Sasha
	Khapyorsky's message of "Thu, 14 Dec 2006 21:50:34 +0200")
References: <20061214181259.GE28849@sashak.voltaire.com>
	<20061214184015.GE12781@mellanox.co.il>
	<20061214195034.GA7838@sashak.voltaire.com>
Message-ID: <aday7pamcrv.fsf@cisco.com>

 > I've used stgit couple of months ago, but switched to core git, today it
 > does everything what stgit did. Don't think however that this is better
 > for publishing - git-rebase and git-reset produce non-linear history.

How do you get the equivalent of

    stg pop
    edit patch
    stg refresh
    stg push

with core git?

 - R.


From sashak at voltaire.com  Thu Dec 14 12:51:09 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 22:51:09 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <aday7pamcrv.fsf@cisco.com>
References: <20061214181259.GE28849@sashak.voltaire.com>
	<20061214184015.GE12781@mellanox.co.il>
	<20061214195034.GA7838@sashak.voltaire.com> <aday7pamcrv.fsf@cisco.com>
Message-ID: <20061214205109.GE7838@sashak.voltaire.com>

On 12:40 Thu 14 Dec     , Roland Dreier wrote:
>  > I've used stgit couple of months ago, but switched to core git, today it
>  > does everything what stgit did. Don't think however that this is better
>  > for publishing - git-rebase and git-reset produce non-linear history.
> 
> How do you get the equivalent of
> 
>     stg pop
>     edit patch
>     stg refresh
>     stg push
> 
> with core git?

  git-reset HEAD^
  edit patch
  git-commit -c ORIG_HEAD

I think there is also 'git-commit --amend', but I haven't used it yet.

Sasha


From tziporet at dev.mellanox.co.il  Thu Dec 14 12:46:43 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 14 Dec 2006 22:46:43 +0200
Subject: [openib-general] reminder: OFED 1.2 meeting on Monday 18-Dec at 9am
	PST
Message-ID: <4581B833.7060600@dev.mellanox.co.il>

Hi All,

I would like to remind everyone that the OFED 1.2 coordination meeting
is next Monday at 9am PST.
Bridge info is the same as for all meetings (sent by Jeff).

Meeting agenda:
1. Status review of the features we agreed upon to make sure all code 
will be ready by end of January for the alpha release.
2. Feedback on the daily build

Reminder: release plan on the Wiki:
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features

Please plan to attend, since our next meeting will not be until
15 January 2007 because of the New Year holiday.

Tziporet


From rdreier at cisco.com  Thu Dec 14 12:46:38 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 12:46:38 -0800
Subject: [openib-general] SA MADs and Cisco SM
In-Reply-To: <4581AEE6.8060905@systemfabricworks.com> (Raleigh F.
	Rinehart's message of "Thu, 14 Dec 2006 14:07:02 -0600")
References: <4581AEE6.8060905@systemfabricworks.com>
Message-ID: <adatzzymci9.fsf@cisco.com>

 > I am developing an API that uses SA MADs to create and get 
 > ServiceRecords.  I am using the OFED1.1 mad and umad libraries.  
 > Everything works flawlessly with OpenSM.  However if I run in a fabric 
 > that is using the SM embedded in a Cisco/Topspin switch, queries 
 > (SubnAdmGetTable) for ServiceRecords fail with a status 110 (ETIMEDOUT).
 > Since I don't have direct access to the switch and logs I can't tell 
 > what is going on from the switch, I am working on getting those logs 
 > asap.  However I was wondering if there were any known issues with 
 > interoperability, or functionality with OpenIB and the Cisco SM?

I don't know of any issues with the Cisco SM, and I do most of my
development using the Cisco SM running on Cisco switches.

However, since you are seeing a problem, it would probably make sense
to work with Cisco support to figure out if there is an issue with the
embedded SM on Cisco switches.  In this case, since the error is
ETIMEDOUT, it might make sense to try your query with a longer
timeout; it could just be that returning a big table is taking longer
than the timeout you set.  Perhaps opensm works because it's running
on a fast server CPU, while the switch SM is running out of gas on the
embedded CPU in the switch.

 - R.


From rdreier at cisco.com  Thu Dec 14 12:50:03 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 12:50:03 -0800
Subject: [openib-general] userspace git trees
In-Reply-To: <20061214205109.GE7838@sashak.voltaire.com> (Sasha
	Khapyorsky's message of "Thu, 14 Dec 2006 22:51:09 +0200")
References: <20061214181259.GE28849@sashak.voltaire.com>
	<20061214184015.GE12781@mellanox.co.il>
	<20061214195034.GA7838@sashak.voltaire.com> <aday7pamcrv.fsf@cisco.com>
	<20061214205109.GE7838@sashak.voltaire.com>
Message-ID: <adalklamcck.fsf@cisco.com>

 > > How do you get the equivalent of
 > > 
 > >     stg pop
 > >     edit patch
 > >     stg refresh
 > >     stg push
 > > 
 > > with core git?
 > 
 >   git-reset HEAD^
 >   edit patch
 >   git-commit -c ORIG_HEAD
 > 
 > I think there is also 'git-commit --amend', but didn't use it yet.

I don't think either of those is really equivalent.  You can edit the
commit at the end of your current branch, but there's no convenient
analog of stg pop/stg push.

Of course stgit is implemented on top of core git so you can
reimplement it by hand, but I do think there is value in the stgit porcelain.

 - R.


From halr at voltaire.com  Thu Dec 14 12:48:11 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 14 Dec 2006 15:48:11 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-14:normal
 completion
In-Reply-To: <4581B2FA.7090602@mellanox.co.il>
References: <200612140711.kBE7BBIH022678@sw053.yok.mtl.com>
	<1166098306.28709.122104.camel@hal.voltaire.com>
	<4581525C.9060104@mellanox.co.il> <4581ABD0.7050509@mellanox.co.il>
	<1166127430.28709.140858.camel@hal.voltaire.com>
	<4581B2FA.7090602@mellanox.co.il>
Message-ID: <1166129270.28709.141866.camel@hal.voltaire.com>

On Thu, 2006-12-14 at 15:24, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Thu, 2006-12-14 at 14:53, Eitan Zahavi wrote:
> >   
> >> Update on analysis of failures:
> >>
> >> Eitan Zahavi wrote:
> >>     
> >>> Hal Rosenstock wrote:
> >>>   
> >>>       
> >>>> Hi Eitan,
> >>>>
> >>>> On Thu, 2006-12-14 at 02:11, Eitan Zahavi wrote:
> >>>>   
> >>>>     
> >>>>         
> >>>>> OSM Simulation Regression Summary
> >>>>> OpenSM rev = ____  
> >>>>> ibutils rev = ____  
> >>>>> Total=264 Pass=261 Fail=3
> >>>>>
> >>>>> Pass:
> >>>>> 36 Stability IS1-16.topo
> >>>>> 36 Pkey IS1-16.topo
> >>>>> 36 Multicast IS1-16.topo
> >>>>> 36 LidMgr IS1-16.topo
> >>>>> 35 OsmStress IS1-16.topo
> >>>>> 12 Stability IS3-loop.topo
> >>>>> 12 Stability IS3-128.topo
> >>>>> 12 Pkey IS3-128.topo
> >>>>> 12 OsmStress IS3-128.topo
> >>>>> 12 Multicast IS3-loop.topo
> >>>>> 11 Multicast IS3-128.topo
> >>>>> 11 LidMgr IS3-128.topo
> >>>>>
> >>>>> Failures:
> >>>>> 1 OsmStress IS1-16.topo
> >>>>>       
> >>>>>           
> >> Job was killed in the middle. Just an accident.
> >>     
> >
> > Is that always the case ? This one has been consistently failing.
> > I think you had written something about this failure back in July. I can
> > dig it out if you want.
> >
> >   
> >>>>> 1 Multicast IS3-128.topo
> >>>>>       
> >>>>>           
> >> A single packet was dropped on the way to the SM. Still not clear where.
> >> However, I have seen a perfectly good link reported by the drop manager 
> >> as missing.
> >>     
> >
> > I think I may have seen this as well on some rare occasions. I could
> > never figure out why this happened.
> >
> >   
> >> I will rerun some tests with valgrind as  I think this might be a memory 
> >> corruption issue.
> >>     
> >
> > OK.
> >
> >   
> >>>>> 1 LidMgr IS3-128.topo
> >>>>>       
> >>>>>           
> >> Seems like the last sweep started before the last change in LID was 
> >> made. So it missed one of the nodes.
> >> Additional sweep was enforced at the end of the test - just to make sure 
> >> all changes are handled.
> >>     
> >
> > So is this being reported as a failure improperly then ?
> >   
> Well the test failed. The fix was committed.

Which fix ? Are you referring to the one Yevgeny just sent ?

-- Hal

>  We will see in the next few 
> days if it is really a false alarm.
> > -- Hal
> >
> >   
> >>>>>     
> >>>>>       
> >>>>>           
> >>>> There are now 2 more failures. You had previously explained OsmStress
> >>>> failure as needing more investigation. Now there is a Multicast and
> >>>> LidMgr failure yet nothing really changed since the previous run the
> >>>> night before. Are these new tests ? What were the failures ?
> >>>>   
> >>>>     
> >>>>         
> >>> The tests use random seeds and thus can catch other bugs in each run.
> >>> I am investigating these failures. Some might be due to bugs in the 
> >>> checker code too.
> >>>
> >>> Please pay attention the failure rate is low (LidMgr pass 36+11 runs 
> >>> failed 1 test).
> >>> This to imply the bug is a hard to find one.
> >>>   
> >>>       
> >>>> The repetitions have also been reduced from previous reports. Are these
> >>>> the same or different tests ?
> >>>>   
> >>>>     
> >>>>         
> >>> Number of repetitions depends on runtime. The regression started later 
> >>> thus run less iterations.
> >>> I run the "same" tests ("same" means same code not same random sequence).
> >>>   
> >>>       
> >>>> -- Hal


From or.gerlitz at gmail.com  Thu Dec 14 12:51:17 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 14 Dec 2006 22:51:17 +0200
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <45819093.3090405@ichips.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
Message-ID: <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>

On 12/14/06, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > What about the rdma_cm_get_option() and rdma_cm_set_option() exposed by
> > librdmacm? is it something which is on its way out?
>
> I did not expose those to userspace at this time.  I believe what was there
> needed to be reworked.  For example, the timeout could be generic, rather than
> IB specific, and the option to get a list of path records should be eliminated.

I see. I understand that some code that is part of OFED (uDAPL) uses
this API. What would you suggest its maintainers do, given that the
code you have posted is to be the basis for OFED 1.2?

Or.


From rdreier at cisco.com  Thu Dec 14 13:00:34 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 13:00:34 -0800
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <20061214185210.GH12781@mellanox.co.il> (Michael S.
	Tsirkin's message of "Thu, 14 Dec 2006 20:52:10 +0200")
References: <ada4przsa6v.fsf@cisco.com> <20061214170455.GA12781@mellanox.co.il>
	<20061214185210.GH12781@mellanox.co.il>
Message-ID: <adad56mmbv1.fsf@cisco.com>

 > > With current code firmware might be doing WRITE_MTT while CPU is writing to the
 > > same cache line, and I expect this might confuse things, but it seems that with
 > > my fmr/mr merge patch, we never have both CPU and firmware write to the same
 > > MTTs entries.
 > > 
 > > So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code
 > > sufficient?

Yes, assuming that the CPU is the only entity ever writing to the MTT
table, then doing pci_dma_sync_sg_for_cpu() before writing and
pci_dma_sync_sg_for_device() afterwards should be OK.  I think.

 > Documentation/DMA-mapping.txt actually says:
 > 
 > > Without that, you'd see cacheline
 > > sharing problems (data corruption) on CPUs with DMA-incoherent caches.
 > > (The CPU could write to one word, DMA would write to a different one
 > >  in the same cache line, and one of them could be overwritten.)

Not sure what the relevance of that is -- it's kind of making the
opposite point, that you need to make sure the CPU never touches a
cacheline that might be DMAed at the same point.  The part you snipped
mentions alignment problems.

What saves us for the MTT table is that with your patch the device
never writes to the MTT table at all.

 - R.
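
The ordering described above (sync for the CPU, write the entries, sync
for the device) can be modelled in userspace as an ownership handoff.
This is a sketch only: the demo_* names are hypothetical, the two sync
helpers merely stand in for pci_dma_sync_sg_for_cpu() and
pci_dma_sync_sg_for_device(), and the table layout is not the mthca MTT
format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MTT_ENTRIES 8

enum demo_owner { DEMO_OWNER_DEVICE, DEMO_OWNER_CPU };

/* Hypothetical model of a DMA-mapped table that only the CPU writes. */
struct demo_mtt_table {
	uint64_t entries[DEMO_MTT_ENTRIES];
	enum demo_owner owner;
};

/* Stands in for pci_dma_sync_sg_for_cpu(): the CPU may now touch it. */
static void demo_sync_for_cpu(struct demo_mtt_table *t)
{
	t->owner = DEMO_OWNER_CPU;
}

/* Stands in for pci_dma_sync_sg_for_device(): hand it back. */
static void demo_sync_for_device(struct demo_mtt_table *t)
{
	t->owner = DEMO_OWNER_DEVICE;
}

/* Writes count entries starting at index first; returns 0 or -1. */
static int demo_write_mtts(struct demo_mtt_table *t, size_t first,
			   const uint64_t *addrs, size_t count)
{
	size_t i;

	if (first > DEMO_MTT_ENTRIES || count > DEMO_MTT_ENTRIES - first)
		return -1;

	demo_sync_for_cpu(t);
	for (i = 0; i < count; i++) {
		/* Writes are legal only while the CPU owns the buffer. */
		assert(t->owner == DEMO_OWNER_CPU);
		t->entries[first + i] = addrs[i];
	}
	demo_sync_for_device(t);
	return 0;
}
```

The model holds only under the condition stated in the message: the
device never writes to the table, so no cache line is written by both
sides.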


From or.gerlitz at gmail.com  Thu Dec 14 13:07:50 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Thu, 14 Dec 2006 23:07:50 +0200
Subject: [openib-general] (no subject)
In-Reply-To: <adaslfio8gi.fsf@cisco.com>
References: <ada8xhctztu.fsf@cisco.com> <457FB82B.4090902@voltaire.com>
	<adavekfqvhd.fsf@cisco.com> <45810901.3090209@voltaire.com>
	<adaslfio8gi.fsf@cisco.com>
Message-ID: <15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com>

On 12/14/06, Roland Dreier <rdreier at cisco.com> wrote:
>> mmm, I understand all the comments raised during the review were fixed
>> in the V3 post below, and now you say its both wrong and ugly... for
>> example what's wrong here?

> I take back the wrong statement, I misread the patch just now.

good, we are making some progress...

> But if you don't think the patch is ugly then I don't think we're looking at
> the same thing. For example

>> +static int __devinit mthca_check_profile_value(int* pval, int pval_default){

> and so on...

I see. Being less familiar with __devinit and friends, I will have to
educate myself a little to see why the current patch is ugly...
Anyway, thanks for agreeing to fix it yourself.

Or.


From sashak at voltaire.com  Thu Dec 14 13:14:02 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 14 Dec 2006 23:14:02 +0200
Subject: [openib-general] userspace git trees
In-Reply-To: <adalklamcck.fsf@cisco.com>
References: <20061214181259.GE28849@sashak.voltaire.com>
	<20061214184015.GE12781@mellanox.co.il>
	<20061214195034.GA7838@sashak.voltaire.com> <aday7pamcrv.fsf@cisco.com>
	<20061214205109.GE7838@sashak.voltaire.com> <adalklamcck.fsf@cisco.com>
Message-ID: <20061214211402.GF7838@sashak.voltaire.com>

On 12:50 Thu 14 Dec     , Roland Dreier wrote:
>  > > How do you get the equivalent of
>  > > 
>  > >     stg pop
>  > >     edit patch
>  > >     stg refresh
>  > >     stg push
>  > > 
>  > > with core git?
>  > 
>  >   git-reset HEAD^
>  >   edit patch
>  >   git-commit -c ORIG_HEAD
>  > 
>  > I think there is also 'git-commit --amend', but didn't use it yet.
> 
> I don't think either of those is really equivalent.  You can edit the
> commit at the end of your current branch, but there's no convenient
> analog of stg pop/stg push.

In the "worst" case - git-format-patch/git-am always help.

> Of course stgit is implemented on top of core git so you can
> reimplement it by hand, but I do think there is value in the stgit porcelain.

Sure. I have nothing against stgit; it is a nice tool and I used it
successfully for a couple of months (I switched not because stgit was bad,
but because I needed to deal with core git for other stuff anyway).

Sasha


From rdreier at cisco.com  Thu Dec 14 13:12:36 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 13:12:36 -0800
Subject: [openib-general] (no subject)
In-Reply-To: <15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com>
	(Or Gerlitz's message of "Thu, 14 Dec 2006 23:07:50 +0200")
References: <ada8xhctztu.fsf@cisco.com> <457FB82B.4090902@voltaire.com>
	<adavekfqvhd.fsf@cisco.com> <45810901.3090209@voltaire.com>
	<adaslfio8gi.fsf@cisco.com>
	<15ddcffd0612141307r24c95f6ag7bf75482705fa125@mail.gmail.com>
Message-ID: <ada8xhambaz.fsf@cisco.com>

 > I see. Being less familiar with __devinit and friends, will have to
 > educate myself a little to see why the current patch is ugly...
 > anyway, thanks for agreeing to fix it yourself.

>> +static int __devinit mthca_check_profile_value(int* pval, int pval_default){

No, not the __devinit part -- I meant whitespace in "pval_default){".
There's crazy indentation all over, whitespace breakage like

 > +		if (default_profile.fmr_reserved_mtts >= default_profile.num_mtt ) {

And the macro

+#define mthca_check_profile_and_warn(name, var, defval) \
+	if (mthca_check_profile_value(&var, defval)) \
+		mthca_warn(mdev, "invalid %s passed. changed to %d.\n", #name, var); 

is a little crazy -- why can't that if () statement be part of the
function too?
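
One way that could look, as a rough standalone sketch rather than the actual mthca code: the function takes the parameter name and emits the warning itself, so no wrapper macro is needed.  The function name, the "must be positive" validity rule and the use of fprintf() in place of mthca_warn() are all assumptions made for illustration.

```c
#include <stdio.h>

/*
 * Hypothetical replacement for the check function plus the warning
 * macro.  The validity rule (value must be positive) is assumed.
 * Returns 1 if *pval was replaced by the default, 0 otherwise.
 */
static int check_profile_value(const char *name, int *pval, int pval_default)
{
	if (*pval > 0)
		return 0;

	*pval = pval_default;
	/* Stand-in for mthca_warn(); the driver would pass mdev here. */
	fprintf(stderr, "invalid %s passed. changed to %d.\n", name, *pval);
	return 1;
}
```

The caller then passes the name as a string, which costs a little typing compared with the macro's #name but keeps the control flow in one place.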

Anyway...

 - R.



From swise at opengridcomputing.com  Thu Dec 14 13:28:47 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Thu, 14 Dec 2006 15:28:47 -0600
Subject: [openib-general] librdmacm git repos needs config dir
Message-ID: <1166131727.12420.9.camel@stevo-desktop>

Sean,

The librdmacm git repository needs a config dir or autoconf changes to
make that dir as part of config.  I'm not an autoconf wiz, so I just
created the config dir and put a hidden file named .gitignore in it for
libamso. That way it's created when folks clone it.  Dunno if that's the
best way, but it worked...


Steve.


From akepner at sgi.com  Thu Dec 14 13:08:26 2006
From: akepner at sgi.com (akepner at sgi.com)
Date: Thu, 14 Dec 2006 13:08:26 -0800 (PST)
Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race
In-Reply-To: <ada8xhaq5ze.fsf@cisco.com>
References: <Pine.LNX.4.61.0612131626250.24974@localhost.localdomain>
	<ada8xhaq5ze.fsf@cisco.com>
Message-ID: <Pine.LNX.4.61.0612141303200.30447@localhost.localdomain>

On Wed, 13 Dec 2006, Roland Dreier wrote:

> Are there other possible ordering problems involving user memory (not
> in a CQ or QP)?  Something like a CPU on node A writing to memory on
> node B and then posting a work request that makes the HCA DMA from
> that memory on node B, and having the work request doorbell reach the
> HCA before the write to node B actually happens, so the HCA DMAs the
> old contents of node B's memory?

Well, this case could be handled with mb() operations (if
I understand you correctly). The type of race I had in mind
is between DMA operations and updates to data structures
shared between the host and HCA. But, yes, the example I
used was only one of the possibilities of this type of race.
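
For the simpler write-then-doorbell case, the ordering can be sketched in userspace C11 as below.  This is only an analogy: the struct and function names are made up, and a C11 fence orders CPU memory accesses but is not by itself a guarantee about MMIO or DMA ordering on real hardware, which is where the kernel's mb() and friends come in.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins: a buffer the device reads and its doorbell. */
struct fake_hca {
	uint32_t payload;
	atomic_uint doorbell;
};

static void post_work(struct fake_hca *hca, uint32_t data)
{
	/* 1. Write the data the device is going to fetch. */
	hca->payload = data;

	/* 2. Full fence, playing the role mb() plays in the kernel. */
	atomic_thread_fence(memory_order_seq_cst);

	/* 3. Only then ring the doorbell. */
	atomic_store_explicit(&hca->doorbell, 1, memory_order_relaxed);
}
```

Without step 2 the doorbell store could become visible before the payload store, which is the "HCA DMAs the old contents" scenario quoted above.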

>
> I guess the only feasible solution to the problem you're pointing out
> is to have libmthca use some special mmap()-based allocator for queues
> so that the kernel can give it memory that has the special
> dma_map_consistent treatment.

That's an excellent idea. (And, now that you've mentioned it,
it's "obvious" ;-) I'll see what I can come up with using
this approach.

>
> Ugh.
>

Well stated.

-- 
Arthur


From mshefty at ichips.intel.com  Thu Dec 14 13:40:05 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 13:40:05 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
	<15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
Message-ID: <4581C4B5.5020702@ichips.intel.com>

> I see. I understand that there is some code which is part of OFED
> (udapl) that uses this api, what were you thinking to suggest them to
> do in the spirit of this code you have posted being the basis for OFED
> 1.2 ?

DAPL has been updated to remove its use of these calls.  The rdma cm timeout is
essentially 1 minute now.  If needed, a kernel fix can be applied to send an MRA
to increase the timeout, but I'm holding off on doing that unless it's really
needed.

- Sean


From mst at mellanox.co.il  Thu Dec 14 13:40:21 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Thu, 14 Dec 2006 23:40:21 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <adad56mmbv1.fsf@cisco.com>
References: <ada4przsa6v.fsf@cisco.com> <20061214170455.GA12781@mellanox.co.il>
	<20061214185210.GH12781@mellanox.co.il> <adad56mmbv1.fsf@cisco.com>
Message-ID: <20061214214021.GB19449@mellanox.co.il>

> What saves us for the MTT table is that with your patch the device
> never writes to the MTT table at all.

Except for the reserved MTTs.

-- 
MST


From sweitzen at cisco.com  Thu Dec 14 13:47:14 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Thu, 14 Dec 2006 13:47:14 -0800
Subject: [openib-general] Cisco OFED 1.1 now available
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B99D6D@xmb-sjc-216.amer.cisco.com>

Cisco OFED 1.1 includes OFED 1.1 source code (same source code as that
on openfabrics.org), binary RPMS for RHEL4 and SLES10, firmware for the
tvflash utility, and Cisco documentation.  Anyone who registers at
cisco.com can download it, but you need a Cisco support contract to get
technical support from Cisco.
 
http://www.cisco.com/cgi-bin/tablebuild.pl/sfs-linux
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

From rdreier at cisco.com  Thu Dec 14 13:54:36 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Thu, 14 Dec 2006 13:54:36 -0800
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <20061214214021.GB19449@mellanox.co.il> (Michael S.
	Tsirkin's message of "Thu, 14 Dec 2006 23:40:21 +0200")
References: <ada4przsa6v.fsf@cisco.com> <20061214170455.GA12781@mellanox.co.il>
	<20061214185210.GH12781@mellanox.co.il> <adad56mmbv1.fsf@cisco.com>
	<20061214214021.GB19449@mellanox.co.il>
Message-ID: <ada4prym9cz.fsf@cisco.com>

 > > What saves us for the MTT table is that with your patch the device
 > > never writes to the MTT table at all.
 > 
 > Except for the reserved MTTs.

Good point.  So I guess we need a patch that makes sure all reserved
MTTs are given their own ICM chunk (which doesn't need to be in
lowmem) to fix things.

 - R.


From bugzilla-daemon at openib.org  Thu Dec 14 14:45:05 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 14:45:05 -0800 (PST)
Subject: [openib-general] [Bug 172] Need an interface to load alternate path
	to RC QP
Message-ID: <20061214224505.30D002283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=172


sean.hefty at intel.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from sean.hefty at intel.com  2006-12-14 14:45 -------
ib_cm_init_qp_attr() was expanded to handle setting the QP attributes for an
alternate path.  ib_cm_establish() was renamed to ib_cm_notify() to allow the
user to signal to the CM that failover has occurred on a connection.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at openib.org  Thu Dec 14 14:46:06 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 14:46:06 -0800 (PST)
Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails
	with -EINVAL
Message-ID: <20061214224606.ADAD42283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=160


sean.hefty at intel.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


------- Comment #2 from sean.hefty at intel.com  2006-12-14 14:46 -------
Fix applied to the upstream version of ib_cm.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at openib.org  Thu Dec 14 15:08:07 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 15:08:07 -0800 (PST)
Subject: [openib-general] [Bug 159] OFED1.0: Missing interfaces
Message-ID: <20061214230807.94EE02283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=159


sean.hefty at intel.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |ASSIGNED


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From kliteyn at dev.mellanox.co.il  Thu Dec 14 15:27:43 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 15 Dec 2006 01:27:43 +0200
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [1/2]
Message-ID: <4581DDEF.7000206@dev.mellanox.co.il>

Hi Hal

This patch (1/2) adds the Fat Tree routing engine to OpenSM.

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/Makefile.am  |    2 +-
 osm/opensm/main.c       |    3 ++-
 osm/opensm/osm_opensm.c |    2 ++
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
index b273eca..64b984b 100644
--- a/osm/opensm/Makefile.am
+++ b/osm/opensm/Makefile.am
@@ -87,7 +87,7 @@ opensm_SOURCES = main.c osm_console.c os
 		 osm_sw_info_rcv_ctrl.c osm_switch.c \
 		 osm_prtn.c osm_prtn_config.c osm_qos.c \
 		 osm_trap_rcv.c osm_trap_rcv_ctrl.c \
-		 osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c \
+		 osm_ucast_mgr.c osm_ucast_updn.c osm_ucast_file.c osm_ucast_ftree.c \
 		 osm_vl15intf.c osm_vl_arb_rcv.c \
 		 osm_vl_arb_rcv_ctrl.c st.c
 if OSMV_OPENIB
diff --git a/osm/opensm/main.c b/osm/opensm/main.c
index ca9a749..7b1c325 100644
--- a/osm/opensm/main.c
+++ b/osm/opensm/main.c
@@ -172,7 +172,8 @@ show_usage(void)
   printf( "-R\n"
           "--routing_engine <engine name>\n"
           "          This option chooses routing engine instead of Min Hop\n"
-          "          algorithm (default). Supported engines: updn, file\n\n");
+          "          algorithm (default).\n"
+          "          Supported engines: updn, file, ftree.\n\n");
   printf( "-M\n"
           "--lid_matrix_file <file name>\n"
           "          This option specifies the name of the lid matrix dump file\n"
diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c
index 52ae75a..9cac636 100644
--- a/osm/opensm/osm_opensm.c
+++ b/osm/opensm/osm_opensm.c
@@ -74,6 +74,7 @@ struct routing_engine_module {
 
 extern int osm_ucast_updn_setup(osm_opensm_t *p_osm);
 extern int osm_ucast_file_setup(osm_opensm_t *p_osm);
+extern int osm_ucast_ftree_setup(osm_opensm_t *p_osm);
 
 static int osm_ucast_null_setup(osm_opensm_t *p_osm);
 
@@ -81,6 +82,7 @@ const static struct routing_engine_modul
 	{ "null", osm_ucast_null_setup },
 	{ "updn", osm_ucast_updn_setup },
 	{ "file", osm_ucast_file_setup },
+	{ "ftree", osm_ucast_ftree_setup },
 	{ NULL, NULL }
 };
 
-- 
1.4.4.1.GIT


From kliteyn at dev.mellanox.co.il  Thu Dec 14 15:27:59 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 15 Dec 2006 01:27:59 +0200
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine [2/2]
Message-ID: <4581DDFF.2000903@dev.mellanox.co.il>

Hi Hal

This patch (2/2) adds the Fat Tree routing engine to OpenSM.

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c | 2936 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 2936 insertions(+), 0 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
new file mode 100644
index 0000000..15e4cd0
--- /dev/null
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -0,0 +1,2936 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Implementation of OpenSM FatTree routing
+ *
+ * Environment:
+ *    Linux User Mode
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#include <errno.h>
+#include <iba/ib_types.h>
+#include <complib/cl_qmap.h>
+#include <complib/cl_pool.h>
+#include <complib/cl_debug.h>
+#include <opensm/osm_opensm.h>
+#include <opensm/osm_switch.h>
+
+/* This var is predefined and initialized */
+extern osm_opensm_t osm;
+
+/*
+ * FatTree rank is bounded between 2 and 8:
+ *  - Tree of rank 1 has only trivial routing paths,
+ *    so no need to use FatTree routing.
+ *  - Why maximum rank is 8:
+ *    Each node (switch) is assigned a unique tuple.
+ *    Switches are stored in two cl_qmaps - one is 
+ *    ordered by guid, and the other by a key that is
+ *    generated from tuple. Since cl_qmap supports only
+ *    a 64-bit key, the maximal tuple length is 8 bytes,
+ *    which means that the maximal tree rank is 8.
+ * Note that the above also implies that each switch 
+ * can have at max 255 up/down ports.
+ */
+
+#define FAT_TREE_MIN_RANK 2
+#define FAT_TREE_MAX_RANK 8
+
+typedef enum {
+   FTREE_DIRECTION_DOWN = -1,
+   FTREE_DIRECTION_SAME,
+   FTREE_DIRECTION_UP
+} ftree_direction_t;
+
+
+/***************************************************
+ **
+ **  Forward references
+ **
+ ***************************************************/
+
+struct ftree_sw_t_;
+struct ftree_hca_t_;
+struct ftree_port_t_;
+struct ftree_port_group_t_;
+struct ftree_fabric_t_;
+
+/***************************************************
+ **
+ **  ftree_tuple_t definition
+ **
+ ***************************************************/
+
+#define FTREE_TUPLE_BUFF_LEN 1024
+#define FTREE_TUPLE_LEN 8
+
+typedef uint8_t ftree_tuple_t[FTREE_TUPLE_LEN];
+typedef uint64_t ftree_tuple_key_t;
+
+/***************************************************
+ **
+ **  ftree_sw_table_element_t definition
+ **
+ ***************************************************/
+
+typedef struct {
+   cl_map_item_t map_item;
+   struct ftree_sw_t_ * p_sw;
+} ftree_sw_tbl_element_t;
+
+/***************************************************
+ **
+ **  ftree_fwd_tbl_t definition
+ **
+ ***************************************************/
+
+typedef uint8_t * ftree_fwd_tbl_t;
+#define FTREE_FWD_TBL_LEN (IB_LID_UCAST_END_HO + 1)
+
+/***************************************************
+ **
+ **  ftree_port_t definition
+ **
+ ***************************************************/
+
+typedef struct ftree_port_t_ 
+{
+   cl_map_item_t  map_item;
+   uint16_t       port_num;           /* port number on the current node */
+   uint16_t       remote_port_num;    /* port number on the remote node */
+   uint32_t       counter_up;         /* number of allocated routes upwards */
+   uint32_t       counter_down;       /* number of allocated routes downwards */
+} ftree_port_t;
+
+/***************************************************
+ **
+ **  ftree_port_group_t definition
+ **
+ ***************************************************/
+
+typedef struct ftree_port_group_t_
+{
+   cl_map_item_t  map_item;
+   ib_net16_t     base_lid;           /* base lid of the current node */
+   uint8_t        lmc;                /* LMC of the current node */
+   ib_net16_t     remote_base_lid;    /* base lid of the remote node */
+   uint8_t        remote_lmc;         /* LMC of the remote node */
+   ib_net64_t     port_guid;          /* port guid of this port */
+   ib_net64_t     remote_port_guid;   /* port guid of the remote port */
+   ib_net64_t     remote_node_guid;   /* node guid of the remote node */
+   uint8_t        remote_node_type;   /* IB_NODE_TYPE_{CA,SWITCH,ROUTER,...} */
+   union remote_hca_or_sw_
+   {
+      struct ftree_hca_t_ * remote_hca;
+      struct ftree_sw_t_  * remote_sw;
+   } remote_hca_or_sw;                /* pointer to remote hca/switch */
+   cl_ptr_vector_t ports;             /* vector of ports to the same lid */
+} ftree_port_group_t;
+
+/***************************************************
+ **
+ **  ftree_sw_t definition
+ **
+ ***************************************************/
+
+typedef struct ftree_sw_t_ 
+{
+   cl_map_item_t          map_item;
+   osm_switch_t         * p_osm_sw;
+   uint8_t                rank;
+   ftree_tuple_t          tuple;
+   ib_net16_t             base_lid;
+   uint8_t                lmc;
+   ftree_port_group_t  ** down_port_groups;
+   uint16_t               down_port_groups_num;
+   ftree_port_group_t  ** up_port_groups;
+   uint16_t               up_port_groups_num;
+   ftree_fwd_tbl_t        lft_buf;
+} ftree_sw_t;
+
+/***************************************************
+ **
+ **  ftree_hca_t definition
+ **
+ ***************************************************/
+
+typedef struct ftree_hca_t_ {
+   cl_map_item_t          map_item;
+   osm_node_t           * p_osm_node;
+   ftree_port_group_t  ** up_port_groups;
+   uint16_t               up_port_groups_num;
+} ftree_hca_t;
+
+/***************************************************
+ **
+ **  ftree_fabric_t definition
+ **
+ ***************************************************/
+
+typedef struct ftree_fabric_t_ 
+{
+   cl_qmap_t     hca_tbl;
+   cl_qmap_t     sw_tbl;
+   cl_qmap_t     sw_by_tuple_tbl;
+   uint32_t      tree_rank;
+   ftree_sw_t ** leaf_switches;
+   uint32_t      leaf_switches_num;
+   uint16_t      max_hcas_per_leaf;
+   cl_pool_t     sw_fwd_tbl_pool;
+} ftree_fabric_t;
+
+/***************************************************
+ **
+ ** comparators
+ **
+ ***************************************************/
+
+int
+__osm_ftree_compare_switches_by_index(
+   IN  const void * p1, 
+   IN  const void * p2)
+{
+   ftree_sw_t ** pp_sw1 = (ftree_sw_t **)p1; 
+   ftree_sw_t ** pp_sw2 = (ftree_sw_t **)p2; 
+
+   uint16_t i;
+   for (i = 0; i < FTREE_TUPLE_LEN; i++)
+   {
+      if ((*pp_sw1)->tuple[i] > (*pp_sw2)->tuple[i])
+         return 1;
+      if ((*pp_sw1)->tuple[i] < (*pp_sw2)->tuple[i])
+         return -1;
+   }
+   return 0;
+}
+
+/***************************************************/
+
+int
+__osm_ftree_compare_port_groups_by_remote_switch_index(
+   IN  const void * p1, 
+   IN  const void * p2)
+{
+   ftree_port_group_t ** pp_g1 = (ftree_port_group_t **)p1; 
+   ftree_port_group_t ** pp_g2 = (ftree_port_group_t **)p2; 
+
+   return __osm_ftree_compare_switches_by_index( 
+                  &((*pp_g1)->remote_hca_or_sw.remote_sw),
+                  &((*pp_g2)->remote_hca_or_sw.remote_sw) );
+}
+
+/***************************************************/
+
+boolean_t
+__osm_ftree_sw_less_by_index(
+   IN  ftree_sw_t * p_sw1,
+   IN  ftree_sw_t * p_sw2)
+{
+   if (__osm_ftree_compare_switches_by_index((void *)&p_sw1,
+                                             (void *)&p_sw2) < 0)
+      return TRUE;
+   return FALSE;
+}
+
+/***************************************************/
+
+boolean_t
+__osm_ftree_sw_greater_by_index(
+   IN  ftree_sw_t * p_sw1,
+   IN  ftree_sw_t * p_sw2)
+{
+   if (__osm_ftree_compare_switches_by_index((void *)&p_sw1,
+                                             (void *)&p_sw2) > 0)
+      return TRUE;
+   return FALSE;
+}
+
+/***************************************************
+ **
+ ** ftree_tuple_t functions
+ **
+ ***************************************************/
+
+static void 
+__osm_ftree_tuple_init(
+   IN  ftree_tuple_t tuple)
+{
+   memset(tuple, 0xFF, FTREE_TUPLE_LEN);
+}
+
+/***************************************************/
+
+static inline boolean_t
+__osm_ftree_tuple_assigned(
+   IN  ftree_tuple_t tuple)
+{
+   return (tuple[0] != 0xFF);
+}
+
+/***************************************************/
+
+#define FTREE_TUPLE_BUFFERS_NUM 6
+
+static char * 
+__osm_ftree_tuple_to_str(
+   IN  ftree_tuple_t tuple)
+{
+   static char buffer[FTREE_TUPLE_BUFFERS_NUM][FTREE_TUPLE_BUFF_LEN];
+   static uint8_t ind = 0;
+   char * ret_buffer;
+   uint32_t i;
+
+   if (!__osm_ftree_tuple_assigned(tuple))
+      return "INDEX.NOT.ASSIGNED";
+
+   buffer[ind][0] = '\0';
+
+   for(i = 0; (i < FTREE_TUPLE_LEN) && (tuple[i] != 0xFF); i++)
+   {
+      if ((strlen(buffer[ind]) + 10) > FTREE_TUPLE_BUFF_LEN)
+         return "INDEX.TOO.LONG";
+      if (i != 0)
+         strcat(buffer[ind],".");
+      sprintf(&buffer[ind][strlen(buffer[ind])], "%u", tuple[i]);
+   }
+
+   ret_buffer = buffer[ind];
+   ind = (ind + 1) % FTREE_TUPLE_BUFFERS_NUM;
+   return ret_buffer;
+} /* __osm_ftree_tuple_to_str() */
+
+/***************************************************/
+
+static inline ftree_tuple_key_t 
+__osm_ftree_tuple_to_key(
+   IN  ftree_tuple_t tuple)
+{
+   ftree_tuple_key_t key;
+   memcpy(&key, tuple, FTREE_TUPLE_LEN);
+   return key;
+}
+
+/***************************************************/
+
+static inline void 
+__osm_ftree_tuple_from_key(
+   IN  ftree_tuple_t tuple, 
+   IN  ftree_tuple_key_t key)
+{
+   memcpy(tuple, &key, FTREE_TUPLE_LEN);
+}
+
+/***************************************************
+ **
+ ** ftree_sw_tbl_element_t functions
+ **
+ ***************************************************/
+
+static ftree_sw_tbl_element_t *
+__osm_ftree_sw_tbl_element_create(
+   IN  ftree_sw_t * p_sw)
+{
+   ftree_sw_tbl_element_t * p_element = 
+      (ftree_sw_tbl_element_t *) malloc(sizeof(ftree_sw_tbl_element_t));
+   if (!p_element)
+       return NULL;
+   memset(p_element, 0,sizeof(ftree_sw_tbl_element_t));
+
+   if (p_element)
+      p_element->p_sw = p_sw;
+   return p_element;
+}
+
+/***************************************************/
+
+static void
+__osm_ftree_sw_tbl_element_destroy(
+   IN  ftree_sw_tbl_element_t * p_element)
+{
+   if (!p_element)
+      return;
+   free(p_element);
+}
+
+/***************************************************
+ **
+ ** ftree_port_t functions
+ **
+ ***************************************************/
+
+static ftree_port_t * 
+__osm_ftree_port_create( 
+   IN  uint16_t port_num,
+   IN  uint16_t remote_port_num)
+{
+   ftree_port_t * p_port = (ftree_port_t *)malloc(sizeof(ftree_port_t));
+   if (!p_port)
+      return NULL;
+   memset(p_port,0,sizeof(ftree_port_t));
+
+   p_port->port_num = port_num;
+   p_port->remote_port_num = remote_port_num;
+
+   return p_port;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_port_destroy(
+   IN  ftree_port_t * p_port)
+{
+   if(p_port)
+      free(p_port);
+}
+
+/***************************************************
+ **
+ ** ftree_port_group_t functions
+ **
+ ***************************************************/
+
+static ftree_port_group_t * 
+__osm_ftree_port_group_create( 
+   IN  ib_net16_t    base_lid,
+   IN  uint8_t       lmc,
+   IN  ib_net16_t    remote_base_lid,
+   IN  uint8_t       remote_lmc,
+   IN  ib_net64_t  * p_port_guid,
+   IN  ib_net64_t  * p_remote_port_guid,
+   IN  ib_net64_t  * p_remote_node_guid,
+   IN  uint8_t       remote_node_type,
+   IN  void        * p_remote_hca_or_sw)
+{
+   ftree_port_group_t * p_group = 
+            (ftree_port_group_t *)malloc(sizeof(ftree_port_group_t));
+   if (p_group == NULL) 
+      return NULL;
+   memset(p_group, 0, sizeof(ftree_port_group_t));
+
+   p_group->base_lid = base_lid;
+   p_group->lmc = lmc;
+   p_group->remote_base_lid = remote_base_lid;
+   p_group->remote_lmc = remote_lmc;
+   memcpy(&p_group->port_guid, p_port_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->remote_port_guid, p_remote_port_guid, sizeof(ib_net64_t));
+   memcpy(&p_group->remote_node_guid, p_remote_node_guid, sizeof(ib_net64_t));
+
+   p_group->remote_node_type = remote_node_type;
+   switch (remote_node_type)
+   {
+      case IB_NODE_TYPE_CA:
+         p_group->remote_hca_or_sw.remote_hca = (ftree_hca_t *)p_remote_hca_or_sw;
+         break;
+      case IB_NODE_TYPE_SWITCH:
+         p_group->remote_hca_or_sw.remote_sw = (ftree_sw_t *)p_remote_hca_or_sw;
+         break;
+      default:
+         /* we shouldn't get here - port is created only in hca or switch */
+         CL_ASSERT(0);
+   }
+
+   cl_ptr_vector_init(&p_group->ports,
+                      0,  /* min size */
+                      8); /* grow size */
+   return p_group;
+} /* __osm_ftree_port_group_create() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_port_group_destroy(
+   IN  ftree_port_group_t * p_group)
+{
+   uint32_t i;
+   uint32_t size;
+   ftree_port_t * p_port;
+
+   if (!p_group)
+      return;
+
+   /* remove all the elements of p_group->ports vector */
+   size = cl_ptr_vector_get_size(&p_group->ports);
+   for (i = 0; i < size; i++)
+   {
+      cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port);
+      __osm_ftree_port_destroy(p_port);
+   }
+   cl_ptr_vector_destroy(&p_group->ports);
+   free(p_group);
+} /* __osm_ftree_port_group_destroy() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_port_group_dump(
+   IN  ftree_port_group_t * p_group,
+   IN  ftree_direction_t direction)
+{
+   ftree_port_t * p_port;
+   uint32_t size;
+   uint32_t i;
+   char buff[10*1024];
+
+   if (!p_group)
+      return;
+
+   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+      return;
+
+   size = cl_ptr_vector_get_size(&p_group->ports);
+   buff[0] = '\0';
+
+   for (i = 0; i < size; i++)
+   {
+      cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port);
+      CL_ASSERT(p_port);
+
+      if (i != 0)
+         strcat(buff,", ");
+      sprintf(buff + strlen(buff), "%u", p_port->port_num);
+   }
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,
+           "__osm_ftree_port_group_dump:"
+           "    Port Group of size %u, port(s): %s, direction: %s\n" 
+           "                  Local <--> Remote GUID (LID):"
+           "0x%016" PRIx64 " (0x%x) <--> 0x%016" PRIx64 " (0x%x)\n", 
+           size,
+           buff,
+           (direction == FTREE_DIRECTION_DOWN)? "DOWN" : "UP",
+           cl_ntoh64(p_group->port_guid),
+           cl_ntoh16(p_group->base_lid),
+           cl_ntoh64(p_group->remote_port_guid),
+           cl_ntoh16(p_group->remote_base_lid));
+
+} /* __osm_ftree_port_group_dump() */
+
+/***************************************************/
+
+static void
+__osm_ftree_port_group_add_port(
+   IN  ftree_port_group_t * p_group,
+   IN  uint16_t             port_num,
+   IN  uint16_t             remote_port_num)
+{
+   uint16_t i;
+   ftree_port_t * p_port;
+
+   for (i = 0; i < cl_ptr_vector_get_size(&p_group->ports); i++)
+   {
+      cl_ptr_vector_at(&p_group->ports, i, (void **)&p_port);
+      if (p_port->port_num == port_num)
+         return;
+   }
+
+   p_port = __osm_ftree_port_create(port_num,remote_port_num);
+   cl_ptr_vector_insert(&p_group->ports, p_port, NULL);
+}
+
+/***************************************************
+ **
+ ** ftree_sw_t functions
+ **
+ ***************************************************/
+
+static ftree_sw_t * 
+__osm_ftree_sw_create(
+   IN  ftree_fabric_t * p_ftree,
+   IN  osm_switch_t   * p_osm_sw)
+{
+   ftree_sw_t * p_sw;
+   uint8_t ports_num;
+
+   /* make sure that the switch has ports */
+   if (osm_switch_get_num_ports(p_osm_sw) == 1)
+      return NULL;
+
+   p_sw = (ftree_sw_t *)malloc(sizeof(ftree_sw_t));
+   if (p_sw == NULL) 
+      return NULL;
+   memset(p_sw, 0, sizeof(ftree_sw_t));
+
+   p_sw->p_osm_sw = p_osm_sw;
+   p_sw->rank = 0xFF;
+   __osm_ftree_tuple_init(p_sw->tuple);
+
+   p_sw->base_lid = osm_node_get_base_lid(osm_switch_get_node_ptr(p_sw->p_osm_sw),0);
+
+   ports_num = osm_node_get_num_physp(osm_switch_get_node_ptr(p_sw->p_osm_sw));
+   p_sw->down_port_groups = 
+      (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *));
+   p_sw->up_port_groups = 
+      (ftree_port_group_t **) malloc(ports_num * sizeof(ftree_port_group_t *));
+   if (!p_sw->down_port_groups || !p_sw->up_port_groups)
+      return NULL;
+   p_sw->down_port_groups_num = 0;
+   p_sw->up_port_groups_num = 0;
+
+   /* initialize lft buffer */
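+   p_sw->lft_buf = (ftree_fwd_tbl_t)cl_pool_get(&p_ftree->sw_fwd_tbl_pool);
+   if (!p_sw->lft_buf)
+   {
+      free(p_sw->down_port_groups);
+      free(p_sw->up_port_groups);
+      free(p_sw);
+      return NULL;
+   }
+   memset(p_sw->lft_buf, OSM_NO_PATH, FTREE_FWD_TBL_LEN);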
+
+   return p_sw;
+} /* __osm_ftree_sw_create() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_sw_destroy(
+   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_sw_t     * p_sw)
+{
+   uint8_t i;
+
+   if (!p_sw)
+      return;
+
+   for (i = 0; i < p_sw->down_port_groups_num; i++)
+      __osm_ftree_port_group_destroy(p_sw->down_port_groups[i]);
+   for (i = 0; i < p_sw->up_port_groups_num; i++)
+      __osm_ftree_port_group_destroy(p_sw->up_port_groups[i]);
+   if (p_sw->down_port_groups)
+      free(p_sw->down_port_groups);
+   if (p_sw->up_port_groups)
+      free(p_sw->up_port_groups);
+
+   /* return switch fwd_tbl to pool */
+   if (p_sw->lft_buf)
+      cl_pool_put(&p_ftree->sw_fwd_tbl_pool, (void *)p_sw->lft_buf);
+
+   free(p_sw);
+} /* __osm_ftree_sw_destroy() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_sw_dump(
+   IN  ftree_sw_t * p_sw)
+{
+   uint32_t i;
+   if (!p_sw)
+      return;
+
+   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+      return;
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,
+           "__osm_ftree_sw_dump: "
+           "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
+          __osm_ftree_tuple_to_str(p_sw->tuple),
+          cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))), 
+          p_sw->down_port_groups_num, 
+          p_sw->up_port_groups_num);
+
+   for( i = 0; i < p_sw->down_port_groups_num; i++ ) 
+      __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN);
+   for( i = 0; i < p_sw->up_port_groups_num; i++ ) 
+      __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP);
+
+} /* __osm_ftree_sw_dump() */
+
+/***************************************************/
+
+static boolean_t
+__osm_ftree_sw_ranked(
+   IN  ftree_sw_t * p_sw)
+{
+   return (p_sw->rank != 0xFF); 
+}
+
+/***************************************************/
+
+static ftree_port_group_t *
+__osm_ftree_sw_get_port_group_by_remote_lid(
+   IN  ftree_sw_t       * p_sw,
+   IN  ib_net16_t         remote_base_lid,
+   IN  ftree_direction_t  direction)
+{
+   uint32_t i;
+   uint32_t size;
+   ftree_port_group_t ** port_groups;
+
+   if (direction == FTREE_DIRECTION_UP)
+   {
+      port_groups = p_sw->up_port_groups;
+      size = p_sw->up_port_groups_num;
+   }
+   else
+   {
+      port_groups = p_sw->down_port_groups;
+      size = p_sw->down_port_groups_num;
+   }
+
+   for (i = 0; i < size; i++)
+      if (remote_base_lid == port_groups[i]->remote_base_lid)
+         return port_groups[i];
+
+   return NULL;
+} /* __osm_ftree_sw_get_port_group_by_remote_lid() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_sw_add_port(
+   IN  ftree_sw_t       * p_sw,
+   IN  uint16_t           port_num,
+   IN  uint16_t           remote_port_num,
+   IN  ib_net16_t         base_lid,
+   IN  uint8_t            lmc,
+   IN  ib_net16_t         remote_base_lid,
+   IN  uint8_t            remote_lmc,
+   IN  ib_net64_t         port_guid,
+   IN  ib_net64_t         remote_port_guid,
+   IN  ib_net64_t         remote_node_guid,
+   IN  uint8_t            remote_node_type,
+   IN  void             * p_remote_hca_or_sw,
+   IN  ftree_direction_t  direction)
+{
+   ftree_port_group_t * p_group = 
+       __osm_ftree_sw_get_port_group_by_remote_lid(p_sw,remote_base_lid,direction);
+
+   if (!p_group)
+   {
+      p_group = __osm_ftree_port_group_create(
+                     base_lid,
+                     lmc,
+                     remote_base_lid,
+                     remote_lmc,
+                     &port_guid,
+                     &remote_port_guid,
+                     &remote_node_guid,
+                     remote_node_type,
+                     p_remote_hca_or_sw);
+      CL_ASSERT(p_group);
+
+      if (direction == FTREE_DIRECTION_UP)
+         p_sw->up_port_groups[p_sw->up_port_groups_num++] = p_group;
+      else
+         p_sw->down_port_groups[p_sw->down_port_groups_num++] = p_group;
+   }
+   __osm_ftree_port_group_add_port(p_group,port_num,remote_port_num);
+} /* __osm_ftree_sw_add_port() */
+
+/***************************************************/
+
+static void
+__osm_ftree_sw_set_fwd_table_block(
+    IN  ftree_sw_t * p_sw,
+    IN  uint16_t     lid_ho, 
+    IN  uint8_t      port_num)
+{
+   p_sw->lft_buf[lid_ho] = port_num;
+}
+
+/***************************************************/
+
+static uint8_t
+__osm_ftree_sw_get_fwd_table_block(
+    IN  ftree_sw_t * p_sw,
+    IN  uint16_t     lid_ho)
+{
+   return p_sw->lft_buf[lid_ho];
+}
+
+/***************************************************
+ **
+ ** ftree_hca_t functions
+ **
+ ***************************************************/
+
+static ftree_hca_t * 
+__osm_ftree_hca_create(
+   IN  osm_node_t * p_osm_node)
+{
+   ftree_hca_t * p_hca = (ftree_hca_t *)malloc(sizeof(ftree_hca_t));
+   if (p_hca == NULL) 
+      return NULL;
+   memset(p_hca,0,sizeof(ftree_hca_t));
+
+   p_hca->p_osm_node = p_osm_node;
+   p_hca->up_port_groups = (ftree_port_group_t **) 
+        malloc(osm_node_get_num_physp(p_hca->p_osm_node) * sizeof (ftree_port_group_t *));
+   if (!p_hca->up_port_groups)
+   {
+      /* don't leak the HCA object itself */
+      free(p_hca);
+      return NULL;
+   }
+   p_hca->up_port_groups_num = 0;
+   return p_hca;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_hca_destroy(
+   IN  ftree_hca_t * p_hca)
+{
+   uint32_t i;
+
+   if (!p_hca)
+      return;
+
+   for (i = 0; i < p_hca->up_port_groups_num; i++)
+      __osm_ftree_port_group_destroy(p_hca->up_port_groups[i]);
+
+   if (p_hca->up_port_groups)
+      free(p_hca->up_port_groups);
+
+   free(p_hca);
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_hca_dump(
+   IN  ftree_hca_t * p_hca)
+{
+   uint32_t i;
+   if (!p_hca)
+      return;
+
+   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+      return;
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,
+           "__osm_ftree_hca_dump: "
+           "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
+          cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), 
+          p_hca->up_port_groups_num);
+
+   for( i = 0; i < p_hca->up_port_groups_num; i++ ) 
+      __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP);
+}
+
+/***************************************************/
+
+static ftree_port_group_t *
+__osm_ftree_hca_get_port_group_by_remote_lid(
+   IN  ftree_hca_t * p_hca,
+   IN  ib_net16_t    remote_base_lid)
+{
+   uint32_t i;
+   for (i = 0; i < p_hca->up_port_groups_num; i++)
+      if (remote_base_lid == p_hca->up_port_groups[i]->remote_base_lid)
+         return p_hca->up_port_groups[i];
+
+   return NULL;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_hca_add_port(
+   IN  ftree_hca_t * p_hca,
+   IN  uint16_t      port_num,
+   IN  uint16_t      remote_port_num,
+   IN  ib_net16_t    base_lid,
+   IN  uint8_t       lmc,
+   IN  ib_net16_t    remote_base_lid,
+   IN  uint8_t       remote_lmc,
+   IN  ib_net64_t    port_guid,
+   IN  ib_net64_t    remote_port_guid,
+   IN  ib_net64_t    remote_node_guid,
+   IN  uint8_t       remote_node_type,
+   IN  void        * p_remote_hca_or_sw)
+{
+   ftree_port_group_t * p_group;
+
+   /* this function is supposed to be called only for adding
+      HCA ports that lead to switches */
+   CL_ASSERT(remote_node_type == IB_NODE_TYPE_SWITCH);
+
+   p_group = __osm_ftree_hca_get_port_group_by_remote_lid(p_hca,remote_base_lid);
+
+   if (!p_group)
+   {
+      p_group = __osm_ftree_port_group_create(
+                     base_lid,
+                     lmc,
+                     remote_base_lid,
+                     remote_lmc,
+                     &port_guid,
+                     &remote_port_guid,
+                     &remote_node_guid,
+                     remote_node_type,
+                     p_remote_hca_or_sw);
+      p_hca->up_port_groups[p_hca->up_port_groups_num++] = p_group;
+   }
+   __osm_ftree_port_group_add_port(p_group, port_num, remote_port_num);
+
+} /* __osm_ftree_hca_add_port() */
+
+/***************************************************
+ **
+ ** ftree_fabric_t functions
+ **
+ ***************************************************/
+
+static ftree_fabric_t * 
+__osm_ftree_fabric_create(void)
+{
+   cl_status_t status;
+   ftree_fabric_t * p_ftree = (ftree_fabric_t *)malloc(sizeof(ftree_fabric_t));
+   if (p_ftree == NULL) 
+      return NULL;
+
+   memset(p_ftree,0,sizeof(ftree_fabric_t));
+
+   cl_qmap_init(&p_ftree->hca_tbl);
+   cl_qmap_init(&p_ftree->sw_tbl);
+   cl_qmap_init(&p_ftree->sw_by_tuple_tbl);
+
+   status = cl_pool_init( &p_ftree->sw_fwd_tbl_pool,
+                          8,                 /* min pool size */
+                          0,                 /* max pool size - unlimited */
+                          8,                 /* grow size */
+                          FTREE_FWD_TBL_LEN, /* object_size */
+                          NULL,              /* object initializer */
+                          NULL,              /* object destructor */
+                          NULL );            /* context */
+   if (status != CL_SUCCESS)
+   {
+      /* don't leak the fabric object if the pool can't be created */
+      free(p_ftree);
+      return NULL;
+   }
+
+   p_ftree->tree_rank = 1;
+   return p_ftree;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_clear(ftree_fabric_t * p_ftree)
+{
+   ftree_hca_t * p_hca;
+   ftree_hca_t * p_next_hca;
+   ftree_sw_t * p_sw;
+   ftree_sw_t * p_next_sw;
+   ftree_sw_tbl_element_t * p_element;
+   ftree_sw_tbl_element_t * p_next_element;
+
+   if (!p_ftree)
+      return;
+
+   /* remove all the elements of hca_tbl */
+
+   p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+   while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
+   {
+      p_hca = p_next_hca;
+      p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item );
+      __osm_ftree_hca_destroy(p_hca);
+   }
+   cl_qmap_remove_all(&p_ftree->hca_tbl);
+
+   /* remove all the elements of sw_tbl */
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
+   {
+      p_sw = p_next_sw;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+      __osm_ftree_sw_destroy(p_ftree,p_sw);
+   }
+   cl_qmap_remove_all(&p_ftree->sw_tbl);
+
+   /* remove all the elements of sw_by_tuple_tbl */
+
+   p_next_element = 
+      (ftree_sw_tbl_element_t *)cl_qmap_head(&p_ftree->sw_by_tuple_tbl);
+   while( p_next_element != 
+          (ftree_sw_tbl_element_t *)cl_qmap_end( &p_ftree->sw_by_tuple_tbl ) )
+   {
+      p_element = p_next_element;
+      p_next_element = 
+         (ftree_sw_tbl_element_t *)cl_qmap_next(&p_element->map_item);
+      __osm_ftree_sw_tbl_element_destroy(p_element);
+   }
+   cl_qmap_remove_all(&p_ftree->sw_by_tuple_tbl);
+
+   /* free the leaf switches array */
+   if (p_ftree->leaf_switches)
+      free(p_ftree->leaf_switches);
+
+   p_ftree->leaf_switches_num = 0;
+   p_ftree->leaf_switches = NULL;
+
+} /* __osm_ftree_fabric_clear() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_destroy(ftree_fabric_t * p_ftree)
+{
+   if (!p_ftree)
+      return;
+   __osm_ftree_fabric_clear(p_ftree);
+   cl_pool_destroy(&p_ftree->sw_fwd_tbl_pool);
+   free(p_ftree);
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_set_rank(ftree_fabric_t * p_ftree, uint16_t rank)
+{
+   if (rank > p_ftree->tree_rank)
+      p_ftree->tree_rank = rank;
+}
+
+/***************************************************/
+
+static uint16_t 
+__osm_ftree_fabric_get_rank(ftree_fabric_t * p_ftree)
+{
+   return p_ftree->tree_rank;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_add_hca(ftree_fabric_t * p_ftree, osm_node_t * p_osm_node)
+{
+   ftree_hca_t * p_hca;
+
+   CL_ASSERT(osm_node_get_type(p_osm_node) == IB_NODE_TYPE_CA);
+
+   /* creation fails on memory allocation failure */
+   p_hca = __osm_ftree_hca_create(p_osm_node);
+   if (!p_hca)
+      return;
+
+   cl_qmap_insert(&p_ftree->hca_tbl,
+                  p_osm_node->node_info.node_guid,
+                  &p_hca->map_item);
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_add_sw(ftree_fabric_t * p_ftree, osm_switch_t * p_osm_sw)
+{
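+   ftree_sw_t * p_sw;
+
+   CL_ASSERT(osm_node_get_type(p_osm_sw->p_node) == IB_NODE_TYPE_SWITCH);
+
+   /* creation returns NULL for a switch that has only port 0,
+      or on memory allocation failure */
+   p_sw = __osm_ftree_sw_create(p_ftree,p_osm_sw);
+   if (!p_sw)
+      return;
+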
+   cl_qmap_insert(&p_ftree->sw_tbl,
+                  p_osm_sw->p_node->node_info.node_guid,
+                  &p_sw->map_item);
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_add_sw_by_tuple(
+   IN  ftree_fabric_t * p_ftree, 
+   IN  ftree_sw_t * p_sw)
+{
+   CL_ASSERT(__osm_ftree_tuple_assigned(p_sw->tuple));
+
+   cl_qmap_insert(&p_ftree->sw_by_tuple_tbl,
+                  __osm_ftree_tuple_to_key(p_sw->tuple),
+                  &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
+}
+
+/***************************************************/
+
+static ftree_sw_t * 
+__osm_ftree_fabric_get_sw_by_tuple(
+   IN  ftree_fabric_t * p_ftree, 
+   IN  ftree_tuple_t tuple)
+{
+   ftree_sw_tbl_element_t * p_element;
+
+   CL_ASSERT(__osm_ftree_tuple_assigned(tuple));
+
+   p_element = (ftree_sw_tbl_element_t * )cl_qmap_get(&p_ftree->sw_by_tuple_tbl,
+                                                      __osm_ftree_tuple_to_key(tuple));
+   if (p_element == (ftree_sw_tbl_element_t * )cl_qmap_end(&p_ftree->sw_by_tuple_tbl))
+      return NULL;
+
+   return p_element->p_sw;
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_dump(ftree_fabric_t * p_ftree)
+{
+   uint32_t i;
+   ftree_hca_t * p_hca;
+   ftree_sw_t * p_sw;
+
+   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+      return;
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+           "                       |-------------------------------|\n"
+           "                       |-  Full fabric topology dump  -|\n"
+           "                       |-------------------------------|\n\n");
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,
+           "__osm_ftree_fabric_dump: -- HCAs:\n");
+
+   for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+         p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
+         p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) )
+   {
+      __osm_ftree_hca_dump(p_hca);
+   }
+
+   for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
+   {
+      osm_log(&osm.log, OSM_LOG_DEBUG,
+              "__osm_ftree_fabric_dump: -- Rank %u switches\n", i);
+      for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+            p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
+            p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
+      {
+         if (p_sw->rank == i)
+            __osm_ftree_sw_dump(p_sw);
+      }
+   }
+
+   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+           "                       |---------------------------------------|\n"
+           "                       |- Full fabric topology dump completed -|\n"
+           "                       |---------------------------------------|\n\n");
+} /* __osm_ftree_fabric_dump() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_dump_general_info(
+   IN  ftree_fabric_t * p_ftree)
+{
+   uint32_t i,j;
+   ftree_sw_t * p_sw;
+   const char * addition_str;
+
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n");
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+           "General fabric topology info\n");
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+           "============================\n");
+
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+           "  - FatTree rank (switches only): %u\n",
+          p_ftree->tree_rank);
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+           "  - Fabric has %u HCAs, %u switches\n",
+          (uint32_t)cl_qmap_count(&p_ftree->hca_tbl),
+          (uint32_t)cl_qmap_count(&p_ftree->sw_tbl));
+
+   for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
+   {
+      j = 0;
+      for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+            p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
+            p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
+      {
+         if (p_sw->rank == i)
+            j++;
+      }
+      if (i == 0)
+         addition_str = " (root) ";
+      else if (i == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+         addition_str = " (leaf) ";
+      else
+         addition_str = " ";
+      osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+              "  - Fabric has %u rank %u%sswitches\n",j,i,addition_str);
+   }
+
+   if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE))
+   {
+      osm_log(&osm.log, OSM_LOG_VERBOSE,
+              "__osm_ftree_fabric_dump_general_info: "
+              "  - Root switches:\n");
+      for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+            p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
+            p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
+      {
+         if (p_sw->rank == 0)
+               osm_log(&osm.log, OSM_LOG_VERBOSE,
+                       "__osm_ftree_fabric_dump_general_info: "
+                       "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
+                       cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+                       cl_ntoh16(p_sw->base_lid),
+                       __osm_ftree_tuple_to_str(p_sw->tuple));
+      }
+
+      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: "
+              "  - Leaf switches (sorted by index):\n");
+      for (i = 0; i < p_ftree->leaf_switches_num; i++)
+      {
+            osm_log(&osm.log, OSM_LOG_VERBOSE,
+                    "__osm_ftree_fabric_dump_general_info: "
+                    "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
+                    cl_ntoh64(osm_node_get_node_guid(
+                                 osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))),
+                    cl_ntoh16(p_ftree->leaf_switches[i]->base_lid),
+                    __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple));
+      }
+   }
+} /* __osm_ftree_fabric_dump_general_info() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_dump_hca_ordering(
+   IN  ftree_fabric_t * p_ftree)
+{  
+   ftree_hca_t        * p_hca;
+   ftree_sw_t         * p_sw;
+   ftree_port_group_t * p_group;
+   uint32_t             i;
+   uint32_t             j;
+
+   char desc[IB_NODE_DESCRIPTION_SIZE + 1];
+   char path[1024];
+   FILE * p_hca_ordering_file;
+   const char * filename = "osm-ftree-ca-order.dump";
+
+   snprintf(path, sizeof(path), "%s/%s", 
+            osm.subn.opt.dump_files_dir, filename);
+   p_hca_ordering_file = fopen(path, "w");
+   if (!p_hca_ordering_file) 
+   {
+      osm_log(&osm.log, OSM_LOG_ERROR,
+              "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: "
+              "cannot open file '%s': %s\n",
+              path, strerror(errno));
+      return;
+   }
+   
+   /* for each leaf switch (in indexing order) */
+   for(i = 0; i < p_ftree->leaf_switches_num; i++)
+   {
+      p_sw = p_ftree->leaf_switches[i];
+      /* for each real HCA connected to this switch */
+      for (j = 0; j < p_sw->down_port_groups_num; j++)
+      {
+         p_group = p_sw->down_port_groups[j];
+         p_hca = p_group->remote_hca_or_sw.remote_hca;
+         memcpy(desc,p_hca->p_osm_node->node_desc.description,IB_NODE_DESCRIPTION_SIZE);
+         desc[IB_NODE_DESCRIPTION_SIZE] = '\0';
+
+         fprintf(p_hca_ordering_file,"0x%x\t%s\n", 
+                 cl_ntoh16(p_group->remote_base_lid), desc);
+      }
+
+      /* now print dummy HCAs */
+      for (j = p_sw->down_port_groups_num; j < p_ftree->max_hcas_per_leaf; j++)
+      {
+         fprintf(p_hca_ordering_file,"0xFFFF\tDUMMY\n");
+      }
+
+   }
+   /* done going through all the leaf switches */
+
+   fclose(p_hca_ordering_file);
+} /* __osm_ftree_fabric_dump_hca_ordering() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_assign_tuple(
+   IN   ftree_fabric_t * p_ftree,
+   IN   ftree_sw_t * p_sw,
+   IN   ftree_tuple_t new_tuple)
+{
+   memcpy(p_sw->tuple, new_tuple, FTREE_TUPLE_LEN);
+   __osm_ftree_fabric_add_sw_by_tuple(p_ftree,p_sw);
+}
+
+/***************************************************/
+
+static void 
+__osm_ftree_fabric_assign_first_tuple(
+   IN   ftree_fabric_t * p_ftree,
+   IN   ftree_sw_t * p_sw)
+{
+   uint8_t i;
+   ftree_tuple_t new_tuple;
+
+   __osm_ftree_tuple_init(new_tuple);
+   new_tuple[0] = p_sw->rank;
+   for (i = 1; i <= p_sw->rank; i++)
+      new_tuple[i] = 0;
+
+   __osm_ftree_fabric_assign_tuple(p_ftree,p_sw,new_tuple);
+}
+
+/***************************************************/
+
+static void
+__osm_ftree_fabric_get_new_tuple(
+   IN   ftree_fabric_t * p_ftree,
+   OUT  ftree_tuple_t new_tuple,
+   IN   ftree_tuple_t from_tuple,
+   IN   ftree_direction_t direction)
+{
+   ftree_sw_t * p_sw;
+   ftree_tuple_t temp_tuple;
+   uint8_t var_index;
+   uint8_t i;
+
+   __osm_ftree_tuple_init(new_tuple);
+   memcpy(temp_tuple, from_tuple, FTREE_TUPLE_LEN);
+
+   if (direction == FTREE_DIRECTION_DOWN)
+   {
+      temp_tuple[0] ++;
+      var_index = from_tuple[0] + 1;
+   }
+   else
+   {
+      temp_tuple[0] --;
+      var_index = from_tuple[0];
+   }
+
+   for (i = 0; i < 0xFF; i++)
+   {
+      temp_tuple[var_index] = i;
+      p_sw = __osm_ftree_fabric_get_sw_by_tuple(p_ftree,temp_tuple);
+      if (p_sw == NULL) /* found free tuple */ 
+         break;
+   }
+
+   if (i == 0xFF)
+   {
+      /* new tuple not found - there are more than 255 ports in one direction */
+      return;
+   }
+   memcpy(new_tuple, temp_tuple, FTREE_TUPLE_LEN);
+
+} /* __osm_ftree_fabric_get_new_tuple() */
+
+/***************************************************/
+
+static void
+__osm_ftree_fabric_calculate_rank(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t   * p_sw;
+   ftree_sw_t   * p_next_sw;
+   uint16_t       max_rank = 0;
+
+   /* go over all the switches and find maximal switch rank */
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
+   {
+      p_sw = p_next_sw;
+      if(p_sw->rank > max_rank)
+         max_rank = p_sw->rank;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+   }
+
+   /* set FatTree rank */
+   __osm_ftree_fabric_set_rank(p_ftree, max_rank + 1);
+}
+
+/***************************************************/
+
+static void
+__osm_ftree_fabric_make_indexing(
+   IN   ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t         * p_remote_sw;
+   ftree_sw_t         * p_sw;
+   ftree_sw_t         * p_next_sw;
+   ftree_tuple_t        new_tuple;
+   uint32_t             i;
+   cl_list_t            bfs_list;
+   ftree_sw_tbl_element_t * p_sw_tbl_element;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing);
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
+           "Starting FatTree indexing\n");
+
+   /* create array of leaf switches */
+   p_ftree->leaf_switches = (ftree_sw_t **)
+         malloc(cl_qmap_count(&p_ftree->sw_tbl) * sizeof(ftree_sw_t *));
+
+   /* Looking for a leaf switch - the one that has rank equal to (tree_rank - 1).
+      This switch will be used as a starting point for indexing algorithm. */
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
+   {
+      p_sw = p_next_sw;
+      if(p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+         break;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+   }
+
+   CL_ASSERT(p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
+
+   /* Assign the first tuple to the switch that is used as BFS starting point.
+      The tuple will be as follows: [rank].0.0.0...
+      This function also adds the switch to the switch_by_tuple table. */
+   __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw);
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
+           "Indexing starting point:\n"
+           "                                            - Switch rank  : %u\n"
+           "                                            - Switch index : %s\n"
+           "                                            - Node LID     : 0x%x\n"
+           "                                            - Node GUID    : 0x%016" PRIx64 "\n",
+           p_sw->rank,
+           __osm_ftree_tuple_to_str(p_sw->tuple),
+           cl_ntoh16(p_sw->base_lid),
+           cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))));
+
+   /* 
+    * Now run BFS and assign indexes to all switches
+    * Pseudo code of the algorithm is as follows:
+    *
+    *  * Add first switch to BFS queue
+    *  * While (BFS queue not empty)
+    *      - Pop the switch from the head of the queue
+    *      - Scan all the downward and upward ports
+    *      - For each port
+    *          + Get the remote switch
+    *          + Assign index to the remote switch
+    *          + Add remote switch to the BFS queue
+    */
+
+   cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
+   cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_sw)->map_item);
+
+   while (!cl_is_list_empty(&bfs_list))
+   {
+      p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list);
+      p_sw = p_sw_tbl_element->p_sw;
+      __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
+
+      /* Discover all the nodes from ports that are pointing down */
+
+      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      {
+         /* add switch to leaf switches array */
+         p_ftree->leaf_switches[p_ftree->leaf_switches_num++] = p_sw;
+         /* update the max_hcas_per_leaf value */
+         if (p_sw->down_port_groups_num > p_ftree->max_hcas_per_leaf)
+            p_ftree->max_hcas_per_leaf = p_sw->down_port_groups_num;
+      }
+      else
+      {
+         /* This is not a leaf switch, which means that all the
+            ports that point down lead to other switches.
+            No need to assign indexes to HCAs */
+         for( i = 0; i < p_sw->down_port_groups_num; i++ ) 
+         {
+            p_remote_sw = p_sw->down_port_groups[i]->remote_hca_or_sw.remote_sw;
+            if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
+            {
+               /* this switch has been already indexed */
+               continue;
+            }
+            /* allocate new tuple */
+            __osm_ftree_fabric_get_new_tuple(p_ftree,
+                                             new_tuple,
+                                             p_sw->tuple,
+                                             FTREE_DIRECTION_DOWN);
+            /* Assign the new tuple to the remote switch.
+               This function also adds the switch to the switch_by_tuple table. */
+            __osm_ftree_fabric_assign_tuple(p_ftree,
+                                            p_remote_sw,
+                                            new_tuple);
+
+            /* add the newly discovered switch to the BFS queue */
+            cl_list_insert_tail(&bfs_list, 
+                                &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
+         }
+         /* Done assigning indexes to all the remote switches
+            that the downward ports point to.
+            Now sort port groups according to remote index. */
+         qsort(p_sw->down_port_groups,                      /* array */
+               p_sw->down_port_groups_num,                  /* number of elements */
+               sizeof(ftree_port_group_t *),                /* size of each element */
+               __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */
+      }
+
+      /* Done indexing switches from ports that go down.
+         Now do the same with ports that are pointing up. */
+
+      if (p_sw->rank != 0)
+      {
+         /* This is not a root switch, which means that all the ports
+            that point up lead to other switches. */
+         for( i = 0; i < p_sw->up_port_groups_num; i++ ) 
+         {
+            p_remote_sw = p_sw->up_port_groups[i]->remote_hca_or_sw.remote_sw;
+            if (__osm_ftree_tuple_assigned(p_remote_sw->tuple))
+               continue;
+            /* allocate new tuple */
+            __osm_ftree_fabric_get_new_tuple(p_ftree,
+                                             new_tuple,
+                                             p_sw->tuple,
+                                             FTREE_DIRECTION_UP);
+            /* Assign the new tuple to the remote switch.
+               This function also adds the switch to the
+               switch_by_tuple table. */
+            __osm_ftree_fabric_assign_tuple(p_ftree,
+                                            p_remote_sw,
+                                            new_tuple);
+            /* add the newly discovered switch to the BFS queue */
+            cl_list_insert_tail(&bfs_list, 
+                                &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
+         }
+         /* Done assigning indexes to all the remote switches
+            that the upward ports point to.
+            Now sort port groups according to remote index. */
+         qsort(p_sw->up_port_groups,                        /* array */
+               p_sw->up_port_groups_num,                    /* number of elements */
+               sizeof(ftree_port_group_t *),                /* size of each element */
+               __osm_ftree_compare_port_groups_by_remote_switch_index); /* comparator */
+      }
+      /* Done assigning indexes to all the switches that are directly connected 
+         to the current switch - go to the next switch in the BFS queue */
+   }
+
+   /* release the storage allocated by cl_list_init() for the BFS queue */
+   cl_list_destroy(&bfs_list);
+
+   /* sort array of leaf switches by index */
+   qsort(p_ftree->leaf_switches,     /* array */
+         p_ftree->leaf_switches_num, /* number of elements */
+         sizeof(ftree_sw_t *),       /* size of each element */
+         __osm_ftree_compare_switches_by_index); /* comparator */
+
+   OSM_LOG_EXIT(&(osm.log));
+} /* __osm_ftree_fabric_make_indexing() */
+
+/***************************************************/
+
+static boolean_t
+__osm_ftree_fabric_validate_topology(
+   IN   ftree_fabric_t * p_ftree)
+{
+   ftree_port_group_t * p_group;
+   ftree_port_group_t * p_ref_group;
+   ftree_sw_t         * p_sw;
+   ftree_sw_t         * p_next_sw;
+   ftree_sw_t        ** reference_sw_arr;
+   uint16_t             tree_rank = __osm_ftree_fabric_get_rank(p_ftree);
+   boolean_t            res = TRUE;
+   uint8_t              i;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology);
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: "
+           "Validating fabric topology\n");
+
+   reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *));
+   if ( reference_sw_arr == NULL )
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n");
+      OSM_LOG_EXIT(&(osm.log));
+      return FALSE;
+   }
+   memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *));
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( res && 
+          p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
+   {
+      p_sw = p_next_sw;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+
+      if (!reference_sw_arr[p_sw->rank])
+      {
+         /* This is the first switch in the current level that 
+            we're checking - use it as a reference */
+         reference_sw_arr[p_sw->rank] = p_sw;
+      }
+      else
+      {
+         /* compare this switch's properties to those of the reference switch */
+
+         if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num )
+         {
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                    "ERR AB09: Different number of upward port groups on switches:\n"
+                    "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n"
+                    "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n",
+                    cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))),
+                    cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
+                    __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
+                    reference_sw_arr[p_sw->rank]->up_port_groups_num,
+                    cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+                    cl_ntoh16(p_sw->base_lid),
+                    __osm_ftree_tuple_to_str(p_sw->tuple),
+                    p_sw->up_port_groups_num);
+            res = FALSE;
+            break;
+         }
+
+         if ( p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1) &&
+              reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num )
+         {
+            /* leaf switches are exempt from this check: some HCAs may be missing */
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                    "ERR AB0A: Different number of downward port groups on switches:\n"
+                    "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n"
+                    "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n",
+                    cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))),
+                    cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
+                    __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
+                    reference_sw_arr[p_sw->rank]->down_port_groups_num,
+                    cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+                    cl_ntoh16(p_sw->base_lid),
+                    __osm_ftree_tuple_to_str(p_sw->tuple),
+                    p_sw->down_port_groups_num);
+            res = FALSE;
+            break;
+         }
+
+         if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != 0 )
+         {
+            p_ref_group = reference_sw_arr[p_sw->rank]->up_port_groups[0];
+            for (i = 0; i < p_sw->up_port_groups_num; i++)
+            {
+                p_group = p_sw->up_port_groups[i];
+                if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
+                {
+                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                           "ERR AB0B: Different number of ports in an upward port group on switches:\n"
+                           "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
+                           "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
+                           cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))),
+                           cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
+                           __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
+                           cl_ptr_vector_get_size(&p_ref_group->ports),
+                           cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+                           cl_ntoh16(p_sw->base_lid),
+                           __osm_ftree_tuple_to_str(p_sw->tuple),
+                           cl_ptr_vector_get_size(&p_group->ports));
+                   res = FALSE;
+                   break;
+                }
+            }
+         }
+         if ( reference_sw_arr[p_sw->rank]->down_port_groups_num != 0 &&
+              p_sw->rank != (tree_rank - 1) )
+         {
+            /* leaf switches are exempt from this check: some HCAs may be missing */
+            p_ref_group = reference_sw_arr[p_sw->rank]->down_port_groups[0];
+            for (i = 0; i < p_sw->down_port_groups_num; i++)
+            {
+                p_group = p_sw->down_port_groups[i];
+                if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
+                {
+                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                           "ERR AB0C: Different number of ports in a downward port group on switches:\n"
+                           "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
+                           "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
+                           cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(reference_sw_arr[p_sw->rank]->p_osm_sw))),
+                           cl_ntoh16(reference_sw_arr[p_sw->rank]->base_lid),
+                           __osm_ftree_tuple_to_str(reference_sw_arr[p_sw->rank]->tuple),
+                           cl_ptr_vector_get_size(&p_ref_group->ports),
+                           cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+                           cl_ntoh16(p_sw->base_lid),
+                           __osm_ftree_tuple_to_str(p_sw->tuple),
+                           cl_ptr_vector_get_size(&p_group->ports));
+                   res = FALSE;
+                   break;
+                }
+            }
+         }
+      } /* end of else */
+   } /* end of while */
+
+   if (res == TRUE)
+      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: "
+                    "Fabric topology has been identified as FatTree\n");
+   else
+      osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                    "ERR AB0D: Fabric topology hasn't been identified as FatTree\n");
+
+   free(reference_sw_arr);
+   OSM_LOG_EXIT(&(osm.log));
+   return res;
+} /* __osm_ftree_fabric_validate_topology() */
+
+/***************************************************
+ ***************************************************/
+
+static void
+__osm_ftree_set_sw_fwd_table(
+   IN  cl_map_item_t* const p_map_item, 
+   IN  void *context)
+{
+   ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item;
+   memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN);
+   osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw);
+}
+
+/***************************************************
+ ***************************************************/
+
+/*  
+ * Function: assign-up-going-port-by-descending-down
+ * Given   : a switch and a LID
+ * Pseudo code: 
+ *    foreach down-going-port-group (in indexing order)
+ *        skip this group if the LFT(LID) port is part of this group
+ *        find the least loaded port of the group (scan in indexing order)
+ *        r-port is the remote port connected to it
+ *        assign the remote switch node LFT(LID) to r-port
+ *        increase r-port usage counter
+ *        assign-up-going-port-by-descending-down to r-port node (recursion)
+ */
+
+static void
+__osm_ftree_fabric_route_upgoing_by_going_down(
+   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_sw_t     * p_sw,
+   IN  ftree_sw_t     * p_prev_sw,
+   IN  ib_net16_t       target_lid,
+   IN  boolean_t        is_real_lid,
+   IN  boolean_t        is_main_path)
+{
+   ftree_sw_t          * p_remote_sw;
+   uint16_t              ports_num;
+   ftree_port_group_t  * p_group;
+   ftree_port_t        * p_port;
+   ftree_port_t        * p_min_port;
+   uint16_t              i;
+   uint16_t              j;
+
+   /* we shouldn't enter here if both real_lid and main_path are false */
+   CL_ASSERT(is_real_lid || is_main_path);
+
+   /* we can't be here for a leaf switch */
+   CL_ASSERT(p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1));
+
+   /* nothing to do if there are no down-going ports */
+   if (p_sw->down_port_groups_num == 0) 
+       return;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down);
+
+   /* foreach down-going port group (in indexing order) */
+   for (i = 0; i < p_sw->down_port_groups_num; i++)
+   {
+      p_group = p_sw->down_port_groups[i];
+
+      if ( p_prev_sw && (p_group->remote_base_lid == p_prev_sw->base_lid) ) 
+      {
+         /* This port group has a port that was used when we entered this switch,
+            which means that the current group points to the switch where we were
+            at the previous step of the algorithm (before going up).
+            Skipping this group. */
+         continue;
+      }
+
+      /* find the least loaded port of the group (in indexing order) */
+      p_min_port = NULL;
+      ports_num = cl_ptr_vector_get_size(&p_group->ports);
+      /* ToDo: no need to select a least loaded port for non-main path.
+         Think about optimization. */
+      for (j = 0; j < ports_num; j++) 
+      {
+          cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port);
+          if (!p_min_port)
+          {
+             /* first port that we're checking - set as port with the lowest load */
+             p_min_port = p_port;
+          }
+          else if (p_port->counter_up < p_min_port->counter_up)
+          {
+             /* this port is less loaded - use it as min */
+             p_min_port = p_port;
+          }
+      }
+      /* At this point we have selected a port in this group with the 
+         lowest load of upgoing routes.
+         Set on the remote switch how to get to the target_lid -
+         set LFT(target_lid) on the remote switch to the remote port */
+      p_remote_sw = p_group->remote_hca_or_sw.remote_sw;
+
+      /* Four possible cases:
+       *
+       *  1. is_real_lid == TRUE && is_main_path == TRUE: 
+       *      - going DOWN(TRUE,TRUE) through ALL the groups
+       *         + promoting port counter
+       *         + setting path in remote switch fwd tbl
+       *      
+       *  2. is_real_lid == TRUE && is_main_path == FALSE: 
+       *      - going DOWN(TRUE,FALSE) through ALL the groups but only if
+       *        the remote (upper) switch hasn't been already configured 
+       *        for this target LID
+       *         + NOT promoting port counter
+       *         + setting path in remote switch fwd tbl if it hasn't been set yet
+       *
+       *  3. is_real_lid == FALSE && is_main_path == TRUE: 
+       *      - going DOWN(FALSE,TRUE) through ALL the groups
+       *         + promoting port counter
+       *         + NOT setting path in remote switch fwd tbl
+       *
+       *  4. is_real_lid == FALSE && is_main_path == FALSE: 
+       *      - illegal state - we shouldn't get here
+       */
+
+      /* second case: skip the port group if the remote (upper)
+         switch has been already configured for this target LID */
+      if ( is_real_lid && !is_main_path &&
+           __osm_ftree_sw_get_fwd_table_block(p_remote_sw,
+                                              cl_ntoh16(target_lid)) != OSM_NO_PATH )
+         continue;
+
+      /* setting fwd tbl port only if this is real LID */
+      if (is_real_lid)
+      {
+         __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
+                                            cl_ntoh16(target_lid),
+                                            p_min_port->remote_port_num);
+         osm_log(&osm.log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_upgoing_by_going_down: "
+                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 __osm_ftree_tuple_to_str(p_remote_sw->tuple),
+                 cl_ntoh16(target_lid),
+                 p_min_port->remote_port_num);
+      }
+   
+      /* The number of upgoing routes is tracked in the 
+         p_port->counter_up counter of the port that belongs to
+         the upper side of the link (on switch with lower rank).
+         Counter is promoted only if we're routing LID on the main
+         path (whether it's a real LID or a dummy one). */
+      if (is_main_path)
+         p_min_port->counter_up++;
+
+      /* Recursion step:
+         Assign upgoing ports by stepping down, starting on REMOTE switch.
+         Recursion stop condition - if the REMOTE switch is a leaf switch. */
+      if (p_remote_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      {
+         __osm_ftree_fabric_route_upgoing_by_going_down(
+               p_ftree,
+               p_remote_sw,   /* remote switch - used as a route-upgoing alg. start point */
+               NULL,          /* prev. position - NULL to mark that we went down and not up */
+               target_lid,    /* LID that we're routing to */
+               is_real_lid,   /* whether the target LID is real or dummy */
+               is_main_path); /* whether this is a path to an HCA that should be tracked by counters */
+      }
+   }
+   /* done scanning all the down-going port groups */
+
+   OSM_LOG_EXIT(&(osm.log));
+} /* __osm_ftree_fabric_route_upgoing_by_going_down() */
+
+/***************************************************/
+
+/*  
+ * Function: assign-down-going-port-by-descending-up
+ * Given   : a switch and a LID
+ * Pseudo code: 
+ *    find the least loaded port of all the upgoing groups (scan in indexing order)
+ *    assign the LFT(LID) of remote switch to that port
+ *    track that port usage
+ *    assign-up-going-port-by-descending-down on CURRENT switch
+ *    assign-down-going-port-by-descending-up on REMOTE switch (recursion)
+ */
+
+static void
+__osm_ftree_fabric_route_downgoing_by_going_up(
+   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_sw_t     * p_sw,
+   IN  ftree_sw_t     * p_prev_sw,
+   IN  ib_net16_t       target_lid,
+   IN  boolean_t        is_real_lid,
+   IN  boolean_t        is_main_path)
+{
+   ftree_sw_t          * p_remote_sw;
+   uint16_t              ports_num;
+   ftree_port_group_t  * p_group;
+   ftree_port_t        * p_port;
+   ftree_port_group_t  * p_min_group;
+   ftree_port_t        * p_min_port;
+   uint16_t              i;
+   uint16_t              j;
+
+   /* we shouldn't enter here if both real_lid and main_path are false */
+   CL_ASSERT(is_real_lid || is_main_path);
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up);
+
+   /* If this switch isn't a leaf switch:
+      Assign upgoing ports by stepping down, starting on THIS switch. */
+   if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+   {
+      __osm_ftree_fabric_route_upgoing_by_going_down(
+         p_ftree,
+         p_sw,          /* local switch - used as a route-upgoing alg. start point */
+         p_prev_sw,     /* switch that we went up from (NULL means that we went down) */
+         target_lid,    /* LID that we're routing to */
+         is_real_lid,   /* whether this target LID is real or dummy */
+         is_main_path); /* whether this path to HCA should be tracked by counters */
+   }
+
+   /* recursion stop condition - this is a root switch */
+   if (p_sw->rank == 0)
+   {
+      OSM_LOG_EXIT(&(osm.log));
+      return;
+   }
+
+   /* Find the least loaded port of all the upgoing port groups
+      (in indexing order of the remote switches). */
+   p_min_group = NULL;
+   p_min_port = NULL;
+   for (i = 0; i < p_sw->up_port_groups_num; i++)
+   {
+      p_group = p_sw->up_port_groups[i];
+
+      ports_num = cl_ptr_vector_get_size(&p_group->ports);
+      for (j = 0; j < ports_num; j++)
+      {
+         cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port);
+         if (!p_min_group)
+         {
+            /* first port that we're checking - use
+               it as a port with the lowest load */
+            p_min_group = p_group;
+            p_min_port = p_port;
+         }
+         else
+         { 
+            if ( p_port->counter_down < p_min_port->counter_down  )
+            {
+               /* this port is less loaded - use it as min */
+               p_min_group = p_group;
+               p_min_port = p_port;
+            }
+         }
+      }
+   }
+
+   /* At this point we have selected a group and port with the 
+      lowest load of downgoing routes.
+      Set on the remote switch how to get to the target_lid -
+      set LFT(target_lid) on the remote switch to the remote port */
+   p_remote_sw = p_min_group->remote_hca_or_sw.remote_sw;
+
+   /* Four possible cases:
+    *
+    *  1. is_real_lid == TRUE && is_main_path == TRUE: 
+    *      - going UP(TRUE,TRUE) on selected min_group and min_port
+    *         + promoting port counter
+    *         + setting path in remote switch fwd tbl
+    *      - going UP(TRUE,FALSE) on rest of the groups, each time on port 0
+    *         + NOT promoting port counter
+    *         + setting path in remote switch fwd tbl if it hasn't been set yet
+    *      
+    *  2. is_real_lid == TRUE && is_main_path == FALSE: 
+    *      - going UP(TRUE,FALSE) on ALL the groups, each time on port 0,
+    *        but only if the remote (upper) switch hasn't been already 
+    *        configured for this target LID
+    *         + NOT promoting port counter
+    *         + setting path in remote switch fwd tbl if it hasn't been set yet
+    *
+    *  3. is_real_lid == FALSE && is_main_path == TRUE: 
+    *      - going UP(FALSE,TRUE) ONLY on selected min_group and min_port
+    *         + promoting port counter
+    *         + NOT setting path in remote switch fwd tbl
+    *
+    *  4. is_real_lid == FALSE && is_main_path == FALSE: 
+    *      - illegal state - we shouldn't get here
+    */
+
+   /* covering first half of case 1, and case 3 */
+   if (is_main_path)
+   {
+      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      {
+         osm_log(&osm.log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_downgoing_by_going_up: "
+                 " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n",
+                 (is_real_lid)? "real" : "DUMMY",
+                 cl_ntoh16(target_lid),
+                 __osm_ftree_tuple_to_str(p_sw->tuple),
+                 __osm_ftree_tuple_to_str(p_remote_sw->tuple));
+      }
+      /* The number of downgoing routes is tracked in the 
+         p_port->counter_down counter of the port that belongs to
+         the lower side of the link (on switch with higher rank) */
+      p_min_port->counter_down++;
+      if (is_real_lid)
+      {
+         __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
+                                            cl_ntoh16(target_lid),
+                                            p_min_port->remote_port_num);
+         p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num;
+         osm_log(&osm.log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_downgoing_by_going_up: "
+                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 __osm_ftree_tuple_to_str(p_remote_sw->tuple),
+                 cl_ntoh16(target_lid),p_min_port->remote_port_num);
+      }
+
+      /* Recursion step: 
+         Assign downgoing ports by stepping up, starting on REMOTE switch. */
+      __osm_ftree_fabric_route_downgoing_by_going_up(
+            p_ftree,
+            p_remote_sw,    /* remote switch - used as a route-downgoing alg. next step point */
+            p_sw,           /* this switch - prev. position switch for the function */
+            target_lid,     /* LID that we're routing to */
+            is_real_lid,    /* whether this target LID is real or dummy */
+            is_main_path);  /* whether this is a path to an HCA that should be tracked by counters */
+   }
+
+   /* we're done for the third case */
+   if (!is_real_lid)
+   {
+      OSM_LOG_EXIT(&(osm.log));
+      return;
+   }
+
+   /* What's left to do at this point:
+    *
+    *  1. is_real_lid == TRUE && is_main_path == TRUE: 
+    *      - going UP(TRUE,FALSE) on rest of the groups, each time on port 0, 
+    *        but only if the remote (upper) switch hasn't been already 
+    *        configured for this target LID
+    *         + NOT promoting port counter
+    *         + setting path in remote switch fwd tbl if it hasn't been set yet
+    *      
+    *  2. is_real_lid == TRUE && is_main_path == FALSE: 
+    *      - going UP(TRUE,FALSE) on ALL the groups, each time on port 0,
+    *        but only if the remote (upper) switch hasn't been already 
+    *        configured for this target LID
+    *         + NOT promoting port counter
+    *         + setting path in remote switch fwd tbl if it hasn't been set yet
+    *
+    *  These two rules can be rephrased this way:
+    *   - foreach UP port group
+    *      + if remote switch has been set with the target LID
+    *         - skip this port group
+    *      + else
+    *         - select port 0
+    *         - do NOT promote port counter
+    *         - set path in remote switch fwd tbl
+    *         - go UP(TRUE,FALSE) to the remote switch
+    */
+
+   for (i = 0; i < p_sw->up_port_groups_num; i++)
+   {
+      p_group = p_sw->up_port_groups[i];
+      p_remote_sw = p_group->remote_hca_or_sw.remote_sw;
+
+      /* skip if target lid has been already set on remote switch fwd tbl */
+      if (__osm_ftree_sw_get_fwd_table_block(
+                  p_remote_sw,cl_ntoh16(target_lid)) != OSM_NO_PATH)
+         continue;
+
+      if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
+      {
+         osm_log(&osm.log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_downgoing_by_going_up: "
+                 " - Routing SECONDARY path for LID 0x%x: %s --> %s\n",
+                cl_ntoh16(target_lid),
+                __osm_ftree_tuple_to_str(p_sw->tuple),
+                __osm_ftree_tuple_to_str(p_remote_sw->tuple));
+      }
+    
+      cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port);
+      __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
+                                         cl_ntoh16(target_lid),
+                                         p_port->remote_port_num);
+      /* Recursion step: 
+         Assign downgoing ports by stepping up, starting on REMOTE switch. */
+      __osm_ftree_fabric_route_downgoing_by_going_up(
+            p_ftree,
+            p_remote_sw, /* remote switch - used as a route-downgoing alg. next step point */
+            p_sw,        /* this switch - prev. position switch for the function */
+            target_lid,  /* LID that we're routing to */
+            TRUE,        /* whether the target LID is real or dummy */
+            FALSE);      /* whether this is a path to an HCA that should be tracked by counters */
+   }
+
+   OSM_LOG_EXIT(&(osm.log));
+} /* __osm_ftree_fabric_route_downgoing_by_going_up() */
+
+/***************************************************/
+
+/*  
+ * Pseudo code: 
+ *    foreach leaf switch (in indexing order)
+ *       for each compute node (in indexing order)
+ *          obtain the LID of the compute node
+ *          set local LFT(LID) of the port connecting to compute node
+ *          call assign-down-going-port-by-descending-up(TRUE,TRUE) on CURRENT switch
+ *       for each MISSING compute node
+ *          call assign-down-going-port-by-descending-up(FALSE,TRUE) on CURRENT switch
+ */
+
+static void
+__osm_ftree_fabric_route_to_hcas(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t         * p_sw;
+   ftree_port_group_t * p_group;
+   ftree_port_t       * p_port;
+   uint32_t             i;
+   uint32_t             j;
+   ib_net16_t           remote_lid;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas);
+
+   /* for each leaf switch (in indexing order) */
+   for(i = 0; i < p_ftree->leaf_switches_num; i++)
+   {
+      p_sw = p_ftree->leaf_switches[i];
+
+      /* for each HCA connected to this switch */
+      for (j = 0; j < p_sw->down_port_groups_num; j++)
+      {
+         /* obtain the LID of HCA port */
+         p_group = p_sw->down_port_groups[j];
+         remote_lid = p_group->remote_base_lid;
+
+         /* set local LFT(LID) to the port that is connected to HCA */
+         cl_ptr_vector_at(&p_group->ports, 0, (void **)&p_port);
+         __osm_ftree_sw_set_fwd_table_block(p_sw,
+                                            cl_ntoh16(remote_lid),
+                                            p_port->port_num);
+         osm_log(&osm.log, OSM_LOG_DEBUG,
+                 "__osm_ftree_fabric_route_to_hcas: "
+                 "Switch %s: set path to HCA LID 0x%x through port %u\n",
+                 __osm_ftree_tuple_to_str(p_sw->tuple),
+                 cl_ntoh16(remote_lid),
+                 p_port->port_num);
+
+         /* assign downgoing ports by stepping up */
+         __osm_ftree_fabric_route_downgoing_by_going_up(
+               p_ftree,
+               p_sw,       /* local switch - used as a route-downgoing alg. start point */
+               NULL,       /* prev. position switch */
+               remote_lid, /* LID that we're routing to */
+               TRUE,       /* whether this HCA LID is real or dummy */
+               TRUE);      /* whether this path to HCA should be tracked by counters */
+      }
+
+      /* We're done with the real HCAs. Now route the dummy HCAs that are missing.*/
+
+      if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
+      {
+         osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
+                 "Routing %u dummy HCAs\n",
+                 p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
+         for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++)
+         {
+            /* assign downgoing ports by stepping up */
+            __osm_ftree_fabric_route_downgoing_by_going_up(
+                  p_ftree,
+                  p_sw,    /* local switch - used as a route-downgoing alg. start point */
+                  NULL,    /* prev. position switch */
+                  0,       /* LID that we're routing to - ignored for dummy HCA */
+                  FALSE,   /* whether this HCA LID is real or dummy */
+                  TRUE);   /* whether this path to HCA should be tracked by counters */
+         }
+      }
+   }
+   /* done going through all the leaf switches */
+   OSM_LOG_EXIT(&(osm.log));
+} /* __osm_ftree_fabric_route_to_hcas() */
+
+/***************************************************/
+
+/*  
+ * Pseudo code: 
+ *    foreach switch in fabric
+ *       obtain its LID
+ *       set local LFT(LID) to port 0
+ *       call assign-down-going-port-by-descending-up(TRUE,FALSE) on CURRENT switch
+ *
+ * Routing to switch is similar to routing a REAL hca lid on SECONDARY path:
+ *   - we should set fwd tables
+ *   - we should NOT update port counters
+ */
+
+static void
+__osm_ftree_fabric_route_to_switches(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_sw_t         * p_sw;
+   ftree_sw_t         * p_next_sw;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches);
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
+   {
+      p_sw = p_next_sw;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+
+      /* set local LFT(LID) to port 0 (route to itself) */
+      __osm_ftree_sw_set_fwd_table_block(p_sw,
+                                         cl_ntoh16(p_sw->base_lid),
+                                         0);
+
+      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: "
+              "Switch %s (LID 0x%x): routing switch-to-switch paths\n",
+              __osm_ftree_tuple_to_str(p_sw->tuple),
+              cl_ntoh16(p_sw->base_lid));
+
+      __osm_ftree_fabric_route_downgoing_by_going_up(
+            p_ftree,
+            p_sw,           /* local switch - used as a route-downgoing alg. start point */
+            NULL,           /* prev. position switch */
+            p_sw->base_lid, /* LID that we're routing to */
+            TRUE,           /* whether the target LID is real or dummy */
+            FALSE);         /* whether this path should be tracked by counters */
+   }
+
+   OSM_LOG_EXIT(&(osm.log));
+} /* __osm_ftree_fabric_route_to_switches() */
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_fabric_populate_switches(
+   IN  ftree_fabric_t * p_ftree)
+{
+   osm_switch_t * p_osm_sw;
+   osm_switch_t * p_next_osm_sw;
+   osm_opensm_t * p_osm = &osm;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches);
+
+   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl);
+   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) )
+   {
+      p_osm_sw = p_next_osm_sw;
+      p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item );
+      __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw);
+   }
+   OSM_LOG_EXIT(&(osm.log));
+   return 0;
+} /* __osm_ftree_fabric_populate_switches() */
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_fabric_populate_hcas(
+   IN  ftree_fabric_t * p_ftree)
+{
+   osm_node_t   * p_osm_node;
+   osm_node_t   * p_next_osm_node;
+   osm_opensm_t * p_osm = &osm;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas);
+
+   p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl);
+   while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) )
+   {
+      p_osm_node = p_next_osm_node;
+      p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item);
+      switch (osm_node_get_type(p_osm_node))
+      {
+         case IB_NODE_TYPE_CA:
+            __osm_ftree_fabric_add_hca(p_ftree,p_osm_node);
+            break;
+         case IB_NODE_TYPE_ROUTER:
+            break;
+         case IB_NODE_TYPE_SWITCH:
+            /* all the switches are added separately */
+            break;
+         default:
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: "
+                    "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
+                    cl_ntoh64(osm_node_get_node_guid(p_osm_node)),
+                    ib_get_node_type_str(osm_node_get_type(p_osm_node)));
+            OSM_LOG_EXIT(&(osm.log));
+            return -1;
+      }
+   }
+
+   OSM_LOG_EXIT(&(osm.log));
+   return 0;
+} /* __osm_ftree_fabric_populate_hcas() */
+
+/***************************************************
+ ***************************************************/
+
+static void
+__osm_ftree_rank_from_switch(
+   IN  ftree_fabric_t * p_ftree, 
+   IN  ftree_sw_t *     p_starting_sw)
+{
+   ftree_sw_t   * p_sw;
+   ftree_sw_t   * p_remote_sw;
+   osm_node_t   * p_node;
+   osm_node_t   * p_remote_node;
+   osm_physp_t  * p_osm_port;
+   uint16_t       i;
+   cl_list_t      bfs_list;
+   ftree_sw_tbl_element_t * p_sw_tbl_element = NULL;
+
+   p_starting_sw->rank = 0;
+
+   /* Run BFS scan of the tree, starting from this switch */
+
+   cl_list_init(&bfs_list, cl_qmap_count(&p_ftree->sw_tbl));
+   cl_list_insert_tail(&bfs_list, &__osm_ftree_sw_tbl_element_create(p_starting_sw)->map_item);
+
+   while (!cl_is_list_empty(&bfs_list))
+   {
+      p_sw_tbl_element = (ftree_sw_tbl_element_t *)cl_list_remove_head(&bfs_list);
+      p_sw = p_sw_tbl_element->p_sw;
+      __osm_ftree_sw_tbl_element_destroy(p_sw_tbl_element);
+
+      p_node = osm_switch_get_node_ptr(p_sw->p_osm_sw);
+
+      /* note: skipping port 0 on switches */
+      for (i = 1; i < osm_node_get_num_physp(p_node); i++)
+      {
+         p_osm_port = osm_node_get_physp_ptr(p_node,i);
+         if (!osm_physp_is_valid(p_osm_port)) 
+            continue;
+         if (!osm_link_is_healthy(p_osm_port)) 
+            continue;
+
+         p_remote_node = osm_node_get_remote_node(p_node,i,NULL);
+         if (!p_remote_node)
+            continue;
+         if (osm_node_get_type(p_remote_node) != IB_NODE_TYPE_SWITCH)
+            continue;
+
+         p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,
+                                                 osm_node_get_node_guid(p_remote_node));
+         if (p_remote_sw == (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl))
+         {
+            /* remote switch is not in the FatTree switch table */
+            continue;
+         }
+         if (__osm_ftree_sw_ranked(p_remote_sw) && p_remote_sw->rank <= (p_sw->rank + 1))
+            continue;
+
+         /* rank the remote switch and add it to the BFS list */
+         p_remote_sw->rank = p_sw->rank + 1;
+         cl_list_insert_tail(&bfs_list, 
+                             &__osm_ftree_sw_tbl_element_create(p_remote_sw)->map_item);
+      }
+   }
+} /* __osm_ftree_rank_from_switch() */
+
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_rank_switches_from_hca(
+   IN  ftree_fabric_t * p_ftree,
+   IN  ftree_hca_t    * p_hca)
+{
+   ftree_sw_t     * p_sw;
+   osm_node_t     * p_osm_node = p_hca->p_osm_node;
+   osm_node_t     * p_remote_osm_node;
+   osm_physp_t    * p_osm_port;
+   uint16_t         i;
+   int res = 0;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca);
+
+   for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++)
+   {
+      p_osm_port = osm_node_get_physp_ptr(p_osm_node,i);
+      if (!osm_physp_is_valid(p_osm_port)) 
+         continue;
+      if (!osm_link_is_healthy(p_osm_port)) 
+         continue;
+
+      p_remote_osm_node = osm_node_get_remote_node(p_osm_node,i,NULL);
+      if (!p_remote_osm_node) continue;
+      switch (osm_node_get_type(p_remote_osm_node))
+      {
+         case IB_NODE_TYPE_CA:
+            /* HCA connected directly to another HCA - not FatTree */
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: "
+                    "HCA connected directly to another HCA: "
+                    "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
+                    cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
+                    cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)));
+            res = -1;
+            goto Exit;
+
+         case IB_NODE_TYPE_ROUTER:
+            /* leaving this port - proceeding to the next one */
+            continue;
+
+         case IB_NODE_TYPE_SWITCH:
+            /* continue with this port */
+            break;
+
+         default:
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: "
+                    "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
+                    cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)),
+                    ib_get_node_type_str(osm_node_get_type(p_remote_osm_node)));
+            res = -1;
+            goto Exit;
+      }
+
+      /* remote node is switch */
+
+      p_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,
+                                       p_osm_port->p_remote_physp->p_node->node_info.node_guid);
+
+      CL_ASSERT(p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
+
+      if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0)
+         continue;
+
+      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: "
+              "Marking rank of switch that is directly connected to HCA:\n"
+              "                                            - HCA guid   : 0x%016" PRIx64 "\n"
+              "                                            - Switch guid: 0x%016" PRIx64 "\n"
+              "                                            - Switch LID : 0x%x\n",
+              cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
+              cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
+              cl_ntoh16(p_sw->base_lid));
+      __osm_ftree_rank_from_switch(p_ftree, p_sw);
+   }
+
+ Exit:
+   OSM_LOG_EXIT(&(osm.log));
+   return res;
+} /* __osm_ftree_rank_switches_from_hca() */
+
+/***************************************************/
+
+static void 
+__osm_ftree_sw_reverse_rank(
+   IN  cl_map_item_t* const p_map_item, 
+   IN  void *context)
+{
+   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+   ftree_sw_t     * p_sw = (ftree_sw_t * const) p_map_item;
+   p_sw->rank = __osm_ftree_fabric_get_rank(p_ftree) - p_sw->rank - 1;
+}
+
+/***************************************************
+ ***************************************************/
+
+static int
+__osm_ftree_fabric_construct_hca_ports(
+   IN  ftree_fabric_t  * p_ftree, 
+   IN  ftree_hca_t     * p_hca)
+{
+   ftree_sw_t      * p_remote_sw;
+   osm_node_t      * p_node = p_hca->p_osm_node;
+   osm_node_t      * p_remote_node;
+   uint8_t           remote_node_type;
+   ib_net64_t        remote_node_guid;
+   osm_physp_t     * p_remote_osm_port;
+   uint16_t          i;
+   uint8_t           remote_port_num;
+   int res = 0;
+
+   for (i = 0; i < osm_node_get_num_physp(p_node); i++)
+   {
+      osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i);
+
+      if (!osm_physp_is_valid(p_osm_port)) 
+         continue;
+      if (!osm_link_is_healthy(p_osm_port)) 
+         continue;
+
+      p_remote_osm_port = osm_physp_get_remote(p_osm_port);
+      p_remote_node = osm_node_get_remote_node(p_node,i,&remote_port_num);
+
+      if (!p_remote_osm_port)
+         continue;
+
+      remote_node_type = osm_node_get_type(p_remote_node);
+      remote_node_guid = osm_node_get_node_guid(p_remote_node);
+
+      switch (remote_node_type)
+      {
+         case IB_NODE_TYPE_ROUTER:
+            /* leaving this port - proceeding to the next one */
+            continue;
+
+         case IB_NODE_TYPE_CA:
+            /* HCA connected directly to another HCA - not FatTree */
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
+                    "HCA connected directly to another HCA: "
+                    "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
+                    cl_ntoh64(osm_node_get_node_guid(p_node)),
+                    cl_ntoh64(remote_node_guid));
+            res = -1;
+            goto Exit;
+
+         case IB_NODE_TYPE_SWITCH:
+            /* continue with this port */
+            break;
+
+         default:
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: "
+                    "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
+                    cl_ntoh64(remote_node_guid),
+                    ib_get_node_type_str(remote_node_type));
+            res = -1;
+            goto Exit;
+      }
+
+      /* remote node is switch */
+
+      p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid);
+      CL_ASSERT( p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) );
+      CL_ASSERT( (p_remote_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree) );
+
+      __osm_ftree_hca_add_port(
+            p_hca,                                     /* local ftree_hca object */
+            i,                                         /* local port number */
+            remote_port_num,                           /* remote port number */
+            osm_node_get_base_lid(p_node, i),          /* local lid */
+            osm_node_get_lmc(p_node, i),               /* local lmc */
+            osm_node_get_base_lid(p_remote_node, 0),   /* remote lid */
+            osm_node_get_lmc(p_remote_node, 0),        /* remote lmc */
+            osm_physp_get_port_guid(p_osm_port),       /* local port guid */
+            osm_physp_get_port_guid(p_remote_osm_port),/* remote port guid */
+            remote_node_guid,                          /* remote node guid */
+            remote_node_type,                          /* remote node type */
+            (void *) p_remote_sw);                     /* remote ftree_hca/sw object */
+   }
+
+ Exit:
+   return res;
+} /* __osm_ftree_fabric_construct_hca_ports() */
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_fabric_construct_sw_ports(
+   IN  ftree_fabric_t  * p_ftree, 
+   IN  ftree_sw_t      * p_sw)
+{
+   ftree_hca_t       * p_remote_hca;
+   ftree_sw_t        * p_remote_sw;
+   osm_node_t        * p_node = osm_switch_get_node_ptr(p_sw->p_osm_sw);
+   osm_node_t        * p_remote_node;
+   ib_net16_t          remote_base_lid;
+   uint8_t             remote_lmc;
+   uint8_t             remote_node_type;
+   ib_net64_t          remote_node_guid;
+   osm_physp_t       * p_remote_osm_port;
+   ftree_direction_t   direction;
+   void              * p_remote_hca_or_sw;
+   uint16_t            i;
+   uint8_t             remote_port_num;
+   int res = 0;
+
+   CL_ASSERT(osm_node_get_type(p_node) == IB_NODE_TYPE_SWITCH);
+
+   for (i = 0; i < osm_node_get_num_physp(p_node); i++)
+   {
+      osm_physp_t * p_osm_port = osm_node_get_physp_ptr(p_node,i);
+
+      if (!osm_physp_is_valid(p_osm_port)) 
+         continue;
+      if (!osm_link_is_healthy(p_osm_port)) 
+         continue;
+
+      p_remote_osm_port = osm_physp_get_remote(p_osm_port);
+      p_remote_node = osm_node_get_remote_node(p_node,i,&remote_port_num);
+
+      if (!p_remote_osm_port)
+         continue;
+
+      remote_node_type = osm_node_get_type(p_remote_node);
+      remote_node_guid = osm_node_get_node_guid(p_remote_node);
+
+      switch (remote_node_type)
+      {
+         case IB_NODE_TYPE_ROUTER:
+            /* leaving this port - proceeding to the next one */
+            continue;
+
+         case IB_NODE_TYPE_CA:
+            /* switch connected to hca */
+
+            CL_ASSERT((p_sw->rank + 1) == __osm_ftree_fabric_get_rank(p_ftree));
+
+            p_remote_hca = (ftree_hca_t *)cl_qmap_get(&p_ftree->hca_tbl,remote_node_guid);
+            CL_ASSERT(p_remote_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl));
+
+            p_remote_hca_or_sw = (void *)p_remote_hca;
+            direction = FTREE_DIRECTION_DOWN;
+
+            remote_base_lid = osm_physp_get_base_lid(p_remote_osm_port);
+            remote_lmc = osm_physp_get_lmc(p_remote_osm_port);
+            break;
+
+         case IB_NODE_TYPE_SWITCH:
+            /* switch connected to another switch */
+
+            p_remote_sw = (ftree_sw_t *)cl_qmap_get(&p_ftree->sw_tbl,remote_node_guid);
+            CL_ASSERT(p_remote_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl));
+            CL_ASSERT(abs(p_sw->rank - p_remote_sw->rank) == 1);
+            p_remote_hca_or_sw = (void *)p_remote_sw;
+
+            if (p_sw->rank > p_remote_sw->rank)
+               direction = FTREE_DIRECTION_UP;
+            else
+               direction = FTREE_DIRECTION_DOWN;
+
+            /* switch LID is only in port 0 port_info structure */
+            remote_base_lid = osm_node_get_base_lid(p_remote_node, 0);
+            remote_lmc = osm_node_get_lmc(p_remote_node, 0);
+
+            break;
+
+         default:
+            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: "
+                    "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
+                    cl_ntoh64(remote_node_guid),
+                    ib_get_node_type_str(remote_node_type));
+            res = -1;
+            goto Exit;
+      }
+      __osm_ftree_sw_add_port(
+            p_sw,                                       /* local ftree_sw object */     
+            i,                                          /* local port number */          
+            remote_port_num,                            /* remote port number */         
+            p_sw->base_lid,                             /* local lid */                  
+            p_sw->lmc,                                  /* local lmc */                  
+            remote_base_lid,                            /* remote lid */                 
+            remote_lmc,                                 /* remote lmc */                 
+            osm_physp_get_port_guid(p_osm_port),        /* local port guid */            
+            osm_physp_get_port_guid(p_remote_osm_port), /* remote port guid */           
+            remote_node_guid,                           /* remote node guid */           
+            remote_node_type,                           /* remote node type */           
+            p_remote_hca_or_sw,                         /* remote ftree_hca/sw object */ 
+            direction);                                 /* port direction (up or down) */
+   }
+
+ Exit:
+   return res;
+} /* __osm_ftree_fabric_construct_sw_ports() */
+
+/***************************************************
+ ***************************************************/
+
+/* ToDo: improve ranking algorithm complexity
+   by propagating BFS from more nodes */
+static int
+__osm_ftree_fabric_perform_ranking(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_hca_t * p_hca;
+   ftree_hca_t * p_next_hca;
+   int res = 0;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking);
+
+   /* Mark REVERSED rank of all the switches in the subnet. 
+      Start from switches that are connected to hca's, and 
+      scan all the switches in the subnet. */
+   p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+   while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
+   {
+      p_hca = p_next_hca;
+      p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item );
+      if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0)
+      {
+         res = -1;
+         osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: "
+                 "Subnet ranking failed - subnet is not FatTree\n");
+         goto Exit;
+      }
+   }
+
+   /* calculate and set FatTree rank */
+   __osm_ftree_fabric_calculate_rank(p_ftree);
+   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: "
+           "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree));
+   
+   /* fix ranking of the switches by reversing the ranking direction */
+   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_sw_reverse_rank, (void *)p_ftree);
+
+   if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
+        __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
+   {
+      osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: "
+              "Tree rank is %u (should be between %u and %u)\n",
+              __osm_ftree_fabric_get_rank(p_ftree),
+              FAT_TREE_MIN_RANK,
+              FAT_TREE_MAX_RANK);
+      res = -1;
+      goto Exit;
+   }
+
+  Exit:
+   OSM_LOG_EXIT(&(osm.log));
+   return res;
+} /* __osm_ftree_fabric_perform_ranking() */
+
+/***************************************************
+ ***************************************************/
+
+static int
+__osm_ftree_fabric_populate_ports(
+   IN  ftree_fabric_t * p_ftree)
+{
+   ftree_hca_t * p_hca;
+   ftree_hca_t * p_next_hca;
+   ftree_sw_t * p_sw;
+   ftree_sw_t * p_next_sw;
+   int res = 0;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports);
+
+   p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
+   while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
+   {
+      p_hca = p_next_hca;
+      p_next_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item );
+      if (__osm_ftree_fabric_construct_hca_ports(p_ftree,p_hca) != 0)
+      {
+         res = -1;
+         goto Exit;
+      }
+   }
+
+   p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
+   while( p_next_sw != (ftree_sw_t *)cl_qmap_end( &p_ftree->sw_tbl ) )
+   {
+      p_sw = p_next_sw;
+      p_next_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item );
+      if (__osm_ftree_fabric_construct_sw_ports(p_ftree,p_sw) != 0)
+      {
+         res = -1;
+         goto Exit;
+      }
+   }
+ Exit:
+   OSM_LOG_EXIT(&(osm.log));
+   return res;
+} /* __osm_ftree_fabric_populate_ports() */
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_do_routing(void *context)
+{
+   ftree_fabric_t * p_ftree = context;
+   int status = 0;
+
+   OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing);
+
+   if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 )
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric has %u switches - topology is not fat-tree.\n"
+              "Falling back to default routing.\n",
+              cl_qmap_count(&osm.subn.sw_guid_tbl));
+      status = -1;
+      goto Exit;
+   }
+
+   if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - 
+         cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2)
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n"
+              "Falling back to default routing.\n",
+              cl_qmap_count(&osm.subn.node_guid_tbl),
+              cl_qmap_count(&osm.subn.sw_guid_tbl));
+      status = -1;
+      goto Exit;
+   }
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
+           "                       |------------------------------|\n"
+           "                       |-  Starting FatTree Routing  -|\n"
+           "                       |------------------------------|\n\n");
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Populating FatTree switch table\n");
+   /* ToDo: now that the pointer from node to switch exists,  
+      no need to fill the switch table in a separate loop */
+   if (__osm_ftree_fabric_populate_switches(p_ftree) != 0)
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric topology is not fat-tree - "
+              "falling back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Populating FatTree HCA table\n");
+   if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0)
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric topology is not fat-tree - "
+              "falling back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
+   if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric has %u HCAs - topology is not fat-tree.\n"
+              "Falling back to default routing.\n",
+              cl_qmap_count(&p_ftree->hca_tbl));
+      status = -1;
+      goto Exit;
+   }
+
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Ranking FatTree\n");
+   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
+   {
+      if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
+         osm_log(&osm.log, OSM_LOG_SYS,
+                 "Fabric rank is %u (>%u) - "
+                 "fat-tree routing falls back to default routing\n",
+                 __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK);
+      status = -1;
+      goto Exit;
+   }
+
+   /* For each hca and switch, construct array of ports.
+      This is done after the whole FatTree data structure is ready, because
+      we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Populating HCA & switch ports\n");
+   if (__osm_ftree_fabric_populate_ports(p_ftree) != 0)
+   {
+      osm_log(&osm.log, OSM_LOG_SYS,
+              "Fabric topology is not a fat-tree - "
+              "routing falls back to default routing\n");
+      status = -1;
+      goto Exit;
+   }
+
+   /* Assign index to all the switches and hca's in the fabric.
+      This function also sorts all the port arrays of the switches
+      by the remote switch index, creates a leaf switch array
+      sorted by the switch index, and tracks the maximal number of
+      hcas per leaf switch. */
+   __osm_ftree_fabric_make_indexing(p_ftree);
+
+   /* print general info about fabric topology */
+   __osm_ftree_fabric_dump_general_info(p_ftree);
+
+   /* dump full tree topology */
+   if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG))
+       __osm_ftree_fabric_dump(p_ftree);
+
+   if (! __osm_ftree_fabric_validate_topology(p_ftree))
+   {
+      status = -1;
+      goto Exit;
+   }
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Filling switch forwarding tables for routes to HCAs\n");
+   __osm_ftree_fabric_route_to_hcas(p_ftree);
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Filling switch forwarding tables for switch-to-switch paths\n");
+   __osm_ftree_fabric_route_to_switches(p_ftree);
+
+   /* for each switch, set its fwd table */
+   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL);
+
+   /* write out hca ordering file */
+   __osm_ftree_fabric_dump_hca_ordering(p_ftree);
+
+ Exit:
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Clearing FatTree Fabric data structures\n");
+   __osm_ftree_fabric_clear(p_ftree);
+
+   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
+           "                       |---------------------------------------|\n"
+           "                       |-  Done FatTree Routing (status = %d)  -|\n"
+           "                       |---------------------------------------|\n\n", status);
+
+   OSM_LOG_EXIT(&(osm.log));
+   return status;
+}
+
+/***************************************************
+ ***************************************************/
+
+static void 
+__osm_ftree_delete(void * context)
+{
+   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+   if (!p_ftree)
+      return;
+
+   __osm_ftree_fabric_destroy(p_ftree);
+
+}
+
+/***************************************************
+ ***************************************************/
+
+int osm_ucast_ftree_setup(osm_opensm_t * p_osm)
+{
+   ftree_fabric_t * p_ftree = __osm_ftree_fabric_create();
+   if (!p_ftree)
+      return -1;
+
+   p_osm->routing_engine.context = (void *)p_ftree;
+   p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing;
+   p_osm->routing_engine.delete = __osm_ftree_delete;
+   /* ToDo: fat-tree routing doesn't use min_hop tables, so we
+      shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */
+   return 0;
+}
+
+/***************************************************
+ ***************************************************/
+
-- 
1.4.4.1.GIT


From bugzilla-daemon at openib.org  Thu Dec 14 15:59:45 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 15:59:45 -0800 (PST)
Subject: [openib-general] [Bug 172] Need an interface to load alternate path
	to RC QP
Message-ID: <20061214235945.302852283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=172


venkatesh.babu at 3leafnetworks.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From bugzilla-daemon at openib.org  Thu Dec 14 16:00:13 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 16:00:13 -0800 (PST)
Subject: [openib-general] [Bug 160] OFED1.0: ib_modify_qp() of RC QP fails
	with -EINVAL
Message-ID: <20061215000013.BB99C2283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=160


venkatesh.babu at 3leafnetworks.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From sean.hefty at intel.com  Thu Dec 14 16:18:55 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 16:18:55 -0800
Subject: [openib-general] [RFC] [PATCH 0/1] ib_sa: add InformInfo
 registration for Notice reports
Message-ID: <000501c71fde$9c1c7560$8698070a@amr.corp.intel.com>

The following patch adds support to the ib_sa to allow users to register for
asynchronous events (traps and reports) from the SA.  The approach is similar to
that used by QLogic and suggested by Venkatesh, with the implementation based on
the approach used by multicast registration.

Users register to receive notices for a particular generic trap number.  The
notice sub-module tracks the number of registration requests for a given trap
number.  When necessary, un/registration requests are sent to the SA.
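
As a rough illustration of the bookkeeping described above, here is a
user-space sketch in plain C.  It is not the code from the patch: the names
trap_reg, reg_add, reg_remove and the fixed-size table are hypothetical, and
notice.c itself keeps its groups in an rb-tree keyed by trap number and sends
real SA queries from a workqueue.  The sketch only shows the idea that the SA
needs to be contacted for the first registration and the last unregistration
of a given trap number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRAPS 8	/* hypothetical table size, for illustration only */

struct trap_reg {
	uint16_t trap_number;
	int      members;	/* local users registered for this trap */
	int      in_use;
};

static struct trap_reg table[MAX_TRAPS];
static int sa_requests;	/* counts simulated un/registration messages */

static struct trap_reg *reg_find(uint16_t trap_number)
{
	int i;

	for (i = 0; i < MAX_TRAPS; i++)
		if (table[i].in_use && table[i].trap_number == trap_number)
			return &table[i];
	return NULL;
}

/* Register one more local user; returns 0, or -1 if the table is full. */
static int reg_add(uint16_t trap_number)
{
	struct trap_reg *reg = reg_find(trap_number);
	int i;

	if (!reg) {
		for (i = 0; i < MAX_TRAPS && table[i].in_use; i++)
			;
		if (i == MAX_TRAPS)
			return -1;
		reg = &table[i];
		reg->trap_number = trap_number;
		reg->members = 0;
		reg->in_use = 1;
	}
	if (reg->members++ == 0)
		sa_requests++;	/* first user: a subscribe would be sent */
	return 0;
}

/* Drop one local user; returns 0, or -1 if nobody is registered. */
static int reg_remove(uint16_t trap_number)
{
	struct trap_reg *reg = reg_find(trap_number);

	if (!reg)
		return -1;
	assert(reg->members > 0);
	if (--reg->members == 0) {
		sa_requests++;	/* last user gone: an unsubscribe would be sent */
		reg->in_use = 0;
	}
	return 0;
}
```

In this sketch, two reg_add() calls for the same trap number bump
sa_requests only once, and only the second matching reg_remove() bumps it
again.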

During initialization, the ib_sa module registers to receive unsolicited notice
reports.  When a notice is received, it is given to the notice sub-module for
dispatching.  The ib_sa generates a response to the notice report.
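
The dispatch step can be pictured with another small user-space sketch in
plain C.  Everything here is an assumption made for illustration: struct
notice, handler_add, notice_dispatch and the flat handler array are
hypothetical and far simpler than the kernel structures in the patch.  It
shows only that a received notice is matched against registrations by trap
number and handed to each matching callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HANDLERS 8	/* hypothetical limit, for illustration only */

struct notice {		/* hypothetical; much smaller than an SA Notice */
	uint16_t trap_number;
	uint32_t data;
};

typedef void (*notice_cb)(const struct notice *notice, void *context);

struct handler {
	uint16_t  trap_number;
	notice_cb cb;
	void     *context;
};

static struct handler handlers[MAX_HANDLERS];
static size_t num_handlers;

/* Remember a callback for one trap number; returns 0 or -1 on error. */
static int handler_add(uint16_t trap_number, notice_cb cb, void *context)
{
	if (num_handlers == MAX_HANDLERS || !cb)
		return -1;
	handlers[num_handlers].trap_number = trap_number;
	handlers[num_handlers].cb = cb;
	handlers[num_handlers].context = context;
	num_handlers++;
	return 0;
}

/* Hand a notice to every matching handler; returns the delivery count. */
static int notice_dispatch(const struct notice *notice)
{
	size_t i;
	int delivered = 0;

	assert(notice != NULL);
	for (i = 0; i < num_handlers; i++) {
		if (handlers[i].trap_number == notice->trap_number) {
			handlers[i].cb(notice, handlers[i].context);
			delivered++;
		}
	}
	return delivered;
}

/* Example callback: adds the notice data word to a uint32_t counter. */
static void count_cb(const struct notice *notice, void *context)
{
	*(uint32_t *)context += notice->data;
}
```

A notice whose trap number has no registered handler is simply dropped by
notice_dispatch() in this sketch, which returns 0 in that case.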

This patch is also available from my rdma_dev git tree, under the informinfo
branch.

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From sean.hefty at intel.com  Thu Dec 14 16:20:20 2006
From: sean.hefty at intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 16:20:20 -0800
Subject: [openib-general] [RFC] [PATCH 1/1] ib_sa: add InformInfo
 registration for Notice reports
In-Reply-To: <000501c71fde$9c1c7560$8698070a@amr.corp.intel.com>
Message-ID: <000601c71fde$ce561fe0$8698070a@amr.corp.intel.com>

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 189e5d4..2e9c4b2 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -12,7 +12,7 @@ ib_core-y :=			packer.o ud_header.o verb
 
 ib_mad-y :=			mad.o smi.o agent.o mad_rmpp.o
 
-ib_sa-y :=			sa_query.o multicast.o
+ib_sa-y :=			sa_query.o multicast.o notice.o
 
 ib_cm-y :=			cm.o
 
diff --git a/drivers/infiniband/core/notice.c b/drivers/infiniband/core/notice.c
new file mode 100644
index 0000000..038878d
--- /dev/null
+++ b/drivers/infiniband/core/notice.c
@@ -0,0 +1,750 @@
+/*
+ * Copyright (c) 2006 Intel Corporation.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/completion.h>
+#include <linux/dma-mapping.h>
+#include <linux/err.h>
+#include <linux/interrupt.h>
+#include <linux/pci.h>
+#include <linux/bitops.h>
+#include <linux/random.h>
+
+#include "sa.h"
+
+MODULE_AUTHOR("Sean Hefty");
+MODULE_DESCRIPTION("InfiniBand InformInfo & Notice event handling");
+MODULE_LICENSE("Dual BSD/GPL");
+
+static void inform_add_one(struct ib_device *device);
+static void inform_remove_one(struct ib_device *device);
+
+static struct ib_client inform_client = {
+	.name   = "ib_notice",
+	.add    = inform_add_one,
+	.remove = inform_remove_one
+};
+
+static struct ib_sa_client	sa_client;
+static struct ib_event_handler	event_handler;
+static struct workqueue_struct	*inform_wq;
+
+struct inform_device;
+
+struct inform_port {
+	struct inform_device	*dev;
+	spinlock_t		lock;
+	struct rb_root		table;
+	atomic_t		refcount;
+	struct completion	comp;
+	u8			port_num;
+};
+
+struct inform_device {
+	struct ib_device	*device;
+	int			start_port;
+	int			end_port;
+	struct inform_port	port[0];
+};
+
+enum inform_state {
+	INFORM_IDLE,
+	INFORM_REGISTERING,
+	INFORM_MEMBER,
+	INFORM_BUSY,
+	INFORM_ERROR
+};
+
+struct inform_member;
+
+struct inform_group {
+	u16			trap_number;
+	struct rb_node		node;
+	struct inform_port	*port;
+	spinlock_t		lock;
+	struct work_struct	work;
+	struct list_head	pending_list;
+	struct list_head	active_list;
+	struct list_head	notice_list;
+	struct inform_member	*last_join;
+	int			members;
+	enum inform_state	join_state; /* State relative to SA */
+	atomic_t		refcount;
+	enum inform_state	state;
+	struct ib_sa_query	*query;
+	int			query_id;
+};
+
+struct inform_member {
+	struct ib_inform_info	info;
+	struct ib_sa_client	*client;
+	struct inform_group	*group;
+	struct list_head	list;
+	enum inform_state	state;
+	atomic_t		refcount;
+	struct completion	comp;
+};
+
+struct inform_notice {
+	struct list_head	list;
+	struct ib_sa_notice	notice;
+};
+
+static void reg_handler(int status, struct ib_sa_inform *inform,
+			 void *context);
+static void unreg_handler(int status, struct ib_sa_inform *inform,
+			  void *context);
+
+static struct inform_group *inform_find(struct inform_port *port,
+					u16 trap_number)
+{
+	struct rb_node *node = port->table.rb_node;
+	struct inform_group *group;
+
+	while (node) {
+		group = rb_entry(node, struct inform_group, node);
+		if (trap_number < group->trap_number)
+			node = node->rb_left;
+		else if (trap_number > group->trap_number)
+			node = node->rb_right;
+		else
+			return group;
+	}
+	return NULL;
+}
+
+static struct inform_group *inform_insert(struct inform_port *port,
+					  struct inform_group *group)
+{
+	struct rb_node **link = &port->table.rb_node;
+	struct rb_node *parent = NULL;
+	struct inform_group *cur_group;
+
+	while (*link) {
+		parent = *link;
+		cur_group = rb_entry(parent, struct inform_group, node);
+		if (group->trap_number < cur_group->trap_number)
+			link = &(*link)->rb_left;
+		else if (group->trap_number > cur_group->trap_number)
+			link = &(*link)->rb_right;
+		else
+			return cur_group;
+	}
+	rb_link_node(&group->node, parent, link);
+	rb_insert_color(&group->node, &port->table);
+	return NULL;
+}
+
+static void deref_port(struct inform_port *port)
+{
+	if (atomic_dec_and_test(&port->refcount))
+		complete(&port->comp);
+}
+
+static void release_group(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	if (atomic_dec_and_test(&group->refcount)) {
+		rb_erase(&group->node, &port->table);
+		spin_unlock_irqrestore(&port->lock, flags);
+		kfree(group);
+		deref_port(port);
+	} else
+		spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void deref_member(struct inform_member *member)
+{
+	if (atomic_dec_and_test(&member->refcount))
+		complete(&member->comp);
+}
+
+static void queue_reg(struct inform_member *member)
+{
+	struct inform_group *group = member->group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&group->lock, flags);
+	list_add(&member->list, &group->pending_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		atomic_inc(&group->refcount);
+		queue_work(inform_wq, &group->work);
+	}
+	spin_unlock_irqrestore(&group->lock, flags);
+}
+
+static int send_reg(struct inform_group *group, struct inform_member *member)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.subscribe = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(member->info.trap_number);
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	group->last_join = member;
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, IB_MGMT_METHOD_SET, &inform,
+				     0, 3000, GFP_KERNEL, reg_handler, group,
+				     &group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static int send_unreg(struct inform_group *group)
+{
+	struct inform_port *port = group->port;
+	struct ib_sa_inform inform;
+	int ret;
+
+	memset(&inform, 0, sizeof inform);
+	inform.lid_range_begin = cpu_to_be16(0xFFFF);
+	inform.is_generic = 1;
+	inform.type = cpu_to_be16(IB_SA_EVENT_TYPE_ALL);
+	inform.trap.generic.trap_num = cpu_to_be16(group->trap_number);
+	inform.trap.generic.qpn = IB_QP1;
+	inform.trap.generic.resp_time = 19;
+	inform.trap.generic.producer_type =
+				cpu_to_be32(IB_SA_EVENT_PRODUCER_TYPE_ALL);
+
+	ret = ib_sa_informinfo_query(&sa_client, port->dev->device,
+				     port->port_num, IB_MGMT_METHOD_SET,
+				     &inform, 0, 3000, GFP_KERNEL,
+				     unreg_handler, group, &group->query);
+	if (ret >= 0) {
+		group->query_id = ret;
+		ret = 0;
+	}
+	return ret;
+}
+
+static void join_group(struct inform_group *group, struct inform_member *member)
+{
+	member->state = INFORM_MEMBER;
+	group->members++;
+	list_move(&member->list, &group->active_list);
+}
+
+static int fail_join(struct inform_group *group, struct inform_member *member,
+		     int status)
+{
+	spin_lock_irq(&group->lock);
+	list_del_init(&member->list);
+	spin_unlock_irq(&group->lock);
+	return member->info.callback(status, &member->info, NULL);
+}
+
+static void process_group_error(struct inform_group *group)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->active_list)) {
+		member = list_entry(group->active_list.next,
+				    struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		group->members--;
+		member->state = INFORM_ERROR;
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(-ENETRESET, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	group->join_state = INFORM_IDLE;
+	group->state = INFORM_BUSY;
+	spin_unlock_irq(&group->lock);
+}
+
+/*
+ * Report a notice to all active subscribers.  We use a temporary list to
+ * handle unsubscription requests while the notice is being reported, which
+ * avoids holding the group lock while in the user's callback.
+ */
+static void process_notice(struct inform_group *group,
+			   struct inform_notice *info_notice)
+{
+	struct inform_member *member;
+	struct list_head list;
+	int ret;
+
+	INIT_LIST_HEAD(&list);
+
+	spin_lock_irq(&group->lock);
+	list_splice_init(&group->active_list, &list);
+	while (!list_empty(&list)) {
+
+		member = list_entry(list.next, struct inform_member, list);
+		atomic_inc(&member->refcount);
+		list_move(&member->list, &group->active_list);
+		spin_unlock_irq(&group->lock);
+
+		ret = member->info.callback(0, &member->info,
+					    &info_notice->notice);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+	spin_unlock_irq(&group->lock);
+}
+
+static void inform_work_handler(void *data)
+{
+	struct inform_group *group = data;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	struct inform_notice *info_notice;
+	int status, ret;
+
+retest:
+	spin_lock_irq(&group->lock);
+	while (!list_empty(&group->pending_list) ||
+	       !list_empty(&group->notice_list) ||
+	       (group->state == INFORM_ERROR)) {
+
+		if (group->state == INFORM_ERROR) {
+			spin_unlock_irq(&group->lock);
+			process_group_error(group);
+			goto retest;
+		}
+
+		if (!list_empty(&group->notice_list)) {
+			info_notice = list_entry(group->notice_list.next,
+						 struct inform_notice, list);
+			list_del(&info_notice->list);
+			spin_unlock_irq(&group->lock);
+			process_notice(group, info_notice);
+			kfree(info_notice);
+			goto retest;
+		}
+
+		member = list_entry(group->pending_list.next,
+				    struct inform_member, list);
+		info = &member->info;
+		atomic_inc(&member->refcount);
+
+		if (group->join_state == INFORM_MEMBER) {
+			join_group(group, member);
+			spin_unlock_irq(&group->lock);
+			ret = info->callback(0, info, NULL);
+		} else {
+			spin_unlock_irq(&group->lock);
+			status = send_reg(group, member);
+			if (!status) {
+				deref_member(member);
+				return;
+			}
+			ret = fail_join(group, member, status);
+		}
+
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+		spin_lock_irq(&group->lock);
+	}
+
+	if (!group->members && (group->join_state == INFORM_MEMBER)) {
+		group->join_state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		if (send_unreg(group))
+			goto retest;
+	} else {
+		group->state = INFORM_IDLE;
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+}
+
+/*
+ * Fail a join request if it is still active - at the head of the pending queue.
+ */
+static void process_join_error(struct inform_group *group, int status)
+{
+	struct inform_member *member;
+	int ret;
+
+	spin_lock_irq(&group->lock);
+	member = list_entry(group->pending_list.next,
+			    struct inform_member, list);
+	if (group->last_join == member) {
+		atomic_inc(&member->refcount);
+		list_del_init(&member->list);
+		spin_unlock_irq(&group->lock);
+		ret = member->info.callback(status, &member->info, NULL);
+		deref_member(member);
+		if (ret)
+			ib_sa_unregister_inform_info(&member->info);
+	} else
+		spin_unlock_irq(&group->lock);
+}
+
+static void reg_handler(int status, struct ib_sa_inform *inform, void *context)
+{
+	struct inform_group *group = context;
+
+	if (status)
+		process_join_error(group, status);
+	else
+		group->join_state = INFORM_MEMBER;
+
+	inform_work_handler(group);
+}
+
+static void unreg_handler(int status, struct ib_sa_inform *rec, void *context)
+{
+	inform_work_handler(context);
+}
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	struct inform_group *group;
+	struct inform_notice *info_notice;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return 0; /* No one to give notice to. */
+
+	port = &dev->port[port_num - dev->start_port];
+	spin_lock_irq(&port->lock);
+	group = inform_find(port, __be16_to_cpu(notice->trap.
+						generic.trap_num));
+	if (!group) {
+		spin_unlock_irq(&port->lock);
+		return 0;
+	}
+
+	atomic_inc(&group->refcount);
+	spin_unlock_irq(&port->lock);
+
+	info_notice = kmalloc(sizeof *info_notice, GFP_KERNEL);
+	if (!info_notice) {
+		release_group(group);
+		return -ENOMEM;
+	}
+
+	info_notice->notice = *notice;
+
+	spin_lock_irq(&group->lock);
+	list_add(&info_notice->list, &group->notice_list);
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		inform_work_handler(group);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	return 0;
+}
+
+static struct inform_group *acquire_group(struct inform_port *port,
+					  u16 trap_number, gfp_t gfp_mask)
+{
+	struct inform_group *group, *cur_group;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	group = inform_find(port, trap_number);
+	if (group)
+		goto found;
+	spin_unlock_irqrestore(&port->lock, flags);
+
+	group = kzalloc(sizeof *group, gfp_mask);
+	if (!group)
+		return NULL;
+
+	group->port = port;
+	group->trap_number = trap_number;
+	INIT_LIST_HEAD(&group->pending_list);
+	INIT_LIST_HEAD(&group->active_list);
+	INIT_LIST_HEAD(&group->notice_list);
+	INIT_WORK(&group->work, inform_work_handler, group);
+	spin_lock_init(&group->lock);
+
+	spin_lock_irqsave(&port->lock, flags);
+	cur_group = inform_insert(port, group);
+	if (cur_group) {
+		kfree(group);
+		group = cur_group;
+	} else
+		atomic_inc(&port->refcount);
+found:
+	atomic_inc(&group->refcount);
+	spin_unlock_irqrestore(&port->lock, flags);
+	return group;
+}
+
+/*
+ * We serialize all join requests to a single group to make our lives much
+ * easier.  Otherwise, two users could try to join the same group
+ * simultaneously, with different configurations, one could leave while the
+ * join is in progress, etc., which makes locking around error recovery
+ * difficult.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context)
+{
+	struct inform_device *dev;
+	struct inform_member *member;
+	struct ib_inform_info *info;
+	int ret;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return ERR_PTR(-ENODEV);
+
+	member = kzalloc(sizeof *member, gfp_mask);
+	if (!member)
+		return ERR_PTR(-ENOMEM);
+
+	ib_sa_client_get(client);
+	member->client = client;
+	member->info.trap_number = trap_number;
+	member->info.callback = callback;
+	member->info.context = context;
+	init_completion(&member->comp);
+	atomic_set(&member->refcount, 1);
+	member->state = INFORM_REGISTERING;
+
+	member->group = acquire_group(&dev->port[port_num - dev->start_port],
+				      trap_number, gfp_mask);
+	if (!member->group) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	/*
+	 * The user will get the info structure in their callback.  They
+	 * could then free the info structure before we can return from
+	 * this routine.  So we save the pointer to return before queuing
+	 * any callback.
+	 */
+	info = &member->info;
+	queue_reg(member);
+	return info;
+
+err:
+	ib_sa_client_put(member->client);
+	kfree(member);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(ib_sa_register_inform_info);
+
+void ib_sa_unregister_inform_info(struct ib_inform_info *info)
+{
+	struct inform_member *member;
+	struct inform_group *group;
+
+	member = container_of(info, struct inform_member, info);
+	group = member->group;
+
+	spin_lock_irq(&group->lock);
+	if (member->state == INFORM_MEMBER)
+		group->members--;
+
+	list_del_init(&member->list);
+
+	if (group->state == INFORM_IDLE) {
+		group->state = INFORM_BUSY;
+		spin_unlock_irq(&group->lock);
+		/* Continue to hold reference on group until callback */
+		queue_work(inform_wq, &group->work);
+	} else {
+		spin_unlock_irq(&group->lock);
+		release_group(group);
+	}
+
+	deref_member(member);
+	wait_for_completion(&member->comp);
+	ib_sa_client_put(member->client);
+	kfree(member);
+}
+EXPORT_SYMBOL(ib_sa_unregister_inform_info);
+
+static void inform_groups_lost(struct inform_port *port)
+{
+	struct inform_group *group;
+	struct rb_node *node;
+	unsigned long flags;
+
+	spin_lock_irqsave(&port->lock, flags);
+	for (node = rb_first(&port->table); node; node = rb_next(node)) {
+		group = rb_entry(node, struct inform_group, node);
+		spin_lock(&group->lock);
+		if (group->state == INFORM_IDLE) {
+			atomic_inc(&group->refcount);
+			queue_work(inform_wq, &group->work);
+		}
+		group->state = INFORM_ERROR;
+		spin_unlock(&group->lock);
+	}
+	spin_unlock_irqrestore(&port->lock, flags);
+}
+
+static void inform_event_handler(struct ib_event_handler *handler,
+				struct ib_event *event)
+{
+	struct inform_device *dev;
+
+	dev = ib_get_client_data(event->device, &inform_client);
+	if (!dev)
+		return;
+
+	switch (event->event) {
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_LID_CHANGE:
+	case IB_EVENT_SM_CHANGE:
+	case IB_EVENT_CLIENT_REREGISTER:
+		inform_groups_lost(&dev->port[event->element.port_num -
+					      dev->start_port]);
+		break;
+	default:
+		break;
+	}
+}
+
+static void inform_add_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+		return;
+
+	dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
+		      GFP_KERNEL);
+	if (!dev)
+		return;
+
+	if (device->node_type == RDMA_NODE_IB_SWITCH)
+		dev->start_port = dev->end_port = 0;
+	else {
+		dev->start_port = 1;
+		dev->end_port = device->phys_port_cnt;
+	}
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		port->dev = dev;
+		port->port_num = dev->start_port + i;
+		spin_lock_init(&port->lock);
+		port->table = RB_ROOT;
+		init_completion(&port->comp);
+		atomic_set(&port->refcount, 1);
+	}
+
+	dev->device = device;
+	ib_set_client_data(device, &inform_client, dev);
+
+	INIT_IB_EVENT_HANDLER(&event_handler, device, inform_event_handler);
+	ib_register_event_handler(&event_handler);
+}
+
+static void inform_remove_one(struct ib_device *device)
+{
+	struct inform_device *dev;
+	struct inform_port *port;
+	int i;
+
+	dev = ib_get_client_data(device, &inform_client);
+	if (!dev)
+		return;
+
+	ib_unregister_event_handler(&event_handler);
+	flush_workqueue(inform_wq);
+
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
+		port = &dev->port[i];
+		deref_port(port);
+		wait_for_completion(&port->comp);
+	}
+
+	kfree(dev);
+}
+
+int notice_init(void)
+{
+	int ret;
+
+	inform_wq = create_singlethread_workqueue("ib_inform_wq");
+	if (!inform_wq)
+		return -ENOMEM;
+
+	ib_sa_register_client(&sa_client);
+
+	ret = ib_register_client(&inform_client);
+	if (ret)
+		goto err;
+	return 0;
+
+err:
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+	return ret;
+}
+
+void notice_cleanup(void)
+{
+	ib_unregister_client(&inform_client);
+	ib_sa_unregister_client(&sa_client);
+	destroy_workqueue(inform_wq);
+}
diff --git a/drivers/infiniband/core/sa.h b/drivers/infiniband/core/sa.h
index 24c93fd..31cde28 100644
--- a/drivers/infiniband/core/sa.h
+++ b/drivers/infiniband/core/sa.h
@@ -63,4 +63,21 @@ int ib_sa_mcmember_rec_query(struct ib_s
 int mcast_init(void);
 void mcast_cleanup(void);
 
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num, u8 method,
+			   struct ib_sa_inform *rec,
+			   ib_sa_comp_mask comp_mask,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					    struct ib_sa_inform *resp,
+					    void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query);
+
+int notice_dispatch(struct ib_device *device, u8 port_num,
+		    struct ib_sa_notice *notice);
+
+int notice_init(void);
+void notice_cleanup(void);
+
 #endif /* SA_H */
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index ea78687..88c228c 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -61,10 +61,12 @@ struct ib_sa_sm_ah {
 
 struct ib_sa_port {
 	struct ib_mad_agent *agent;
+	struct ib_mad_agent *notice_agent;
 	struct ib_sa_sm_ah  *sm_ah;
 	struct work_struct   update_task;
 	spinlock_t           ah_lock;
 	u8                   port_num;
+	struct ib_device    *device;
 };
 
 struct ib_sa_device {
@@ -101,6 +103,12 @@ struct ib_sa_mcmember_query {
 	struct ib_sa_query sa_query;
 };
 
+struct ib_sa_inform_query {
+	void (*callback)(int, struct ib_sa_inform *, void *);
+	void *context;
+	struct ib_sa_query sa_query;
+};
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device);
 
@@ -352,6 +360,110 @@ static const struct ib_field service_rec
 	  .size_bits    = 2*64 },
 };
 
+#define INFORM_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_inform, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_inform *) 0)->field, \
+	.field_name          = "sa_inform:" #field
+
+static const struct ib_field inform_table[] = {
+	{ INFORM_FIELD(gid),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+	{ INFORM_FIELD(lid_range_begin),
+	  .offset_words = 4,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(lid_range_end),
+	  .offset_words = 4,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ RESERVED,
+	  .offset_words = 5,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(is_generic),
+	  .offset_words = 5,
+	  .offset_bits  = 16,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(subscribe),
+	  .offset_words = 5,
+	  .offset_bits  = 24,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(type),
+	  .offset_words = 6,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.trap_num),
+	  .offset_words = 6,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ INFORM_FIELD(trap.generic.qpn),
+	  .offset_words = 7,
+	  .offset_bits  = 0,
+	  .size_bits    = 24 },
+	{ RESERVED,
+	  .offset_words = 7,
+	  .offset_bits  = 24,
+	  .size_bits    = 3 },
+	{ INFORM_FIELD(trap.generic.resp_time),
+	  .offset_words = 7,
+	  .offset_bits  = 27,
+	  .size_bits    = 5 },
+	{ RESERVED,
+	  .offset_words = 8,
+	  .offset_bits  = 0,
+	  .size_bits    = 8 },
+	{ INFORM_FIELD(trap.generic.producer_type),
+	  .offset_words = 8,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+};
+
+#define NOTICE_FIELD(field) \
+	.struct_offset_bytes = offsetof(struct ib_sa_notice, field), \
+	.struct_size_bytes   = sizeof ((struct ib_sa_notice *) 0)->field, \
+	.field_name          = "sa_notice:" #field
+
+static const struct ib_field notice_table[] = {
+	{ NOTICE_FIELD(is_generic),
+	  .offset_words = 0,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(type),
+	  .offset_words = 0,
+	  .offset_bits  = 1,
+	  .size_bits    = 7 },
+	{ NOTICE_FIELD(trap.generic.producer_type),
+	  .offset_words = 0,
+	  .offset_bits  = 8,
+	  .size_bits    = 24 },
+	{ NOTICE_FIELD(trap.generic.trap_num),
+	  .offset_words = 1,
+	  .offset_bits  = 0,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(issuer_lid),
+	  .offset_words = 1,
+	  .offset_bits  = 16,
+	  .size_bits    = 16 },
+	{ NOTICE_FIELD(notice_toggle),
+	  .offset_words = 2,
+	  .offset_bits  = 0,
+	  .size_bits    = 1 },
+	{ NOTICE_FIELD(notice_count),
+	  .offset_words = 2,
+	  .offset_bits  = 1,
+	  .size_bits    = 15 },
+	{ NOTICE_FIELD(data_details),
+	  .offset_words = 2,
+	  .offset_bits  = 16,
+	  .size_bits    = 432 },
+	{ NOTICE_FIELD(issuer_gid),
+	  .offset_words = 16,
+	  .offset_bits  = 0,
+	  .size_bits    = 128 },
+};
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -890,6 +1002,156 @@ err1:
 	return ret;
 }
 
+static void ib_sa_inform_callback(struct ib_sa_query *sa_query,
+				  int status,
+				  struct ib_sa_mad *mad)
+{
+	struct ib_sa_inform_query *query =
+		container_of(sa_query, struct ib_sa_inform_query, sa_query);
+
+	if (mad) {
+		struct ib_sa_inform rec;
+
+		ib_unpack(inform_table, ARRAY_SIZE(inform_table),
+			  mad->data, &rec);
+		query->callback(status, &rec, query->context);
+	} else
+		query->callback(status, NULL, query->context);
+}
+
+static void ib_sa_inform_release(struct ib_sa_query *sa_query)
+{
+	kfree(container_of(sa_query, struct ib_sa_inform_query, sa_query));
+}
+
+/**
+ * ib_sa_informinfo_query - Start an InformInfo registration.
+ * @client: SA client
+ * @device: device to send query on
+ * @port_num: port number to send query on
+ * @rec: Inform record to send in query
+ * @comp_mask: component mask to send in query
+ * @timeout_ms: time to wait for response
+ * @gfp_mask: GFP mask to use for internal allocations
+ * @callback: function called when notice handler registration completes,
+ * times out or is canceled
+ * @context: opaque user context passed to callback
+ * @sa_query: query context, used to cancel query
+ *
+ * This function sends an InformInfo record to the SA, using the given
+ * method, to register for (or unregister from) notices.
+ * The callback function will be called when the query completes (or
+ * fails); status is 0 for a successful response, -EINTR if the query
+ * is canceled, -ETIMEDOUT if the query timed out, or -EIO if an error
+ * occurred sending the query.  The resp parameter of the callback is
+ * only valid if status is 0.
+ *
+ * If the return value of ib_sa_informinfo_query() is negative, it is an
+ * error code.  Otherwise it is a query ID that can be used to cancel
+ * the query.
+ */
+int ib_sa_informinfo_query(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num, u8 method,
+			   struct ib_sa_inform *rec,
+			   ib_sa_comp_mask comp_mask,
+			   int timeout_ms, gfp_t gfp_mask,
+			   void (*callback)(int status,
+					   struct ib_sa_inform *resp,
+					   void *context),
+			   void *context,
+			   struct ib_sa_query **sa_query)
+{
+	struct ib_sa_inform_query *query;
+	struct ib_sa_device *sa_dev = ib_get_client_data(device, &sa_client);
+	struct ib_sa_port   *port;
+	struct ib_mad_agent *agent;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	if (!sa_dev)
+		return -ENODEV;
+
+	port  = &sa_dev->port[port_num - sa_dev->start_port];
+	agent = port->agent;
+
+	query = kmalloc(sizeof *query, gfp_mask);
+	if (!query)
+		return -ENOMEM;
+
+	query->sa_query.mad_buf = ib_create_send_mad(agent, 1, 0,
+						     0, IB_MGMT_SA_HDR,
+						     IB_MGMT_SA_DATA, gfp_mask);
+	if (!query->sa_query.mad_buf) {
+		ret = -ENOMEM;
+		goto err1;
+	}
+
+	ib_sa_client_get(client);
+	query->sa_query.client = client;
+	query->callback = callback;
+	query->context  = context;
+
+	mad = query->sa_query.mad_buf->mad;
+	init_mad(mad, agent);
+
+	query->sa_query.callback = callback ? ib_sa_inform_callback : NULL;
+	query->sa_query.release  = ib_sa_inform_release;
+	query->sa_query.port     = port;
+	mad->mad_hdr.method	 = method;
+	mad->mad_hdr.attr_id	 = cpu_to_be16(IB_SA_ATTR_INFORM_INFO);
+	mad->sa_hdr.comp_mask	 = comp_mask;
+
+	ib_pack(inform_table, ARRAY_SIZE(inform_table), rec, mad->data);
+
+	*sa_query = &query->sa_query;
+	ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
+	if (ret < 0)
+		goto err2;
+
+	return ret;
+
+err2:
+	*sa_query = NULL;
+	ib_sa_client_put(query->sa_query.client);
+	ib_free_send_mad(query->sa_query.mad_buf);
+err1:
+	kfree(query);
+	return ret;
+}
+
+static void ib_sa_notice_resp(struct ib_sa_port *port,
+			      struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_mad_send_buf *mad_buf;
+	struct ib_sa_mad *mad;
+	int ret;
+
+	mad_buf = ib_create_send_mad(port->notice_agent, 1, 0, 0,
+				     IB_MGMT_SA_HDR, IB_MGMT_SA_DATA,
+				     GFP_KERNEL);
+	if (!mad_buf)
+		return;
+
+	mad = mad_buf->mad;
+	memcpy(mad, &mad_recv_wc->recv_buf.mad, sizeof *mad);
+	mad->mad_hdr.method = IB_MGMT_METHOD_REPORT_RESP;
+
+	spin_lock_irq(&port->ah_lock);
+	kref_get(&port->sm_ah->ref);
+	mad_buf->context[0] = &port->sm_ah->ref;
+	mad_buf->ah = port->sm_ah->ah;
+	spin_unlock_irq(&port->ah_lock);
+
+	ret = ib_post_send_mad(mad_buf, NULL);
+	if (ret)
+		goto err;
+
+	return;
+err:
+	kref_put(mad_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_buf);
+}
+
 static void send_handler(struct ib_mad_agent *agent,
 			 struct ib_mad_send_wc *mad_send_wc)
 {
@@ -944,9 +1206,36 @@ static void recv_handler(struct ib_mad_a
 	ib_free_recv_mad(mad_recv_wc);
 }
 
+static void notice_resp_handler(struct ib_mad_agent *agent,
+				struct ib_mad_send_wc *mad_send_wc)
+{
+	kref_put(mad_send_wc->send_buf->context[0], free_sm_ah);
+	ib_free_send_mad(mad_send_wc->send_buf);
+}
+
+static void notice_handler(struct ib_mad_agent *mad_agent,
+			   struct ib_mad_recv_wc *mad_recv_wc)
+{
+	struct ib_sa_port *port;
+	struct ib_sa_mad *mad;
+	struct ib_sa_notice notice;
+
+	port = mad_agent->context;
+	mad = (struct ib_sa_mad *) mad_recv_wc->recv_buf.mad;
+	ib_unpack(notice_table, ARRAY_SIZE(notice_table), mad->data, &notice);
+
+	if (!notice_dispatch(port->device, port->port_num, &notice))
+		ib_sa_notice_resp(port, mad_recv_wc);
+	ib_free_recv_mad(mad_recv_wc);
+}
+
 static void ib_sa_add_one(struct ib_device *device)
 {
 	struct ib_sa_device *sa_dev;
+	struct ib_mad_reg_req reg_req = {
+		.mgmt_class = IB_MGMT_CLASS_SUBN_ADM,
+		.mgmt_class_version = 2
+	};
 	int s, e, i;
 
 	if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
@@ -980,6 +1269,16 @@ static void ib_sa_add_one(struct ib_devi
 		if (IS_ERR(sa_dev->port[i].agent))
 			goto err;
 
+		sa_dev->port[i].device = device;
+		set_bit(IB_MGMT_METHOD_REPORT, reg_req.method_mask);
+		sa_dev->port[i].notice_agent =
+			ib_register_mad_agent(device, i + s, IB_QPT_GSI,
+					      &reg_req, 0, notice_resp_handler,
+					      notice_handler, &sa_dev->port[i]);
+
+		if (IS_ERR(sa_dev->port[i].notice_agent))
+			goto err;
+
 		INIT_WORK(&sa_dev->port[i].update_task,
 			  update_sm_ah, &sa_dev->port[i]);
 	}
@@ -1003,8 +1302,14 @@ static void ib_sa_add_one(struct ib_devi
 	return;
 
 err:
-	while (--i >= 0)
-		ib_unregister_mad_agent(sa_dev->port[i].agent);
+	while (--i >= 0) {
+		if (!IS_ERR(sa_dev->port[i].notice_agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
+		}
+		if (!IS_ERR(sa_dev->port[i].agent)) {
+			ib_unregister_mad_agent(sa_dev->port[i].agent);
+		}
+	}
 
 	kfree(sa_dev);
 
@@ -1024,6 +1329,7 @@ static void ib_sa_remove_one(struct ib_d
 	flush_scheduled_work();
 
 	for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
+		ib_unregister_mad_agent(sa_dev->port[i].notice_agent);
 		ib_unregister_mad_agent(sa_dev->port[i].agent);
 		kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
 	}
@@ -1052,7 +1358,15 @@ static int __init ib_sa_init(void)
 		goto err2;
 	}
 
+	ret = notice_init();
+	if (ret) {
+		printk(KERN_ERR "Couldn't initialize notice handling\n");
+		goto err3;
+	}
+
 	return 0;
+err3:
+	mcast_cleanup();
 err2:
 	ib_unregister_client(&sa_client);
 err1:
@@ -1062,6 +1376,7 @@ err1:
 static void __exit ib_sa_cleanup(void)
 {
 	mcast_cleanup();
+	notice_cleanup();
 	ib_unregister_client(&sa_client);
 	idr_destroy(&query_idr);
 }
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 3b957e5..1bbf88a 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -254,6 +254,143 @@ struct ib_sa_service_rec {
 	u64		data64[2];
 };
 
+enum {
+	IB_SA_EVENT_TYPE_FATAL		= 0x0,
+	IB_SA_EVENT_TYPE_URGENT		= 0x1,
+	IB_SA_EVENT_TYPE_SECURITY	= 0x2,
+	IB_SA_EVENT_TYPE_SM		= 0x3,
+	IB_SA_EVENT_TYPE_INFO		= 0x4,
+	IB_SA_EVENT_TYPE_EMPTY		= 0x7F,
+	IB_SA_EVENT_TYPE_ALL		= 0xFFFF
+};
+
+enum {
+	IB_SA_EVENT_PRODUCER_TYPE_CA		= 0x1,
+	IB_SA_EVENT_PRODUCER_TYPE_SWITCH	= 0x2,
+	IB_SA_EVENT_PRODUCER_TYPE_ROUTER	= 0x3,
+	IB_SA_EVENT_PRODUCER_TYPE_CLASS_MANAGER	= 0x4,
+	IB_SA_EVENT_PRODUCER_TYPE_ALL		= 0xFFFFFF
+};
+
+enum {
+	IB_SA_SM_TRAP_GID_IN_SERVICE			= 64,
+	IB_SA_SM_TRAP_GID_OUT_OF_SERVICE		= 65,
+	IB_SA_SM_TRAP_CREATE_MC_GROUP			= 66,
+	IB_SA_SM_TRAP_DELETE_MC_GROUP			= 67,
+	IB_SA_SM_TRAP_PORT_CHANGE_STATE			= 128,
+	IB_SA_SM_TRAP_LINK_INTEGRITY			= 129,
+	IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN		= 130,
+	IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131,
+	IB_SA_SM_TRAP_BAD_M_KEY				= 256,
+	IB_SA_SM_TRAP_BAD_P_KEY				= 257,
+	IB_SA_SM_TRAP_BAD_Q_KEY				= 258,
+	IB_SA_SM_TRAP_ALL				= 0xFFFF
+};
+
+#define IB_SA_INFORM_GID				IB_SA_COMP_MASK( 0)
+#define IB_SA_INFORM_LID_RANGE_BEGIN			IB_SA_COMP_MASK( 1)
+#define IB_SA_INFORM_LID_RANGE_END			IB_SA_COMP_MASK( 2)
+/* reserved:								 3 */
+#define IB_SA_INFORM_IS_GENERIC				IB_SA_COMP_MASK( 4)
+#define IB_SA_INFORM_SUBCRIBE				IB_SA_COMP_MASK( 5)
+#define IB_SA_INFORM_TYPE				IB_SA_COMP_MASK( 6)
+
+#define IB_SA_INFORM_TRAP_NUMBER			IB_SA_COMP_MASK( 7)
+#define IB_SA_INFORM_DEVICE_ID				IB_SA_COMP_MASK( 7)
+#define IB_SA_INFORM_QPN				IB_SA_COMP_MASK( 8)
+/* reserved:								 9 */
+#define IB_SA_INFORM_RESP_TIME				IB_SA_COMP_MASK(10)
+/* reserved:								11 */
+#define IB_SA_INFORM_PRODUCER_TYPE			IB_SA_COMP_MASK(12)
+#define IB_SA_INFORM_VENDOR_ID				IB_SA_COMP_MASK(12)
+
+struct ib_sa_inform {
+	union ib_gid	gid;
+	__be16		lid_range_begin;
+	__be16		lid_range_end;
+	u8		is_generic;
+	u8		subscribe;
+	__be16		type;
+	union {
+		struct {
+			__be16	trap_num;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	producer_type;
+		} generic;
+		struct {
+			__be16	device_id;
+			__be32	qpn;
+			u8	resp_time;
+			__be32	vendor_id;
+		} vendor;
+	} trap;
+};
+
+struct ib_sa_notice {
+	u8		is_generic;
+	u8		type;
+	union {
+		struct {
+			__be32	producer_type;
+			__be16	trap_num;
+		} generic;
+		struct {
+			__be32	vendor_id;
+			__be16	device_id;
+		} vendor;
+	} trap;
+	__be16		issuer_lid;
+	__be16		notice_count;
+	u8		notice_toggle;
+	/*
+	 * Align data 16 bits off 64 bit field to match InformInfo definition.
+	 * Data contained within this field will then align properly.
+	 * See IB spec 1.2, sections 13.4.8.2 and 14.2.5.1.
+	 */
+	u8		reserved[5];
+	u8		data_details[54];
+	union ib_gid	issuer_gid;
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_GID_IN_SERVICE		= 64
+ * IB_SA_SM_TRAP_GID_OUT_OF_SERVICE	= 65
+ * IB_SA_SM_TRAP_CREATE_MC_GROUP	= 66
+ * IB_SA_SM_TRAP_DELETE_MC_GROUP	= 67
+ */
+struct ib_sa_notice_data_gid {
+	u8	reserved[6];
+	u8	gid[16];
+	u8	padding[32];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_PORT_CHANGE_STATE	= 128
+ */
+struct ib_sa_notice_data_port_change {
+	__be16	lid;
+	u8	padding[52];
+};
+
+/*
+ * SM notice data details for:
+ *
+ * IB_SA_SM_TRAP_LINK_INTEGRITY			= 129
+ * IB_SA_SM_TRAP_EXCESSIVE_BUFFER_OVERRUN	= 130
+ * IB_SA_SM_TRAP_FLOW_CONTROL_UPDATE_EXPIRED	= 131
+ */
+struct ib_sa_notice_data_port_error {
+	u8	reserved[2];
+	__be16	lid;
+	u8	port_num;
+	u8	padding[49];
+};
+
 struct ib_sa_client {
 	atomic_t users;
 	struct completion comp;
@@ -387,4 +524,54 @@ int ib_init_ah_from_path(struct ib_devic
 			 struct ib_sa_path_rec *rec,
 			 struct ib_ah_attr *ah_attr);
 
+struct ib_inform_info {
+	void		*context;
+	int		(*callback)(int status,
+				    struct ib_inform_info *info,
+				    struct ib_sa_notice *notice);
+	u16		trap_number;
+};
+
+/**
+ * ib_sa_register_inform_info - Registers to receive notice events.
+ * @device: Device associated with the registration.
+ * @port_num: Port on the specified device to associate with the registration.
+ * @trap_number: InformInfo trap number to register for.
+ * @gfp_mask: GFP mask for memory allocations.
+ * @callback: User callback invoked once the registration completes and to
+ *   report notice events.
+ * @context: User specified context stored with the ib_inform_info structure.
+ *
+ * This call initiates a registration request with the SA for the specified
+ * trap number.  If the operation is started successfully, it returns
+ * an ib_inform_info structure that is used to track the registration operation.
+ * Users must free this structure by calling ib_sa_unregister_inform_info,
+ * even if the operation later fails.  (The callback status is non-zero.)
+ *
+ * If the registration fails, the callback status will be non-zero.  If the
+ * registration succeeds, the callback status will be zero, but the notice
+ * parameter will be NULL.  If the notice parameter is not NULL, a trap or
+ * notice is being reported to the user.
+ *
+ * A status of -ENETRESET indicates that an error occurred which requires
+ * reregistration.
+ */
+struct ib_inform_info *
+ib_sa_register_inform_info(struct ib_sa_client *client,
+			   struct ib_device *device, u8 port_num,
+			   u16 trap_number, gfp_t gfp_mask,
+			   int (*callback)(int status,
+					   struct ib_inform_info *info,
+					   struct ib_sa_notice *notice),
+			   void *context);
+
+/**
+ * ib_sa_unregister_inform_info - Releases an InformInfo registration.
+ * @info: InformInfo registration tracking structure.
+ *
+ * This call blocks until the registration request is destroyed.  It may
+ * not be called from within the registration callback.
+ */
+void ib_sa_unregister_inform_info(struct ib_inform_info *info);
+
 #endif /* IB_SA_H */
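
The callback contract documented above distinguishes four cases by the
(status, notice) pair. A minimal userspace sketch of that decision logic
follows. It is not part of the patch: struct mock_notice stands in for the
kernel's struct ib_sa_notice, and the enum and function names are
hypothetical, chosen only for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_sa_notice; only its address is used. */
struct mock_notice {
	int unused;
};

/* Hypothetical labels for the cases described in the header comment. */
enum inform_event {
	INFORM_REGISTERED,	/* status == 0, notice == NULL */
	INFORM_NOTICE,		/* status == 0, notice != NULL */
	INFORM_REREGISTER,	/* status == -ENETRESET */
	INFORM_FAILED		/* any other non-zero status */
};

/* Map one callback invocation to the case it represents. */
static enum inform_event
classify_inform_callback(int status, const struct mock_notice *notice)
{
	if (status == -ENETRESET)
		return INFORM_REREGISTER;
	if (status)
		return INFORM_FAILED;
	return notice ? INFORM_NOTICE : INFORM_REGISTERED;
}
```

In every case the registration structure still has to be released with
ib_sa_unregister_inform_info, and not from inside the callback itself.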


From bugzilla-daemon at openib.org  Thu Dec 14 16:25:43 2006
From: bugzilla-daemon at openib.org (bugzilla-daemon at openib.org)
Date: Thu, 14 Dec 2006 16:25:43 -0800 (PST)
Subject: [openib-general] [Bug 159] OFED1.0: Missing interfaces
Message-ID: <20061215002543.2EF132283D4@openib.ca.sandia.gov>

http://openib.org/bugzilla/show_bug.cgi?id=159


------- Comment #5 from sean.hefty at intel.com  2006-12-14 16:25 -------
A proposed interface and implementation to register for SA events is available
at: git://staging.openfabrics.org/~shefty/rdma-dev.git under the branch
informinfo.


------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


From xma at us.ibm.com  Thu Dec 14 16:53:26 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Thu, 14 Dec 2006 16:53:26 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061210134137.GL29174@mellanox.co.il>
Message-ID: <OFA1309FD2.CA0008BE-ON87257245.00047D5B-88257245.0004E47D@us.ibm.com>


Hi, Michael,

I tried this patch and it didn't work on ehca. I couldn't change the mode from
datagram to connected through /sys/class.
Also, when unloading the ib_ipoib module, all the connections to that node were
lost and rmmod ib_ipoib hung. The kernel is 2.6.19.

Thanks
Shirley Ma

From mshefty at ichips.intel.com  Thu Dec 14 17:22:11 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 14 Dec 2006 17:22:11 -0800
Subject: [openib-general] librdmacm git repos needs config dir
In-Reply-To: <1166131727.12420.9.camel@stevo-desktop>
References: <1166131727.12420.9.camel@stevo-desktop>
Message-ID: <4581F8C3.9040203@ichips.intel.com>

> The librdmacm git repository needs a config dir or autoconf changes to
> make that dir as part of config.  I'm not an autoconf wiz, so I just
> created the config dir and put a hidden file named .gitignore in it for
> libamso. That way it's created when folks clone it.  Dunno if that's the
> best way, but it worked...

I committed a patch to the master and rdma_ucm-abi3 branches that should 
auto-create the config directory if it's not there.  I also updated the libibcm.

- Sean


From mst at mellanox.co.il  Thu Dec 14 21:07:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 15 Dec 2006 07:07:22 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <adad56mmbv1.fsf@cisco.com>
References: <adad56mmbv1.fsf@cisco.com>
Message-ID: <20061215050722.GF19449@mellanox.co.il>

> What saves us for the MTT table is that with your patch the device
> never writes to the MTT table at all.

Yes.


-- 
MST


From mst at mellanox.co.il  Thu Dec 14 21:08:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 15 Dec 2006 07:08:16 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <ada4prym9cz.fsf@cisco.com>
References: <ada4prym9cz.fsf@cisco.com>
Message-ID: <20061215050816.GG19449@mellanox.co.il>

>  > > What saves us for the MTT table is that with your patch the device
>  > > never writes to the MTT table at all.
>  > 
>  > Except for the reserved MTTs.
> 
> Good point.  So I guess we need a patch that makes sure all reserved
> MTTs are given their own ICM chunk (which doesn't need to be in
> lowmem) to fix things.

Or just round up the # of reserved MTTs to CPU cache line size.
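
A rough sketch of that rounding, so that reserved and driver-owned MTTs
never share a cache line. The function name is made up for illustration,
and it assumes the MTT table base is itself cache-line aligned; the entry
size and line size are passed in rather than taken from the driver.

```c
#include <stddef.h>

/*
 * Round the number of reserved MTT entries up to a whole number of
 * cache lines. Assumes the table starts on a cache-line boundary.
 */
static size_t round_up_reserved_mtts(size_t reserved,
				     size_t mtt_entry_size,
				     size_t cache_line_size)
{
	size_t per_line;

	if (mtt_entry_size == 0)
		return reserved;

	per_line = cache_line_size / mtt_entry_size;

	/* An entry already fills a whole line: nothing to round. */
	if (per_line <= 1)
		return reserved;

	return (reserved + per_line - 1) / per_line * per_line;
}
```

For example, with 8-byte entries and 64-byte lines (both assumed values),
17 reserved entries would be rounded up to 24.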

-- 
MST


From mst at mellanox.co.il  Thu Dec 14 21:14:38 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 15 Dec 2006 07:14:38 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <OFA1309FD2.CA0008BE-ON87257245.00047D5B-88257245.0004E47D@us.ibm.com>
References: <20061210134137.GL29174@mellanox.co.il>
	<OFA1309FD2.CA0008BE-ON87257245.00047D5B-88257245.0004E47D@us.ibm.com>
Message-ID: <20061215051438.GH19449@mellanox.co.il>

> Hi, Michael,
> 
> Tried this patch, it didn't work on ehca. I couldn't change the mode from
> datagram to connected from /sys/class.

It's working as designed in that respect.  ehca does not implement srq - without
srq, there is no way to prepost receive buffers for a reasonable number of
connections without running out of memory.

So it is falling back on datagram mode.
Talk to ehca guys to implement srq and connected mode will be enabled.
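
To put rough numbers on the memory argument, here is a back-of-the-envelope
sketch. The function name and all figures in the note below are
illustrative assumptions, not values taken from the patch.

```c
#include <stdint.h>

/*
 * Bytes of receive buffers that must be preposted when every connection
 * has its own receive queue (that is, without an SRQ).
 */
static uint64_t prepost_bytes(uint64_t connections,
			      uint64_t buffers_per_queue,
			      uint64_t buffer_size)
{
	return connections * buffers_per_queue * buffer_size;
}
```

Assuming 128 receive buffers of 64 KiB each, 1000 connections need
prepost_bytes(1000, 128, 65536) = 8388608000 bytes (about 7.8 GiB) of
preposted memory, while a single queue shared through an SRQ needs
prepost_bytes(1, 128, 65536) = 8388608 bytes (8 MiB).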

> And when unloading ib_ipoib module, all the connections to that node gone,
> rmmod ib_ipoib hung. The kernel is 2.6.19.

Probably a bug in error handling somewhere.
Post the sysrq t trace and I'll take a look.

-- 
MST


From kliteyn at dev.mellanox.co.il  Thu Dec 14 21:51:09 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Fri, 15 Dec 2006 07:51:09 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
	'hang'
In-Reply-To: <1166127103.28709.140656.camel@hal.voltaire.com>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166127103.28709.140656.camel@hal.voltaire.com>
Message-ID: <458237CD.80608@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> This patch fixes a bug that caused ucast manager to return
>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
>> Added a boolean flag that marks whether there was some change or not
>> (in which case OSM_SIGNAL_DONE should be returned).
> 
> Just wondering what is the test case for this ?

I found it while working on FatTree routing.
The problem appears when a routing engine fills all the forwarding tables
and osm_ucast_mgr_set_fwd_table() then decides that all the tables are
identical to what is already set on the switches, so there is nothing to send.
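
A minimal sketch of the idea behind the fix, not the actual OpenSM code:
every type and name below is hypothetical, and the two enum values only
mirror OSM_SIGNAL_DONE and OSM_SIGNAL_DONE_PENDING.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for OSM_SIGNAL_DONE / OSM_SIGNAL_DONE_PENDING. */
enum sketch_signal {
	SKETCH_SIGNAL_DONE,
	SKETCH_SIGNAL_DONE_PENDING
};

#define SKETCH_LFT_SIZE 8

/* Hypothetical per-switch state: table on the switch vs. table computed. */
struct sketch_switch {
	unsigned char current_lft[SKETCH_LFT_SIZE];
	unsigned char new_lft[SKETCH_LFT_SIZE];
};

/*
 * Push only the tables that changed and remember whether anything was
 * sent. Report "pending" only if at least one transaction was started.
 */
static enum sketch_signal
sketch_push_tables(struct sketch_switch *sw, size_t n)
{
	bool sent = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (memcmp(sw[i].current_lft, sw[i].new_lft,
			   SKETCH_LFT_SIZE) == 0)
			continue;	/* identical: nothing to send */

		/*
		 * A real SM would post a MAD here and update its copy when
		 * the response arrives; the sketch copies immediately.
		 */
		memcpy(sw[i].current_lft, sw[i].new_lft, SKETCH_LFT_SIZE);
		sent = true;
	}

	return sent ? SKETCH_SIGNAL_DONE_PENDING : SKETCH_SIGNAL_DONE;
}
```

Without the flag, a caller that always gets the "pending" result waits for
responses that will never come, which matches the 'hang' in the subject.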
> 
> -- Hal
> 
> 


From or.gerlitz at gmail.com  Thu Dec 14 21:57:27 2006
From: or.gerlitz at gmail.com (Or Gerlitz)
Date: Fri, 15 Dec 2006 07:57:27 +0200
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <4581C4B5.5020702@ichips.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
	<15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
	<4581C4B5.5020702@ichips.intel.com>
Message-ID: <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com>

On 12/14/06, Sean Hefty <mshefty at ichips.intel.com> wrote:
> > I see. I understand that there is some code which is part of OFED
> > (udapl) that uses this api, what were you thinking to suggest them to
> > do in the spirit of this code you have posted being the basis for OFED
> > 1.2 ?
>
> DAPL has been updated to remove its use of these calls.  The rdma cm timeout is
> essentially 1 minute now.

Cool. Before sending the original email I was looking at both Arlin's git
tree at OFA staging and the svn, and the code that uses these calls is
still there, so where are the updated udapl sources?

Or.


From mst at mellanox.co.il  Thu Dec 14 22:31:27 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 15 Dec 2006 08:31:27 +0200
Subject: [openib-general] [PATCHv2] mthca: speed up memory registration
 by filling MTTs directly
In-Reply-To: <adad56mmbv1.fsf@cisco.com>
References: <adad56mmbv1.fsf@cisco.com>
Message-ID: <20061215063127.GB27865@mellanox.co.il>

>  > > With current code firmware might be doing WRITE_MTT while CPU is writing to the
>  > > same cache line, and I expect this might confuse things, but it seems that with
>  > > my fmr/mr merge patch, we never have both CPU and firmware write to the same
>  > > MTTs entries.
>  > > 
>  > > So, assuming my patch is applied why isn't sticking pci_dma_sync_sg in FMR code
>  > > sufficient?
> 
> Yes, assuming that the CPU is the only entity ever writing to the MTT
> table, then doing pci_dma_sync_sg_for_cpu() before writing and
> pci_dma_sync_sg_for_device() afterwards should be OK.  I think.

However, for MPTs it seems the best we can do is allocate them
out of coherent memory.

-- 
MST


From philippe_bernadat at hp.com  Thu Dec 14 23:58:32 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Fri, 15 Dec 2006 08:58:32 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
	(lustre)
In-Reply-To: <20061214173145.GC12781@mellanox.co.il>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538304@idaexc03.emea.cpqcorp.net>

I have set tavor_quirk to 1 with no effect.
Another thing I have tried is the same lustre 
LNET echo test with a single thread (vs 8)

VIB:      400 MB/s
OFED-1.1: 333 MB/s

I am posting the live param values for all infiniband 
modules in case someone could identify some wrong setting:

infiniband/core/ib_cm

mra_timeout_limit              30000

infiniband/core/rdma_cm

max_cm_retries                    15
tavor_quirk                        1

infiniband/hw/ipath/ib_ipath

cfgports                           0
debug                              1
disable_sma                        0
kpiobufs                           0
lkey_table_size                   12
max_ahs                        65535
max_cqes                      196607
max_cqs                       131071
max_mcast_grps                 16384
max_mcast_qp_attached             16
max_pds                        65535
max_qps                        16384
max_qp_wrs                     16383
max_sges                          96
max_srqs                        1024
max_srq_sges                     128
max_srq_wrs                   131071
qp_table_size                    251

infiniband/hw/mthca/ib_mthca

catas_reset_disable                0
debug_level                        0
fmr_reserved_mtts             262144
fw_cmd_doorbell                    0
msi                                0
msi_x                              1
num_cq                         65536
num_mcg                         8192
num_mpt                       131072
num_mtt                      1048576
num_qp                         65536
num_udav                       32768
rdb_per_qp                         4
tune_pci                           1

infiniband/ulp/ipoib/ib_ipoib

debug_level                        0
mcast_debug_level                  0
recv_queue_size                  128
send_queue_size                   64

Philippe

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> Sent: Thursday, December 14, 2006 6:32 PM
> To: Roland Dreier
> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; 
> openib-general at openib.org
> Subject: Re: Performance Degradation with OFED v. Voltaire
> 
> >  > I think Eric described the major differences earlier on, here it is, see
> >  > second half:
> > 
> > OK, I forgot about that.
> > 
> > I guess one last thing to check would be the MTU being used for the RC
> > connections.  Since this is PCI-X HW then the MTU should be 1024 for
> > best throughput (instead of the max MTU of 2048).
> 
> The MTU issue is described in the OFED release notes.
> You must turn the Tavor work-around for it on in opensm.
> This was introduced late in the release cycle, so it was deemed safer
> to make it off by default.
> 
> By the way, Eitan, Hal, can we turn this on by default now?
> This way we'll get more feedback from people, and we'll still have
> time to turn it off before release if this unexpectedly creates issues.
> 
> -- 
> MST
> 


From philippe_bernadat at hp.com  Fri Dec 15 00:44:14 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Fri, 15 Dec 2006 09:44:14 +0100
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
	(lustre)
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net>

I also looked at the HCA counters, and I indeed think 
there is something wrong about the MTU:

For the same test

With VIB

PortXmitData:                  2684490382
PortRcvData:                      1750145
PortXmitPkts:                    10280007
PortRcvPkts:                        49962

With OFED

XmtBytes:........................2653730483
RcvBytes:........................1710541
XmtPkts:.........................5160009
RcvPkts:.........................50012

This means we sent half as many packets with OFED, and if you do the
math it comes to 2K packets with OFED and 1K packets with VIB (the data
counters are in 32-bit units).

So for some reason the tavor_quirk param is ignored/overwritten.
Is there an interface to control this?
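
For reference, the arithmetic can be sketched as below. The helper name is
made up for illustration, and it relies on the assumption stated above that
the data counters count 32-bit words.

```c
#include <stdint.h>

/*
 * Average bytes per packet from a data counter expressed in 32-bit
 * words and a packet counter. 64-bit math is used because the word
 * count times four does not fit in 32 bits for the values above.
 */
static uint64_t avg_bytes_per_packet(uint64_t data_words, uint64_t packets)
{
	if (packets == 0)
		return 0;
	return data_words * 4u / packets;
}
```

With the transmit counters above this gives 1044 bytes per packet for VIB
(2684490382 words, 10280007 packets) and 2057 bytes per packet for OFED
(2653730483 words, 5160009 packets), i.e. roughly 1K and 2K.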

Philippe



From mlleinin at hpcn.ca.sandia.gov  Fri Dec 15 02:19:28 2006
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 15 Dec 2006 02:19:28 -0800
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire	(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net>
Message-ID: <1166177968.21763.116.camel@localhost>

On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
> I also looked at the HCA counters, and I indeed think 
> there is something wrong about the MTU:
> 
> For the same test
> 
> With VIB
> 
> PortXmitData:                  2684490382
> PortRcvData:                      1750145
> PortXmitPkts:                    10280007
> PortRcvPkts:                        49962
> 
> With OFED
> 
> XmtBytes:........................2653730483
> RcvBytes:........................1710541
> XmtPkts:.........................5160009
> RcvPkts:.........................50012
> 
> This means we sent half as many packets with OFED, and if you do the
> math it comes to 2K packets with OFED and 1K packets with VIB (the data
> counters are in 32-bit units).
> 
> So for some reason the tavor_quirk param is ignored/overwritten.
> Is there an interface to control this?

  Michael said you have to turn on this feature in OpenSM.  From the
release notes I'm not sure how you turn it on in OpenSM.  You did turn
on the tavor mtu work around in the rdma_cm, but did you turn it on in
OpenSM?  Also what version of OpenSM are you running?

  Thanks,

	- Matt



From jsquyres at cisco.com  Fri Dec 15 05:17:26 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 15 Dec 2006 08:17:26 -0500
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
Message-ID: <8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com>

These names still don't appear to exist.  Do we know when they'll be  
created?


On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote:

> Who controls the DNS for openfabrics.org?  Could we get these names
> created?  Or -- are there any objections to creating / using such  
> names?
>
> Thanks!
>
>
> On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote:
>
>> The name "staging.openfabrics.org" was really intended to be
>> temporary until the old openfabrics.org was taken offline and
>> replaced with the new one.
>>
>> My $0.02 is that we should stop using staging.openfabrics.org as
>> soon as possible and create / start using some new names for the
>> server to allow for potential transparent service relocation someday.
>>
>> Here are some new name suggestions that could be done immediately
>> (with appropriate changes to DNS, apache config, ...and potentially
>> others):
>>
>>  * git.openfabrics.org: for all git activity
>>  * wiki.openfabrics.org: a top-level name for the wiki rather than
>> burying it under several layers of links on the web site
>>  * trac.openfabrics.org: if someone creates this name, I volunteer
>> to finally get off my butt and install trac to see if people like it
>>
>> These are the old names and would need to be changed in DNS only
>> when the old server is taken offline / we're ready to move to the
>> new server:
>>
>>  * openfabrics.org: redirect to www.openfabrics.org, and for mail
>> traffic
>>  * www.openfabrics.org: main web site
>>
>> -- 
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>>
>>
>
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From halr at voltaire.com  Fri Dec 15 06:33:15 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 09:33:15 -0500
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
	'hang'
In-Reply-To: <4581ACE5.9000109@dev.mellanox.co.il>
References: <4581ACE5.9000109@dev.mellanox.co.il>
Message-ID: <1166193153.28709.186595.camel@hal.voltaire.com>

Hi again Yevgeny,

On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> This patch fixes a bug that caused ucast manager to return
> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> Added a boolean flag that marks whether there was some change or not
> (in which case OSM_SIGNAL_DONE should be returned).
> 
> --
> Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Good catch!

Thanks. Applied.

Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?

-- Hal


From halr at voltaire.com  Fri Dec 15 07:28:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 10:28:27 -0500
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine
	[1/2]
In-Reply-To: <4581DDEF.7000206@dev.mellanox.co.il>
References: <4581DDEF.7000206@dev.mellanox.co.il>
Message-ID: <1166196463.28709.188818.camel@hal.voltaire.com>

On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> This patch (1/2) adds Fat Tree routing engine to OpenSM.
> 
> --
> Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/opensm/Makefile.am  |    2 +-
>  osm/opensm/main.c       |    3 ++-
>  osm/opensm/osm_opensm.c |    2 ++
>  3 files changed, 5 insertions(+), 2 deletions(-)

Thanks. Applied.

Note that these patches were in the reverse order.

-- Hal


From halr at voltaire.com  Fri Dec 15 07:36:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 10:36:10 -0500
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine
	[2/2]
In-Reply-To: <4581DDFF.2000903@dev.mellanox.co.il>
References: <4581DDFF.2000903@dev.mellanox.co.il>
Message-ID: <1166196836.28709.188922.camel@hal.voltaire.com>

Hi Yevgeny,

On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> This patch (2/2) adds Fat Tree routing engine to OpenSM.

Thanks! Applied.

I played with it a little and will look more at it going forward.

A couple of questions:

Is this algorithm currently considered experimental ?

Are there any simulator tests/regressions for this ?

Also, could you or Eitan update doc/current-routing.txt with a
description of the fat tree algorithm and send that patch to me ?

-- Hal


From dotanb at dev.mellanox.co.il  Fri Dec 15 08:25:20 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Fri, 15 Dec 2006 18:25:20 +0200 (IST)
Subject: [openib-general] can i use the multicast module in user level?
Message-ID: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il>

Hi Sean.

I would like to use the multicast module in user-level tests (in order to
send a join message to the multicast groups that I'm using).

Can I use the multicast module at user level?
(If the answer is yes, is there any code reference that I can use?)


thanks
Dotan


From eitan at mellanox.co.il  Fri Dec 15 09:04:08 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:04:08 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <1166193153.28709.186595.camel@hal.voltaire.com>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166193153.28709.186595.camel@hal.voltaire.com>
Message-ID: <4582D588.2070506@mellanox.co.il>

Hal Rosenstock wrote:
> Hi again Yevgeny,
>
> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
>   
>> Hi Hal
>>
>> This patch fixes a bug that caused ucast manager to return
>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
>> Added a boolean flag that marks whether there was some change or not
>> (in which case OSM_SIGNAL_DONE should be returned).
>>
>> --
>> Yevgeny
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>     
>
> Good catch!
>
> Thanks. Applied.
>
> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
>   
I think OFED 1.1 does not have the "incremental" routing patch. So it 
does not have this bug.

EZ
> -- Hal


From xma at us.ibm.com  Fri Dec 15 09:06:03 2006
From: xma at us.ibm.com (Shirley Ma)
Date: Fri, 15 Dec 2006 09:06:03 -0800
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061215051438.GH19449@mellanox.co.il>
Message-ID: <OFC0CE579C.BE8985BF-ON87257245.005D816F-88257245.005DEFD2@us.ibm.com>


"Michael S. Tsirkin" <mst at mellanox.co.il> wrote on 12/14/2006 09:14:38 PM:

> > Hi, Michael,
> >
> > Tried this patch, it didn't work on ehca. I couldn't change the mode
> > from datagram to connected from /sys/class.
>
> It's working as designed in that respect.  ehca does not implement srq;
> without srq, there is no way to prepost receive buffers for a reasonable
> number of connections without running out of memory.
>
> So it is falling back on datagram mode.
> Talk to ehca guys to implement srq and connected mode will be enabled.

I don't remember SRQ being a MUST for UC mode. Does this patch support devices
with SRQ in RC mode?

> > And when unloading ib_ipoib module, all the connections to that node
> > gone, rmmod ib_ipoib hung. The kernel is 2.6.19.
>
> Probably a bug in error handling somewhere.
> Post the sysrq t trace and I'll take a look.

I will recreate the problem and post stack trace later.

Thanks
Shirley Ma

From mlleinin at hpcn.ca.sandia.gov  Fri Dec 15 09:15:24 2006
From: mlleinin at hpcn.ca.sandia.gov (Matt Leininger)
Date: Fri, 15 Dec 2006 09:15:24 -0800
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com>
References: <18010248-A970-470B-B92C-592E16820CBA@cisco.com>
	<2B638F09-C037-4343-9A0F-A5A45AD34121@cisco.com>
	<8916AC51-131D-4AE5-A630-E72E5E3A90C1@cisco.com>
Message-ID: <1166202924.21763.124.camel@localhost>

On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote:
> These names still don't appear to exist.  Do we know when they'll be  
> created?

  Intel controls the openfabrics.org domain name.  I think Jim or
Michael can make this happen.

  - Matt

> 
> 
> On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote:
> 
> > Who controls the DNS for openfabrics.org?  Could we get these names
> > created?  Or -- are there any objections to creating / using such  
> > names?
> >
> > Thanks!
> >
> >
> > On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote:
> >
> >> The name "staging.openfabrics.org" was really intended to be
> >> temporary until the old openfabrics.org was taken offline and
> >> replaced with the new one.
> >>
> >> My $0.02 is that we should stop using staging.openfabrics.org as
> >> soon as possible and create / start using some new names for the
> >> server to allow for potential transparent service relocation someday.
> >>
> >> Here are some new name suggestions that could be done immediately
> >> (with appropriate changes to DNS, apache config, ...and potentially
> >> others):
> >>
> >>  * git.openfabrics.org: for all git activity
> >>  * wiki.openfabrics.org: a top-level name for the wiki rather than
> >> burying it under several layers of links on the web site
> >>  * trac.openfabrics.org: if someone creates this name, I volunteer
> >> to finally get off my butt and install trac to see if people like it
> >>
> >> These are the old names and would need to be changed in DNS only
> >> when the old server is taken offline / we're ready to move to the
> >> new server:
> >>
> >>  * openfabrics.org: redirect to www.openfabrics.org, and for mail
> >> traffic
> >>  * www.openfabrics.org: main web site
> >>
> >> -- 
> >> Jeff Squyres
> >> Server Virtualization Business Unit
> >> Cisco Systems
> >>
> >>
> >
> >
> > -- 
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
> >
> >
> 
> 


From eitan at sw053.yok.mtl.com  Fri Dec 15 09:10:58 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:10:58 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-15:normal completion
Message-ID: <200612151710.kBFHAw1V004597@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = ____  
ibutils rev = ____  
Total=198 Pass=198 Fail=0

Pass:
27 Stability IS1-16.topo
27 Pkey IS1-16.topo
27 OsmStress IS1-16.topo
27 Multicast IS1-16.topo
27 LidMgr IS1-16.topo
9 Stability IS3-loop.topo
9 Stability IS3-128.topo
9 Pkey IS3-128.topo
9 OsmStress IS3-128.topo
9 Multicast IS3-loop.topo
9 Multicast IS3-128.topo
9 LidMgr IS3-128.topo

Failures:


From jim.ryan at intel.com  Fri Dec 15 09:17:47 2006
From: jim.ryan at intel.com (Ryan, Jim)
Date: Fri, 15 Dec 2006 09:17:47 -0800
Subject: [openib-general] <new>.openfabrics.org names
Message-ID: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com>

Michael has done this in the past but he's on sabbatical and unavailable
for several weeks. Can someone else do this?

Thanks, Jim

-----Original Message-----
From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov] 
Sent: Friday, December 15, 2006 9:15 AM
To: Jeff Squyres
Cc: openib; Ryan, Jim; Oros, Michael
Subject: Re: [openib-general] <new>.openfabrics.org names

On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote:
> These names still don't appear to exist.  Do we know when they'll be  
> created?

  Intel controls the openfabrics.org domain name.  I think Jim or
Michael can make this happen.

  - Matt

> 
> 
> On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote:
> 
> > Who controls the DNS for openfabrics.org?  Could we get these names
> > created?  Or -- are there any objections to creating / using such  
> > names?
> >
> > Thanks!
> >
> >
> > On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote:
> >
> >> The name "staging.openfabrics.org" was really intended to be
> >> temporary until the old openfabrics.org was taken offline and
> >> replaced with the new one.
> >>
> >> My $0.02 is that we should stop using staging.openfabrics.org as
> >> soon as possible and create / start using some new names for the
> >> server to allow for potential transparent service relocation
> >> someday.
> >>
> >> Here are some new name suggestions that could be done immediately
> >> (with appropriate changes to DNS, apache config, ...and potentially
> >> others):
> >>
> >>  * git.openfabrics.org: for all git activity
> >>  * wiki.openfabrics.org: a top-level name for the wiki rather than
> >> burying it under several layers of links on the web site
> >>  * trac.openfabrics.org: if someone creates this name, I volunteer
> >> to finally get off my butt and install trac to see if people like
> >> it
> >>
> >> These are the old names and would need to be changed in DNS only
> >> when the old server is taken offline / we're ready to move to the
> >> new server:
> >>
> >>  * openfabrics.org: redirect to www.openfabrics.org, and for mail
> >> traffic
> >>  * www.openfabrics.org: main web site
> >>
> >> -- 
> >> Jeff Squyres
> >> Server Virtualization Business Unit
> >> Cisco Systems
> >>
> >>
> >
> >
> > -- 
> > Jeff Squyres
> > Server Virtualization Business Unit
> > Cisco Systems
> >
> >
> 
> 


From eitan at mellanox.co.il  Fri Dec 15 09:20:03 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:20:03 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
 (lustre)
In-Reply-To: <1166177968.21763.116.camel@localhost>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net>
	<1166177968.21763.116.camel@localhost>
Message-ID: <4582D943.2080403@mellanox.co.il>

Matt Leininger wrote:
> On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
>   
>> I also looked at the HCA counters, and I indeed think 
>> there is something wrong about the MTU:
>>
>> For the same test
>>
>> With VIB
>>
>> PortXmitData:                  2684490382
>> PortRcvData:                      1750145
>> PortXmitPkts:                    10280007
>> PortRcvPkts:                        49962
>>
>> With OFED
>>
>> XmtBytes:........................2653730483
>> RcvBytes:........................1710541
>> XmtPkts:.........................5160009
>> RcvPkts:.........................50012
>>
>> Which means we sent half as many packets with OFED,
>> and if you do the math it is 2K packets with OFED (counters are in 32-bit
>> units)
>> and 1K packets with VIB.
>>
>> So for some reason the tavor_quirk param is ignored/overwritten.
>> Is there an interface to control this?
>>     
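The 2K versus 1K estimate quoted above can be checked with a few lines of
arithmetic. This is only a sketch: it assumes, as the message states, that
both data counters count 32-bit (4-byte) units, and the function name is
made up for illustration.

```python
def avg_bytes_per_packet(xmit_data_units, xmit_pkts, bytes_per_unit=4):
    """Average number of bytes per transmitted packet.

    xmit_data_units is the raw data counter; bytes_per_unit=4 reflects the
    assumption that the counter is in 32-bit units.
    """
    return xmit_data_units * bytes_per_unit / xmit_pkts


# Counter values copied from the quoted message.
vib = avg_bytes_per_packet(2684490382, 10280007)
ofed = avg_bytes_per_packet(2653730483, 5160009)

print(f"VIB:  {vib:.0f} bytes/packet")
print(f"OFED: {ofed:.0f} bytes/packet")
```

Both averages land slightly above 1024 and 2048 bytes respectively, which is
consistent with a 1K MTU under VIB and a 2K MTU under OFED.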
>
>   Michael said you have to turn on this feature in OpenSM.  From the
> release notes I'm not sure how you turn it on in OpenSM.  You did turn
> on the tavor mtu work around in the rdma_cm, but did you turn it on in
> OpenSM?  Also what version of OpenSM are you running?
>   
To turn this option on in opensm you need to:
1. Run: opensm -c -o
2. Modify the file /var/cache/osm/opensm.opts by changing the line below
enable_quirks FALSE
to
enable_quirks TRUE

3. Run: opensm
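
Step 2 could also be scripted. The sketch below only rewrites the text of the
options file; it assumes the "name value" line format shown above, and the
helper name and the sample contents are made up for illustration. Reading and
writing /var/cache/osm/opensm.opts itself is left to the caller.

```python
def enable_quirks(opts_text):
    """Return opts_text with the enable_quirks option set to TRUE.

    Every other line is passed through unchanged.
    """
    out = []
    for line in opts_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "enable_quirks":
            line = "enable_quirks TRUE"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt; some_other_option is not a real OpenSM option name.
sample = "some_other_option 1\nenable_quirks FALSE\n"
print(enable_quirks(sample), end="")
```
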
>   Thanks,
>
> 	- Matt
>
>   
>> Philippe
>>
>>     
>>> -----Original Message-----
>>> From: Bernadat, Philippe 
>>> Sent: Friday, December 15, 2006 8:59 AM
>>> To: Michael S. Tsirkin; Roland Dreier
>>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org
>>> Subject: RE: Performance Degradation with OFED v. Voltaire (lustre)
>>>
>>> I have set tavor_quirk to 1 with no effect.
>>> Another thing I have tried is the same lustre 
>>> LNET echo test with a single thread (vs 8)
>>>
>>> VIB:      400 MB/s
>>> OFED-1.1: 333 MB/s
>>>
>>> I am posting the live param values for all infiniband 
>>> modules in case someone could identify some wrong setting:
>>>
>>> infiniband/core/ib_cm
>>>
>>> mra_timeout_limit              30000
>>>
>>> infiniband/core/rdma_cm
>>>
>>> max_cm_retries                    15
>>> tavor_quirk                        1
>>>
>>> infiniband/hw/ipath/ib_ipath
>>>
>>> cfgports                           0
>>> debug                              1
>>> disable_sma                        0
>>> kpiobufs                           0
>>> lkey_table_size                   12
>>> max_ahs                        65535
>>> max_cqes                      196607
>>> max_cqs                       131071
>>> max_mcast_grps                 16384
>>> max_mcast_qp_attached             16
>>> max_pds                        65535
>>> max_qps                        16384
>>> max_qp_wrs                     16383
>>> max_sges                          96
>>> max_srqs                        1024
>>> max_srq_sges                     128
>>> max_srq_wrs                   131071
>>> qp_table_size                    251
>>>
>>> infiniband/hw/mthca/ib_mthca
>>>
>>> catas_reset_disable                0
>>> debug_level                        0
>>> fmr_reserved_mtts             262144
>>> fw_cmd_doorbell                    0
>>> msi                                0
>>> msi_x                              1
>>> num_cq                         65536
>>> num_mcg                         8192
>>> num_mpt                       131072
>>> num_mtt                      1048576
>>> num_qp                         65536
>>> num_udav                       32768
>>> rdb_per_qp                         4
>>> tune_pci                           1
>>>
>>> infiniband/ulp/ipoib/ib_ipoib
>>>
>>> debug_level                        0
>>> mcast_debug_level                  0
>>> recv_queue_size                  128
>>> send_queue_size                   64
>>>
>>> Philippe
>>>
>>>       
>>>> -----Original Message-----
>>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
>>>> Sent: Thursday, December 14, 2006 6:32 PM
>>>> To: Roland Dreier
>>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; 
>>>> openib-general at openib.org
>>>> Subject: Re: Performance Degradation with OFED v. Voltaire
>>>>
>>>>         
>>>>>  > I think Eric described the major differences earlier on, 
>>>>>           
>>>> here it is, see
>>>>         
>>>>>  > second half:
>>>>>
>>>>> OK, I forgot about that.
>>>>>
>>>>> I guess one last thing to check would be the MTU being used 
>>>>>           
>>>> for the RC
>>>>         
>>>>> connections.  Since this is PCI-X HW then the MTU should 
>>>>>           
>>> be 1024 for
>>>       
>>>>> best throughput (instead of the max MTU of 2048).
>>>>>           
>>>> The MTU issue is described in the OFED release notes.
>>>> You must turn the Tavor work-around for it on in opensm.
>>>> This was introduced late in release cycle so it was deemed safer
>>>> to make it off by default.
>>>>
>>>> By the way, Eitan, Hal, can we turn this on by default now?
>>>> This way we'll get more feedback from people, and we'll still have
>>>> time to turn it off before release if this unexpectedly
>>>> creates issues.
>>>>
>>>> -- 
>>>> MST
>>>>
>>>>         


From eitan at mellanox.co.il  Fri Dec 15 09:30:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 19:30:28 +0200
Subject: [openib-general] libsdp: RFC changing libsdp.conf location
In-Reply-To: <457D3269.3070401@mellanox.co.il>
References: <457D2A21.9030804@mellanox.co.il>
	<20061211102222.GB5944@mellanox.co.il>
	<457D3269.3070401@mellanox.co.il>
Message-ID: <4582DBB4.80605@mellanox.co.il>

Hi Roland, Scott, Nimrod, MST,

Thanks for your feedback on the issue over the last week.
What I plan to do:
1. Move the default location to /etc/libsdp.conf
2. Mark the file with %config so it is not overwritten by the RPM install
3. Change "make install" so that it does not overwrite the file; if a file
already exists, it creates a file named /etc/libsdp.conf.example instead
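
The install behaviour planned in item 3 could look roughly like the sketch
below. It is not the actual makefile logic: the function name and the source
file name are made up, and the demonstration runs in a scratch directory
rather than the real /etc.

```python
import os
import shutil
import tempfile


def install_conf(src, etc_dir):
    """Install src as libsdp.conf, or as libsdp.conf.example if one exists.

    Returns the path that was written; an existing libsdp.conf is never
    overwritten.
    """
    target = os.path.join(etc_dir, "libsdp.conf")
    if os.path.exists(target):
        target += ".example"
    shutil.copyfile(src, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "libsdp.conf.dist")  # hypothetical source file name
    with open(src, "w") as f:
        f.write("# example libsdp configuration\n")
    etc_dir = os.path.join(tmp, "etc")
    os.mkdir(etc_dir)
    first = install_conf(src, etc_dir)   # no file yet: writes libsdp.conf
    second = install_conf(src, etc_dir)  # file exists: writes libsdp.conf.example
    print(os.path.basename(first), os.path.basename(second))
```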

Eitan

Eitan Zahavi wrote:
> Hi Michael,
>
> Thanks. This proposal is simple and clear to me.
> Let's wait a day and see if anybody else have other ideas.
>
> Thanks
>
> Eitan
>
> Michael S. Tsirkin wrote:
>   
>>> BTW: libsdp.conf used to be overwritten in previous install.
>>> I have fixed the makefile to avoid that and instead create a
>>> new file with install date under the same directory.
>>>     
>>>       
>> Here's a simple proposal that will address this issue:
>> - Make libsdp behave sanely when no libsdp.conf file is present.
>>   Do not install anything in default location in make install.
>>
>> - in make install, copy the example configuration file into
>>   libsdp.conf.example. Add a line to the top of it saying
>>   "rename this file to libsdp.conf to make libsdp use it".
>>
>>   
>>     
>
>


From mshefty at ichips.intel.com  Fri Dec 15 09:37:59 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 15 Dec 2006 09:37:59 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
	<15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
	<4581C4B5.5020702@ichips.intel.com>
	<15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com>
Message-ID: <4582DD77.8090208@ichips.intel.com>

> cool, before sending the orig email i was looking at both Arlin's git
> tree at ofa staging and the svn, and the code that uses these calls is
> still there, so where are the updated udapl sources?

Arlin's DAPL tree has an rdma_ucm branch that should match.

- Sean


From mshefty at ichips.intel.com  Fri Dec 15 09:43:41 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 15 Dec 2006 09:43:41 -0800
Subject: [openib-general] can i use the multicast module in user level?
In-Reply-To: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il>
References: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il>
Message-ID: <4582DECD.70301@ichips.intel.com>

> I would like to use the multicast module in user level tests (in order to
> send a join message to the multicast groups that I'm using).
> 
> Can I use the multicast module in user level?
> (if the answer is yes, is there is any code reference that I can use?)

Multicast support has only been exposed to userspace through the librdmacm. 
There's a mckey test app that shows how this can be used.

I will be working on raw IB multicast / InformInfo userspace support through
January.  There is an older userspace SA library that you might be able to play
with as well, but you'd have to look back through the mail logs to find the
patches.

- Sean


From halr at voltaire.com  Fri Dec 15 10:47:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 13:47:26 -0500
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <4582D588.2070506@mellanox.co.il>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166193153.28709.186595.camel@hal.voltaire.com>
	<4582D588.2070506@mellanox.co.il>
Message-ID: <1166208365.28709.195843.camel@hal.voltaire.com>

On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi again Yevgeny,
> >
> > On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
> >   
> >> Hi Hal
> >>
> >> This patch fixes a bug that caused ucast manager to return
> >> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> >> Added a boolean flag that marks whether there was some change or not
> >> (in which case OSM_SIGNAL_DONE should be returned).
> >>
> >> --
> >> Yevgeny
> >>
> >> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> >>     
> >
> > Good catch!
> >
> > Thanks. Applied.
> >
> > Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
> >   
> I think OFED 1.1 does not have the "incremental" routing patch.

Right; it doesn't.

> So it does not have this bug.

Are you sure that the incremental routing caused this to be needed ? By
any chance, are you confusing this with a different patch ? Just want to
be clear on this...

-- Hal

> EZ
> > -- Hal
> >
> >
> 


From wombat2 at us.ibm.com  Fri Dec 15 11:01:37 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Fri, 15 Dec 2006 14:01:37 -0500
Subject: [openib-general] Fw: openib-general Digest, Vol 30, Issue 135
Message-ID: <OF163DE11D.BB337468-ON85257245.0067968E-85257245.00688586@us.ibm.com>

> ----- Message from "Shirley Ma" <xma at us.ibm.com> on Fri, 15 Dec 2006
> 09:06:03 -0800 -----
>
> To: "Michael S. Tsirkin" <mst at mellanox.co.il>
> cc: openib-general at openib.org
> Subject: Re: [openib-general] [PATCHv2] IPoIB CM Experimental support
>
> "Michael S. Tsirkin" <mst at mellanox.co.il> wrote on 12/14/2006 09:14:38 PM:
>
> > > Hi, Michael,
> > >
> > > Tried this patch, it didn't work on ehca. I couldn't change the mode
> > > from datagram to connected from /sys/class.
> >
> > It's working as designed in that respect.  ehca does not implement srq;
> > without srq, there is no way to prepost receive buffers for a reasonable
> > number of connections without running out of memory.
> >
> > So it is falling back on datagram mode.
> > Talk to ehca guys to implement srq and connected mode will be enabled.
> I don't remember SRQ being a MUST for UC mode. Does this patch support
> devices with SRQ in RC mode?

I don't think the IB HCA spec requires SRQ support for RC; it is an
optional feature. Two adapters right now don't support SRQ, which means
that to use IPoIB-CM on them the use of SRQ should be an optional
setting. I agree that SRQ should be used when it is available, for
scaling reasons, and probably enabled automatically in that case. But I
would like to see us at least support the current hardware that meets
the current spec.
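
The memory concern quoted above (without SRQ, receive buffers have to be
preposted per connection) can be put into rough numbers. All three inputs
below are assumptions chosen for illustration only; they are not taken from
the patch or from any adapter's limits.

```python
def recv_buffer_bytes(connections, queue_depth, buf_size, shared):
    """Bytes of preposted receive buffers.

    With shared=True a single receive queue serves every connection;
    otherwise each connection gets its own queue of queue_depth buffers.
    """
    queues = 1 if shared else connections
    return queues * queue_depth * buf_size


# Assumed values, for illustration only.
CONNECTIONS = 1000
QUEUE_DEPTH = 128
BUF_SIZE = 64 * 1024

per_conn = recv_buffer_bytes(CONNECTIONS, QUEUE_DEPTH, BUF_SIZE, shared=False)
with_srq = recv_buffer_bytes(CONNECTIONS, QUEUE_DEPTH, BUF_SIZE, shared=True)

print(per_conn // 2**20, "MiB without SRQ")
print(with_srq // 2**20, "MiB with SRQ")
```

A real shared queue would likely be sized deeper than a single per-connection
queue, so the gap would be smaller in practice, but the per-connection cost
still grows linearly with the number of connections.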

Bernie King-Smith 
IBM Corporation
Server Group
Cluster System Performance 
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES 

"We are not responsible for the world we are born into, only for the world 
we leave when we die.
So we have to accept what has gone before us and work to change the only 
thing we can,
-- The Future." William Shatner

From robert.j.woodruff at intel.com  Fri Dec 15 11:06:43 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 15 Dec 2006 11:06:43 -0800
Subject: [openib-general] OpenSM core dump - file size exceeded
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C01580B41@orsmsx418.amr.corp.intel.com>

My OpenSM, from the git tree pulled on 12/12/06, died with the following
error. It looks like the log file reached 2G and then it died.


[root at iclust-2 RPMS]# ps -aux | grep opensm
Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.3/FAQ
root     20256  0.0  0.1  5408  656 pts/4    S+   12:05   0:00 grep opensm
[1]+  File size limit exceeded(core dumped) /usr/local/bin/opensm
[root at iclust-2 RPMS]# ls -l /var/log/osm.log 
-rw-r--r--  1 root root 2147483647 Dec 14 17:25 /var/log/osm.log
[root at iclust-2 RPMS]# 
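
The size reported by ls above is exactly the largest value a signed 32-bit
file offset can hold, as the short check below shows. One plausible reading,
which is an assumption and not something confirmed here, is that the log file
was being written without large-file support.

```python
LIMIT = 2**31 - 1        # largest value of a signed 32-bit file offset
log_size = 2147483647    # size of /var/log/osm.log from the ls output above

print(log_size == LIMIT)
print(f"{log_size / 2**30:.6f} GiB")
```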


woody


From halr at voltaire.com  Fri Dec 15 11:14:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 14:14:58 -0500
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
 (lustre)
In-Reply-To: <4582D943.2080403@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05538379@idaexc03.emea.cpqcorp.net>
	<1166177968.21763.116.camel@localhost>
	<4582D943.2080403@mellanox.co.il>
Message-ID: <1166210069.28709.196688.camel@hal.voltaire.com>

On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote:
> Matt Leininger wrote:
> > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
> >   
> >> I also looked at the HCA counters, and I indeed think 
> >> there is something wrong about the MTU:
> >>
> >> For the same test
> >>
> >> With VIB
> >>
> >> PortXmitData:                  2684490382
> >> PortRcvData:                      1750145
> >> PortXmitPkts:                    10280007
> >> PortRcvPkts:                        49962
> >>
> >> With OFED
> >>
> >> XmtBytes:........................2653730483
> >> RcvBytes:........................1710541
> >> XmtPkts:.........................5160009
> >> RcvPkts:.........................50012
> >>
> >> Which means we sent half as many packets with OFED,
> >> and if you do the math it is 2K packets with OFED (counters are in 32-bit
> >> units)
> >> and 1K packets with VIB.
> >>
> >> So for some reason the tavor_quirk param is ignored/overwritten.
> >> Is there an interface to control this?
> >>     
> >
> >   Michael said you have to turn on this feature in OpenSM.  From the
> > release notes I'm not sure how you turn it on in OpenSM.  You did turn
> > on the tavor mtu work around in the rdma_cm, but did you turn it on in
> > OpenSM?  Also what version of OpenSM are you running?
> >   
> To turn this option on in opensm you need to:
> 1. Run: opensm -c -o

If you already have an opensm.opts file then you can skip this step.

-- Hal

> 2. Modify the file /var/cache/osm/opensm.opts by changing the line below
> enable_quirks FALSE
> to
> enable_quirks TRUE
> 
> 3. Run: opensm
> >   Thanks,
> >
> > 	- Matt
> >
> >   
> >> Philippe
> >>
> >>     
> >>> -----Original Message-----
> >>> From: Bernadat, Philippe 
> >>> Sent: Friday, December 15, 2006 8:59 AM
> >>> To: Michael S. Tsirkin; Roland Dreier
> >>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org
> >>> Subject: RE: Performance Degradation with OFED v. Voltaire (lustre)
> >>>
> >>> I have set tavor_quirk to 1 with no effect.
> >>> Another thing I have tried is the same lustre 
> >>> LNET echo test with a single thread (vs 8)
> >>>
> >>> VIB:      400 MB/s
> >>> OFED-1.1: 333 MB/s
> >>>
> >>> I am posting the live param values for all infiniband 
> >>> modules in case someone could identify some wrong setting:
> >>>
> >>> infiniband/core/ib_cm
> >>>
> >>> mra_timeout_limit              30000
> >>>
> >>> infiniband/core/rdma_cm
> >>>
> >>> max_cm_retries                    15
> >>> tavor_quirk                        1
> >>>
> >>> infiniband/hw/ipath/ib_ipath
> >>>
> >>> cfgports                           0
> >>> debug                              1
> >>> disable_sma                        0
> >>> kpiobufs                           0
> >>> lkey_table_size                   12
> >>> max_ahs                        65535
> >>> max_cqes                      196607
> >>> max_cqs                       131071
> >>> max_mcast_grps                 16384
> >>> max_mcast_qp_attached             16
> >>> max_pds                        65535
> >>> max_qps                        16384
> >>> max_qp_wrs                     16383
> >>> max_sges                          96
> >>> max_srqs                        1024
> >>> max_srq_sges                     128
> >>> max_srq_wrs                   131071
> >>> qp_table_size                    251
> >>>
> >>> infiniband/hw/mthca/ib_mthca
> >>>
> >>> catas_reset_disable                0
> >>> debug_level                        0
> >>> fmr_reserved_mtts             262144
> >>> fw_cmd_doorbell                    0
> >>> msi                                0
> >>> msi_x                              1
> >>> num_cq                         65536
> >>> num_mcg                         8192
> >>> num_mpt                       131072
> >>> num_mtt                      1048576
> >>> num_qp                         65536
> >>> num_udav                       32768
> >>> rdb_per_qp                         4
> >>> tune_pci                           1
> >>>
> >>> infiniband/ulp/ipoib/ib_ipoib
> >>>
> >>> debug_level                        0
> >>> mcast_debug_level                  0
> >>> recv_queue_size                  128
> >>> send_queue_size                   64
> >>>
> >>> Philippe
> >>>
> >>>       
> >>>> -----Original Message-----
> >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> >>>> Sent: Thursday, December 14, 2006 6:32 PM
> >>>> To: Roland Dreier
> >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; 
> >>>> openib-general at openib.org
> >>>> Subject: Re: Performance Degradation with OFED v. Voltaire
> >>>>
> >>>>         
> >>>>>  > I think Eric described the major differences earlier on, 
> >>>>>           
> >>>> here it is, see
> >>>>         
> >>>>>  > second half:
> >>>>>
> >>>>> OK, I forgot about that.
> >>>>>
> >>>>> I guess one last thing to check would be the MTU being used 
> >>>>>           
> >>>> for the RC
> >>>>         
> >>>>> connections.  Since this is PCI-X HW then the MTU should 
> >>>>>           
> >>> be 1024 for
> >>>       
> >>>>> best throughput (instead of the max MTU of 2048).
> >>>>>           
> >>>> The MTU issue is described in the OFED release notes.
> >>>> You must turn the Tavor work-around for it on in opensm.
> >>>> This was introduced late in release cycle so it was deemed safer
> >>>> to make it off by default.
> >>>>
> >>>> By the way, Eitan, Hal, can we turn this on by default now?
> >>>> This way we'll get more feedback from people, and we'll still have
> >>>> time to turn it off before release if this unexpectedly
> >>>> creates issues.
> >>>>
> >>>> -- 
> >>>> MST
> >>>>
> >>>>         


From ardavis at ichips.intel.com  Fri Dec 15 11:30:58 2006
From: ardavis at ichips.intel.com (Arlin Davis)
Date: Fri, 15 Dec 2006 11:30:58 -0800
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <4581C4B5.5020702@ichips.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
	<15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
	<4581C4B5.5020702@ichips.intel.com>
Message-ID: <4582F7F2.8040305@ichips.intel.com>

Sean Hefty wrote:

>>I see. I understand that there is some code which is part of OFED
>>(udapl) that uses this api, what were you thinking to suggest them to
>>do in the spirit of this code you have posted being the basis for OFED
>>1.2 ?
>>    
>>
>
>DAPL has been updated to remove its use of these calls.  The rdma cm timeout is 
>essentially 1 minute now.  If needed a kernel fix can be applied to send an MRA 
>to increase the timeout, but I'm holding off on doing that unless it's really 
>needed.
>  
>
Not sure if one size fits all. Is one minute sufficient? Can you at
least provide module parameters that can override your defaults when the
driver loads? It would be nice to have some control over extending the
accept times if necessary. Maybe something at listen time that could
indicate the need to send the MRA with a backoff time?

-arlin



From mshefty at ichips.intel.com  Fri Dec 15 11:46:31 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Fri, 15 Dec 2006 11:46:31 -0800
Subject: [openib-general] OpenSM core dump - file size exceeded
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C01580B41@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C01580B41@orsmsx418.amr.corp.intel.com>
Message-ID: <4582FB97.6010304@ichips.intel.com>

> [root at iclust-2 RPMS]# ps -aux | grep opensm
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root     20256  0.0  0.1  5408  656 pts/4    S+   12:05   0:00 grep
> opensm
> [1]+  File size limit exceeded(core dumped) /usr/local/bin/opensm
> [root at iclust-2 RPMS]# ls -l /var/log/osm.log 
> -rw-r--r--  1 root root 2147483647 Dec 14 17:25 /var/log/osm.log
> [root at iclust-2 RPMS]# 

Looking at the log file, the problem appears to be related to:

http://openib.org/pipermail/openib-general/2006-December/029962.html

I'm still trying to discover more details.

- Sean


From eitan at mellanox.co.il  Fri Dec 15 12:03:42 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 22:03:42 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <1166208365.28709.195843.camel@hal.voltaire.com>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166193153.28709.186595.camel@hal.voltaire.com>
	<4582D588.2070506@mellanox.co.il>
	<1166208365.28709.195843.camel@hal.voltaire.com>
Message-ID: <4582FF9E.3040901@mellanox.co.il>

Hal Rosenstock wrote:
> On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote:
>   
>> Hal Rosenstock wrote:
>>     
>>> Hi again Yevgeny,
>>>
>>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
>>>   
>>>       
>>>> Hi Hal
>>>>
>>>> This patch fixes a bug that caused ucast manager to return
>>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
>>>> Added a boolean flag that marks whether there was some change or not
>>>> (in which case OSM_SIGNAL_DONE should be returned).
>>>>
>>>> --
>>>> Yevgeny
>>>>
>>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>>>     
>>>>         
>>> Good catch!
>>>
>>> Thanks. Applied.
>>>
>>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
>>>   
>>>       
>> I think OFED 1.1 does not have the "incremental" routing patch.
>>     
>
> Right; it doesn't.
>
>   
>> So it does not have this bug.
>>     
>
> Are you sure that the incremental routing caused this to be needed ? By
> any chance, are you confusing this with a different patch ? Just want to
> be clear on this...
>   
Yes, I am sure. Without the new incremental feature, all LFT tables were
set on every sweep.
EZ


From halr at voltaire.com  Fri Dec 15 12:44:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 15:44:03 -0500
Subject: [openib-general] OpenSM core dump - file size exceeded
In-Reply-To: <BAE9DCEF64577A439B3A37F36F9B691C01580B41@orsmsx418.amr.corp.intel.com>
References: <BAE9DCEF64577A439B3A37F36F9B691C01580B41@orsmsx418.amr.corp.intel.com>
Message-ID: <1166215361.28709.199852.camel@hal.voltaire.com>

On Fri, 2006-12-15 at 14:06, Woodruff, Robert J wrote:
> My OpenSM, from the git tree pulled on 12/12/06 died with the following
> error,
> looks like the log file got > 2G and then it died.
> 
> 
> [root at iclust-2 RPMS]# ps -aux | grep opensm
> Warning: bad syntax, perhaps a bogus '-'? See
> /usr/share/doc/procps-3.2.3/FAQ
> root     20256  0.0  0.1  5408  656 pts/4    S+   12:05   0:00 grep
> opensm
> [1]+  File size limit exceeded(core dumped) /usr/local/bin/opensm
> [root at iclust-2 RPMS]# ls -l /var/log/osm.log 
> -rw-r--r--  1 root root 2147483647 Dec 14 17:25 /var/log/osm.log
> [root at iclust-2 RPMS]# 

Any idea what filled up the log? That's a side issue, though.

This has been discussed on the list before. This is one option which can
help with this issue:

        -L, --log_limit <size in MB>
              This  option defines maximal log file size in MB. When specified
              the log file will be truncated upon reaching this limit.
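
For what it's worth, the 2147483647-byte size in the listing above is 2^31 - 1, the largest offset a signed 32-bit off_t can hold. That fits a process writing the log without large file support, though this is only an inference from the number. Below is a minimal sketch of the size check that a limit given in MB implies. The name `log_limit_reached`, its parameters, and the choice of 1 MB = 1024 * 1024 bytes are assumptions for illustration, not OpenSM code.

```c
#include <stdint.h>

/* Largest file offset representable in a signed 32-bit off_t: 2^31 - 1. */
#define MAX_32BIT_FILE_SIZE INT32_MAX

/*
 * Hypothetical helper (not taken from OpenSM): returns 1 when writing
 * `pending` more bytes would push a log of `current_size` bytes past
 * `limit_mb` megabytes, and 0 otherwise. A limit of 0 means "no limit".
 * Assumes 1 MB = 1024 * 1024 bytes.
 */
int log_limit_reached(uint64_t current_size, uint64_t pending,
                      uint64_t limit_mb)
{
	uint64_t limit_bytes = limit_mb * 1024u * 1024u;

	if (limit_mb == 0)
		return 0;
	return current_size + pending > limit_bytes;
}
```

A logger built this way would call the check before each write and truncate or rotate the file when it returns 1, instead of running into the file size limit.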

Is this useful ? (It was put in the last time you reported this failure).

Also, log rotation will be supported for OFED 1.2 but I've not had a
chance to incorporate this yet.

-- Hal

> woody


From halr at voltaire.com  Fri Dec 15 13:14:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 16:14:51 -0500
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <4582FF9E.3040901@mellanox.co.il>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166193153.28709.186595.camel@hal.voltaire.com>
	<4582D588.2070506@mellanox.co.il>
	<1166208365.28709.195843.camel@hal.voltaire.com>
	<4582FF9E.3040901@mellanox.co.il>
Message-ID: <1166217285.32666.579.camel@hal.voltaire.com>

On Fri, 2006-12-15 at 15:03, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote:
> >   
> >> Hal Rosenstock wrote:
> >>     
> >>> Hi again Yevgeny,
> >>>
> >>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
> >>>   
> >>>       
> >>>> Hi Hal
> >>>>
> >>>> This patch fixes a bug that caused ucast manager to return
> >>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> >>>> Added a boolean flag that marks whether there was some change or not
> >>>> (in which case OSM_SIGNAL_DONE should be returned).
> >>>>
> >>>> --
> >>>> Yevgeny
> >>>>
> >>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> >>>>     
> >>>>         
> >>> Good catch!
> >>>
> >>> Thanks. Applied.
> >>>
> >>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
> >>>   
> >>>       
> >> I think OFED 1.1 does not have the "incremental" routing patch.
> >>     
> >
> > Right; it doesn't.
> >
> >   
> >> So it does not have this bug.
> >>     
> >
> > Are you sure that the incremental routing caused this to be needed ? By
> > any chance, are you confusing this with a different patch ? Just want to
> > be clear on this...
> >   
> Yes I am sure. Without the new incremental feature every sweep all LFT 
> tables were set.

That sounds like a different bug to me. Yevgeny's patch was for a hang
which involved issuing OSM_SIGNAL_DONE_PENDING rather than
OSM_SIGNAL_DONE. Is this related to incremental routing ?

-- Hal



From eitan at mellanox.co.il  Fri Dec 15 13:26:23 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Fri, 15 Dec 2006 23:26:23 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager
 to'hang'
Message-ID: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com>

Hi Hal,

Every osm manager (step in the algorithm) shall return 
OSM_SIGNAL_DONE_PENDING iff there are outstanding packets on the wire,
or OSM_SIGNAL_DONE if there are none.
The state manager uses these values to determine whether it needs to wait
for all these SMPs to finish or can progress to the next step.

This is a quote from the osm_ucast_mgr.c:
  /*
    For now don't bother checking if the switch forwarding tables
    actually needed updating.  The current code will always update
    them, and thus leave transactions pending on the wire.
    Therefore, return OSM_SIGNAL_DONE_PENDING.
  */
  signal = OSM_SIGNAL_DONE_PENDING;

This assumption was broken by the change that avoids sending Set(LFT)
when the tables did not change.

So the osm_state_mgr was stuck at the stage 
OSM_SM_STATE_SET_UCAST_TABLES_WAIT 
and never got an OSM_SIGNAL_NO_PENDING_TRANSACTIONS to exit it (since
there are no outstanding SMPs).
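
To make the rule concrete, here is a minimal sketch of a manager step that reports "pending" only when it actually put something on the wire. All names (`sketch_signal`, `sketch_ucast_process`, the `lft_changed` array) are hypothetical stand-ins and not the real osm_ucast_mgr.c code; the real fix uses a boolean flag, as described in Yevgeny's patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the OpenSM signal values named above. */
enum sketch_signal {
	SKETCH_SIGNAL_DONE,         /* nothing outstanding; move to next step */
	SKETCH_SIGNAL_DONE_PENDING  /* SMPs outstanding; state manager must wait */
};

/*
 * Walk the switches and count a Set(LFT) only for tables that changed.
 * lft_changed[i] is nonzero when switch i needs an update. The number
 * of requests that would be sent is stored in *n_sent. The function
 * returns DONE_PENDING only if at least one request was issued.
 */
enum sketch_signal sketch_ucast_process(const int *lft_changed,
                                        size_t n_switches, size_t *n_sent)
{
	size_t i;

	*n_sent = 0;
	for (i = 0; i < n_switches; i++)
		if (lft_changed[i])
			(*n_sent)++;	/* a real manager would send the SMP here */

	return *n_sent ? SKETCH_SIGNAL_DONE_PENDING : SKETCH_SIGNAL_DONE;
}
```

With an unconditional DONE_PENDING return, a sweep in which no table changed would leave the caller waiting for completions that never arrive, which is the hang described here.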

EZ



From halr at voltaire.com  Fri Dec 15 13:28:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 16:28:08 -0500
Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_subnet.c: Fix
 port_profile_switch_nodes comment in opensm.opts
Message-ID: <1166218072.32666.1192.camel@hal.voltaire.com>

OpenSM/osm_subnet.c: Fix port_profile_switch_nodes comment in
opensm.opts

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index c218790..3db4612 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -1137,7 +1137,7 @@ osm_subn_write_conf_file(
   fprintf( 
     opts_file,
     "#\n# ROUTING OPTIONS\n#\n"
-    "# If TRUE do not count switches as link subscriptions\n"
+    "# If TRUE count switches as link subscriptions\n"
     "port_profile_switch_nodes %s\n\n",
     p_opts->port_profile_switch_nodes ? "TRUE" : "FALSE");
 

From halr at voltaire.com  Fri Dec 15 13:31:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 16:31:57 -0500
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager
 to'hang'
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BADC@mtlexch01.mtl.com>
Message-ID: <1166218316.32666.1349.camel@hal.voltaire.com>

Hi Eitan,

On Fri, 2006-12-15 at 16:26, Eitan Zahavi wrote:
> Hi Hal,
> 
> Every osm manager (step in the algorithm) shall return 
> OSM_SIGNAL_DONE_PENDING iff there are outstanding packets on the wire.
> Or it should return OSM_SIGNAL_DONE if there are none.
> The state manager uses these values to determine if it needs to wait for
> all these SMPs to finish or
> can progress to the next step.
> 
> This is a quote from the osm_ucast_mgr.c:
>   /*
>     For now don't bother checking if the switch forwarding tables
>     actually needed updating.  The current code will always update
>     them, and thus leave transactions pending on the wire.
>     Therefore, return OSM_SIGNAL_DONE_PENDING.
>   */
>   signal = OSM_SIGNAL_DONE_PENDING;
> 
> This assumption was broken by the change avoiding sending Set(LFT) if
> they did not change.
> 
> So the osm_state_mgr was stuck at the stage 
> OSM_SM_STATE_SET_UCAST_TABLES_WAIT 
> and never got an OSM_SIGNAL_NO_PENDING_TRANSACTIONS to exit it (since
> there are no outstanding SMPs).

Got it. Thanks.

-- Hal



From robert.j.woodruff at intel.com  Fri Dec 15 14:05:14 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Fri, 15 Dec 2006 14:05:14 -0800
Subject: [openib-general] OpenSM core dump - file size exceeded
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C01580D5D@orsmsx418.amr.corp.intel.com>

Hal wrote,
>Any idea what filled up the log ? but that's a side issue.

Yes, we were getting a bunch of multicast errors; Sean is investigating
this.

>This has been discussed on the list before. This is one option which can
>help with this issue:

>        -L, --log_limit <size in MB>
>              This  option defines maximal log file size in MB. When specified
>              the log file will be truncated upon reaching this limit.

Ok, thanks. 

woody


From swise at opengridcomputing.com  Fri Dec 15 14:50:17 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 15 Dec 2006 16:50:17 -0600
Subject: [openib-general] [PATCH] rdma_cm iWARP connection setup timeouts
 reported as rejects.
Message-ID: <20061215225017.22628.17881.stgit@dell3.ogc.int>


The IWCM should report timeouts as event RDMA_CM_EVENT_UNREACHABLE,
not event RDMA_CM_EVENT_REJECTED.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/core/cma.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index afd9383..5fdb9df 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1088,10 +1088,21 @@ static int cma_iw_handler(struct iw_cm_i
 		*sin = iw_event->local_addr;
 		sin = (struct sockaddr_in *) &id_priv->id.route.addr.dst_addr;
 		*sin = iw_event->remote_addr;
-		if (iw_event->status)
-			event.event = RDMA_CM_EVENT_REJECTED;
-		else
+		switch (iw_event->status) {
+		case 0:
 			event.event = RDMA_CM_EVENT_ESTABLISHED;
+			break;
+		case -ECONNRESET:
+		case -ECONNREFUSED:
+			event.event = RDMA_CM_EVENT_REJECTED;
+			break;
+		case -ETIMEDOUT:
+			event.event = RDMA_CM_EVENT_UNREACHABLE;
+			break;
+		default:
+			event.event = RDMA_CM_EVENT_CONNECT_ERROR;
+			break;
+		}
 		break;
 	case IW_CM_EVENT_ESTABLISHED:
 		event.event = RDMA_CM_EVENT_ESTABLISHED;
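
The mapping in the hunk above can be read as a small pure function. The sketch below restates it outside the kernel so it can be compiled and checked on its own; the enum and function names are hypothetical stand-ins for the RDMA_CM_EVENT_* values, and only the errno constants come from the standard header.

```c
#include <errno.h>

/* Hypothetical stand-ins for the RDMA_CM_EVENT_* values in the patch. */
enum sketch_cm_event {
	SKETCH_EVENT_ESTABLISHED,
	SKETCH_EVENT_REJECTED,
	SKETCH_EVENT_UNREACHABLE,
	SKETCH_EVENT_CONNECT_ERROR
};

/*
 * Translate a connect-reply status (0 or a negative errno) into an
 * event, following the same cases as the switch in the patch.
 */
enum sketch_cm_event sketch_map_connect_status(int status)
{
	switch (status) {
	case 0:
		return SKETCH_EVENT_ESTABLISHED;
	case -ECONNRESET:
	case -ECONNREFUSED:
		return SKETCH_EVENT_REJECTED;	/* peer actively refused */
	case -ETIMEDOUT:
		return SKETCH_EVENT_UNREACHABLE;	/* no answer at all */
	default:
		return SKETCH_EVENT_CONNECT_ERROR;
	}
}
```

The point of the split is that a timeout and an explicit refusal no longer collapse into the same event, so a caller can decide to retry in one case and give up in the other.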


From swise at opengridcomputing.com  Fri Dec 15 14:56:06 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Fri, 15 Dec 2006 16:56:06 -0600
Subject: [openib-general] [PATCH] librdmacm Pass back the status or errno in
 RDMA CM events.
Message-ID: <20061215225606.22765.18276.stgit@dell3.ogc.int>


The librdmacm code isn't passing back the errno in all events.

For example, if a connection request times out the kernel CMA will pass
up event RDMA_CM_EVENT_UNREACHABLE with the status set to -ETIMEDOUT.
This errno isn't currently passed back to the librdmacm user in the event.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 src/cma.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/src/cma.c b/src/cma.c
index 5cffa18..850f381 100644
--- a/src/cma.c
+++ b/src/cma.c
@@ -1302,6 +1302,7 @@ retry:
 	default:
 		evt->id_priv = (void *) (uintptr_t) resp->uid;
 		evt->event.id = &evt->id_priv->id;
+		evt->event.status = resp->status;
 		if (evt->id_priv->id.ps == RDMA_PS_TCP)
 			ucma_copy_conn_event(evt, &resp->param.conn);
 		else


From sashak at voltaire.com  Fri Dec 15 16:37:28 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sat, 16 Dec 2006 02:37:28 +0200
Subject: [openib-general] [PATCH] osm: bug that caused ucast manager to
 'hang'
In-Reply-To: <1166217285.32666.579.camel@hal.voltaire.com>
References: <4581ACE5.9000109@dev.mellanox.co.il>
	<1166193153.28709.186595.camel@hal.voltaire.com>
	<4582D588.2070506@mellanox.co.il>
	<1166208365.28709.195843.camel@hal.voltaire.com>
	<4582FF9E.3040901@mellanox.co.il>
	<1166217285.32666.579.camel@hal.voltaire.com>
Message-ID: <1166229448.14664.19.camel@localhost>

On Fri, 2006-12-15 at 16:14 -0500, Hal Rosenstock wrote:
> On Fri, 2006-12-15 at 15:03, Eitan Zahavi wrote:
> > Hal Rosenstock wrote:
> > > On Fri, 2006-12-15 at 12:04, Eitan Zahavi wrote:
> > >   
> > >> Hal Rosenstock wrote:
> > >>     
> > >>> Hi again Yevgeny,
> > >>>
> > >>> On Thu, 2006-12-14 at 14:58, Yevgeny Kliteynik wrote:
> > >>>   
> > >>>       
> > >>>> Hi Hal
> > >>>>
> > >>>> This patch fixes a bug that caused ucast manager to return
> > >>>> OSM_SIGNAL_DONE_PENDING even if there are no pending transactions.
> > >>>> Added a boolean flag that marks whether there was some change or not
> > >>>> (in which case OSM_SIGNAL_DONE should be returned).
> > >>>>
> > >>>> --
> > >>>> Yevgeny
> > >>>>
> > >>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> > >>>>     
> > >>>>         
> > >>> Good catch!
> > >>>
> > >>> Thanks. Applied.
> > >>>
> > >>> Is this issue (and patch or a similar one) also applicable to OFED 1.1 ?
> > >>>   
> > >>>       
> > >> I think OFED 1.1 does not have the "incremental" routing patch.
> > >>     
> > >
> > > Right; it doesn't.
> > >
> > >   
> > >> So it does not have this bug.
> > >>     
> > >
> > > Are you sure that the incremental routing caused this to be needed ? By
> > > any chance, are you confusing this with a different patch ? Just want to
> > > be clear on this...
> > >   
> > Yes I am sure. Without the new incremental feature every sweep all LFT 
> > tables were set.
> 
> That sounds like a different bug to me. Yevgeny's patch was for a hang
> which involved issuing OSM_SIGNAL_DONE_PENDING rather than
> OSM_SIGNAL_DONE. Is this related to incremental routing ?

Before this, the LFT update request was always sent. So yes, it is related.

Sasha



From halr at voltaire.com  Fri Dec 15 17:32:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 15 Dec 2006 20:32:32 -0500
Subject: [openib-general] [PATCH][TRIVIAL] OpenSM/osm_subnet.c: Fix
 sminfo_polling_timeout comment in opensm.opts
Message-ID: <1166232721.32666.12476.camel@hal.voltaire.com>

OpenSM/osm_subnet.c: Fix sminfo_polling_timeout comment in opensm.opts

sminfo_polling_timeout is in msecs rather than secs

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_subnet.c b/osm/opensm/osm_subnet.c
index 3db4612..da82471 100644
--- a/osm/opensm/osm_subnet.c
+++ b/osm/opensm/osm_subnet.c
@@ -1175,7 +1175,7 @@ osm_subn_write_conf_file(
     "sm_priority %u\n\n"
     "# If TRUE other SMs on the subnet should be ignored\n"
     "ignore_other_sm %s\n\n"
-    "# Timeout in [sec] between two polls of active master SM\n"
+    "# Timeout in [msec] between two polls of active master SM\n"
     "sminfo_polling_timeout %u\n\n"
     "# Number of failing polls of remote SM that declares it dead\n"
     "polling_retry_number %u\n\n"


From rdreier at cisco.com  Fri Dec 15 20:57:29 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 15 Dec 2006 20:57:29 -0800
Subject: [openib-general] [GIT PULL] please pull infiniband.git
Message-ID: <adawt4sigjq.fsf@cisco.com>

Linus, please pull from

    master.kernel.org:/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

This tree is also available from kernel.org mirrors at:

    git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git for-linus

A couple of fixes for semi-nasty bugs on 32-bit architectures, plus
one small mthca driver update:

Leonid Arsh (1):
      IB/mthca: Add HCA profile module parameters

Roland Dreier (3):
      IB: Fix ib_dma_alloc_coherent() wrapper
      IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G
      IB/mthca: Use DEFINE_MUTEX() instead of mutex_init()

 drivers/infiniband/hw/mthca/mthca_main.c |  113 +++++++++++++++++++++++++----
 drivers/infiniband/ulp/srp/ib_srp.c      |    2 +-
 drivers/infiniband/ulp/srp/ib_srp.h      |    2 +-
 include/rdma/ib_verbs.h                  |    9 ++-
 4 files changed, 107 insertions(+), 19 deletions(-)
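
As an illustration of the 32-bit problem behind the SRP FMR fix, the sketch below builds the page mask once in 32 bits (what `unsigned long` gives on a 32-bit kernel) and once in 64 bits. The function names are hypothetical and the code is a user-space approximation using fixed-width types, not the driver code itself.

```c
#include <stdint.h>

/*
 * Mask computed in 32 bits, as it would be with unsigned long on a
 * 32-bit kernel. When the result is widened to 64 bits it is
 * zero-extended, so the upper 32 bits of the mask are all 0.
 */
uint64_t sketch_mask32(int page_size)
{
	uint32_t mask = ~((uint32_t) page_size - 1);

	return mask;
}

/* Mask computed in 64 bits: the upper 32 bits stay set. */
uint64_t sketch_mask64(int page_size)
{
	return ~((uint64_t) page_size - 1);
}
```

Applying the 32-bit mask to a DMA address such as 0x123456789 clears everything above bit 31 and yields 0x23456000, while the 64-bit mask yields 0x123456000 and preserves the part of the address above 4G.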


diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 0491ec7..44bc6cc 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -80,24 +80,61 @@ static int tune_pci = 0;
 module_param(tune_pci, int, 0444);
 MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if nonzero");
 
-struct mutex mthca_device_mutex;
+DEFINE_MUTEX(mthca_device_mutex);
+
+#define MTHCA_DEFAULT_NUM_QP            (1 << 16)
+#define MTHCA_DEFAULT_RDB_PER_QP        (1 << 2)
+#define MTHCA_DEFAULT_NUM_CQ            (1 << 16)
+#define MTHCA_DEFAULT_NUM_MCG           (1 << 13)
+#define MTHCA_DEFAULT_NUM_MPT           (1 << 17)
+#define MTHCA_DEFAULT_NUM_MTT           (1 << 20)
+#define MTHCA_DEFAULT_NUM_UDAV          (1 << 15)
+#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18)
+#define MTHCA_DEFAULT_NUM_UARC_SIZE     (1 << 18)
+
+static struct mthca_profile hca_profile = {
+	.num_qp             = MTHCA_DEFAULT_NUM_QP,
+	.rdb_per_qp         = MTHCA_DEFAULT_RDB_PER_QP,
+	.num_cq             = MTHCA_DEFAULT_NUM_CQ,
+	.num_mcg            = MTHCA_DEFAULT_NUM_MCG,
+	.num_mpt            = MTHCA_DEFAULT_NUM_MPT,
+	.num_mtt            = MTHCA_DEFAULT_NUM_MTT,
+	.num_udav           = MTHCA_DEFAULT_NUM_UDAV,          /* Tavor only */
+	.fmr_reserved_mtts  = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */
+	.uarc_size          = MTHCA_DEFAULT_NUM_UARC_SIZE,     /* Arbel only */
+};
+
+module_param_named(num_qp, hca_profile.num_qp, int, 0444);
+MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA");
+
+module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444);
+MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP");
+
+module_param_named(num_cq, hca_profile.num_cq, int, 0444);
+MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA");
+
+module_param_named(num_mcg, hca_profile.num_mcg, int, 0444);
+MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA");
+
+module_param_named(num_mpt, hca_profile.num_mpt, int, 0444);
+MODULE_PARM_DESC(num_mpt,
+		"maximum number of memory protection table entries per HCA");
+
+module_param_named(num_mtt, hca_profile.num_mtt, int, 0444);
+MODULE_PARM_DESC(num_mtt,
+		 "maximum number of memory translation table segments per HCA");
+
+module_param_named(num_udav, hca_profile.num_udav, int, 0444);
+MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA");
+
+module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444);
+MODULE_PARM_DESC(fmr_reserved_mtts,
+		 "number of memory translation table segments reserved for FMR");
 
 static const char mthca_version[] __devinitdata =
 	DRV_NAME ": Mellanox InfiniBand HCA driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
-static struct mthca_profile default_profile = {
-	.num_qp		   = 1 << 16,
-	.rdb_per_qp	   = 4,
-	.num_cq		   = 1 << 16,
-	.num_mcg	   = 1 << 13,
-	.num_mpt	   = 1 << 17,
-	.num_mtt	   = 1 << 20,
-	.num_udav	   = 1 << 15,	/* Tavor only */
-	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
-	.uarc_size	   = 1 << 18,	/* Arbel only */
-};
-
 static int mthca_tune_pci(struct mthca_dev *mdev)
 {
 	int cap;
@@ -303,7 +340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev)
 		goto err_disable;
 	}
 
-	profile = default_profile;
+	profile = hca_profile;
 	profile.num_uar   = dev_lim.uar_size / PAGE_SIZE;
 	profile.uarc_size = 0;
 	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
@@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev)
 		goto err_stop_fw;
 	}
 
-	profile = default_profile;
+	profile = hca_profile;
 	profile.num_uar  = dev_lim.uar_size / PAGE_SIZE;
 	profile.num_udav = 0;
 	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
@@ -1278,11 +1315,55 @@ static struct pci_driver mthca_driver = {
 	.remove		= __devexit_p(mthca_remove_one)
 };
 
+static void __init __mthca_check_profile_val(const char *name, int *pval,
+					     int pval_default)
+{
+	/* value must be positive and power of 2 */
+	int old_pval = *pval;
+
+	if (old_pval <= 0)
+		*pval = pval_default;
+	else
+		*pval = roundup_pow_of_two(old_pval);
+
+	if (old_pval != *pval) {
+		printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n",
+		       old_pval, name);
+		printk(KERN_WARNING PFX "Corrected %s to %d.\n", name, *pval);
+	}
+}
+
+#define mthca_check_profile_val(name, default)				\
+	__mthca_check_profile_val(#name, &hca_profile.name, default)
+
+static void __init mthca_validate_profile(void)
+{
+	mthca_check_profile_val(num_qp,            MTHCA_DEFAULT_NUM_QP);
+	mthca_check_profile_val(rdb_per_qp,        MTHCA_DEFAULT_RDB_PER_QP);
+	mthca_check_profile_val(num_cq,            MTHCA_DEFAULT_NUM_CQ);
+	mthca_check_profile_val(num_mcg, 	   MTHCA_DEFAULT_NUM_MCG);
+	mthca_check_profile_val(num_mpt, 	   MTHCA_DEFAULT_NUM_MPT);
+	mthca_check_profile_val(num_mtt, 	   MTHCA_DEFAULT_NUM_MTT);
+	mthca_check_profile_val(num_udav,          MTHCA_DEFAULT_NUM_UDAV);
+	mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS);
+
+	if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) {
+		printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n",
+		       hca_profile.fmr_reserved_mtts);
+		printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n",
+		       hca_profile.num_mtt);
+		hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2;
+		printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n",
+		       hca_profile.fmr_reserved_mtts);
+	}
+}
+
 static int __init mthca_init(void)
 {
 	int ret;
 
-	mutex_init(&mthca_device_mutex);
+	mthca_validate_profile();
+
 	ret = mthca_catas_init();
 	if (ret)
 		return ret;
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index e9b6a6f..cdecbf5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1898,7 +1898,7 @@ static void srp_add_one(struct ib_device *device)
 	 */
 	srp_dev->fmr_page_shift = max(9, ffs(dev_attr->page_size_cap) - 1);
 	srp_dev->fmr_page_size  = 1 << srp_dev->fmr_page_shift;
-	srp_dev->fmr_page_mask  = ~((unsigned long) srp_dev->fmr_page_size - 1);
+	srp_dev->fmr_page_mask  = ~((u64) srp_dev->fmr_page_size - 1);
 
 	INIT_LIST_HEAD(&srp_dev->dev_list);
 
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h
index 868a540..c217723 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -87,7 +87,7 @@ struct srp_device {
 	struct ib_fmr_pool     *fmr_pool;
 	int			fmr_page_shift;
 	int			fmr_page_size;
-	unsigned long		fmr_page_mask;
+	u64			fmr_page_mask;
 };
 
 struct srp_host {
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3c2e105..0bfa332 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1639,7 +1639,14 @@ static inline void *ib_dma_alloc_coherent(struct ib_device *dev,
 {
 	if (dev->dma_ops)
 		return dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag);
-	return dma_alloc_coherent(dev->dma_device, size, dma_handle, flag);
+	else {
+		dma_addr_t handle;
+		void *ret;
+
+		ret = dma_alloc_coherent(dev->dma_device, size, &handle, flag);
+		*dma_handle = handle;
+		return ret;
+	}
 }
 
 /**
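
The parameter check added to mthca_main.c above (non-positive values fall back to the default, anything else is rounded up to a power of two) can be sketched in user space as follows. The helper names are hypothetical, and `sketch_roundup_pow_of_two` is a simple stand-in for the kernel's roundup_pow_of_two() that assumes its argument is between 1 and 1 << 30.

```c
/*
 * User-space stand-in for the kernel's roundup_pow_of_two().
 * Assumes 1 <= n <= (1 << 30) so the shift cannot overflow an int.
 */
int sketch_roundup_pow_of_two(int n)
{
	int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirror of the validation rule in the patch: a value that is zero or
 * negative is replaced by the default; any other value is rounded up
 * to the next power of two (and left alone if it already is one).
 */
int sketch_check_profile_val(int val, int val_default)
{
	if (val <= 0)
		return val_default;
	return sketch_roundup_pow_of_two(val);
}
```

So a request for 1000 QPs would be corrected to 1024, and a request for 0 would fall back to the default. The patch additionally prints a warning whenever the corrected value differs from what was passed in.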


From rdreier at cisco.com  Fri Dec 15 21:04:23 2006
From: rdreier at cisco.com (Roland Dreier)
Date: Fri, 15 Dec 2006 21:04:23 -0800
Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module
	parameters
In-Reply-To: <457BF221.8080701@voltaire.com> (Moni Shoua's message of
	"Sun, 10 Dec 2006 13:40:17 +0200")
References: <457BF221.8080701@voltaire.com>
Message-ID: <adaodq4ig88.fsf@cisco.com>

OK, the patch below is what I ended up committing.  I am really not
pleased with the patch you sent and expected me to include -- there
are really obvious simple-to-fix things that it's just ridiculous for
you to be sending, eg:

 > +MODULE_PARM_DESC(num_mpt, 

trailing whitespace -- please check that your patch applies with 'git
apply --check --whitespace=error-all'

 > +		"maximum number of memory protection pable entries per HCA");

umm, 'pable'??

and plenty of other things...

For some reason I felt guilty about letting this patch hang for so
long, and so I fixed it up, but after doing it this time, I'm not
going to spend my time like that again.  I have plenty of work to do
without cleaning up other people's messes...

    IB/mthca: Add HCA profile module parameters
    
    Add module parameters that enable setting some of the HCA
    profile values, such as the number of QPs, CQs, etc.
    
    Signed-off-by: Leonid Arsh <leonida at voltaire.com>
    Signed-off-by: Moni Shoua <monis at voltaire.com>
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 0491ec7..711c1b8 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -82,22 +82,59 @@ MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if n
 
 struct mutex mthca_device_mutex;
 
+#define MTHCA_DEFAULT_NUM_QP            (1 << 16)
+#define MTHCA_DEFAULT_RDB_PER_QP        (1 << 2)
+#define MTHCA_DEFAULT_NUM_CQ            (1 << 16)
+#define MTHCA_DEFAULT_NUM_MCG           (1 << 13)
+#define MTHCA_DEFAULT_NUM_MPT           (1 << 17)
+#define MTHCA_DEFAULT_NUM_MTT           (1 << 20)
+#define MTHCA_DEFAULT_NUM_UDAV          (1 << 15)
+#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18)
+#define MTHCA_DEFAULT_NUM_UARC_SIZE     (1 << 18)
+
+static struct mthca_profile hca_profile = {
+	.num_qp             = MTHCA_DEFAULT_NUM_QP,
+	.rdb_per_qp         = MTHCA_DEFAULT_RDB_PER_QP,
+	.num_cq             = MTHCA_DEFAULT_NUM_CQ,
+	.num_mcg            = MTHCA_DEFAULT_NUM_MCG,
+	.num_mpt            = MTHCA_DEFAULT_NUM_MPT,
+	.num_mtt            = MTHCA_DEFAULT_NUM_MTT,
+	.num_udav           = MTHCA_DEFAULT_NUM_UDAV,          /* Tavor only */
+	.fmr_reserved_mtts  = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */
+	.uarc_size          = MTHCA_DEFAULT_NUM_UARC_SIZE,     /* Arbel only */
+};
+
+module_param_named(num_qp, hca_profile.num_qp, int, 0444);
+MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA");
+
+module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444);
+MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP");
+
+module_param_named(num_cq, hca_profile.num_cq, int, 0444);
+MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA");
+
+module_param_named(num_mcg, hca_profile.num_mcg, int, 0444);
+MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA");
+
+module_param_named(num_mpt, hca_profile.num_mpt, int, 0444);
+MODULE_PARM_DESC(num_mpt,
+		"maximum number of memory protection table entries per HCA");
+
+module_param_named(num_mtt, hca_profile.num_mtt, int, 0444);
+MODULE_PARM_DESC(num_mtt,
+		 "maximum number of memory translation table segments per HCA");
+
+module_param_named(num_udav, hca_profile.num_udav, int, 0444);
+MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA");
+
+module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444);
+MODULE_PARM_DESC(fmr_reserved_mtts,
+		 "number of memory translation table segments reserved for FMR");
+
 static const char mthca_version[] __devinitdata =
 	DRV_NAME ": Mellanox InfiniBand HCA driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
-static struct mthca_profile default_profile = {
-	.num_qp		   = 1 << 16,
-	.rdb_per_qp	   = 4,
-	.num_cq		   = 1 << 16,
-	.num_mcg	   = 1 << 13,
-	.num_mpt	   = 1 << 17,
-	.num_mtt	   = 1 << 20,
-	.num_udav	   = 1 << 15,	/* Tavor only */
-	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
-	.uarc_size	   = 1 << 18,	/* Arbel only */
-};
-
 static int mthca_tune_pci(struct mthca_dev *mdev)
 {
 	int cap;
@@ -303,7 +340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev)
 		goto err_disable;
 	}
 
-	profile = default_profile;
+	profile = hca_profile;
 	profile.num_uar   = dev_lim.uar_size / PAGE_SIZE;
 	profile.uarc_size = 0;
 	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
@@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev)
 		goto err_stop_fw;
 	}
 
-	profile = default_profile;
+	profile = hca_profile;
 	profile.num_uar  = dev_lim.uar_size / PAGE_SIZE;
 	profile.num_udav = 0;
 	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
@@ -1278,11 +1315,57 @@ static struct pci_driver mthca_driver = {
 	.remove		= __devexit_p(mthca_remove_one)
 };
 
+static void __init __mthca_check_profile_val(const char *name, int *pval,
+					     int pval_default)
+{
+	/* value must be positive and power of 2 */
+	int old_pval = *pval;
+
+	if (old_pval <= 0)
+		*pval = pval_default;
+	else
+		*pval = roundup_pow_of_two(old_pval);
+
+	if (old_pval != *pval) {
+		printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n",
+		       old_pval, name);
+		printk(KERN_WARNING PFX "Corrected %s to %d.\n", name, *pval);
+	}
+}
+
+#define mthca_check_profile_val(name, default)				\
+	__mthca_check_profile_val(#name, &hca_profile.name, default)
+
+static void __init mthca_validate_profile(void)
+{
+	mthca_check_profile_val(num_qp,            MTHCA_DEFAULT_NUM_QP);
+	mthca_check_profile_val(rdb_per_qp,        MTHCA_DEFAULT_RDB_PER_QP);
+	mthca_check_profile_val(num_cq,            MTHCA_DEFAULT_NUM_CQ);
+	mthca_check_profile_val(num_mcg, 	   MTHCA_DEFAULT_NUM_MCG);
+	mthca_check_profile_val(num_mpt, 	   MTHCA_DEFAULT_NUM_MPT);
+	mthca_check_profile_val(num_mtt, 	   MTHCA_DEFAULT_NUM_MTT);
+	mthca_check_profile_val(num_udav,          MTHCA_DEFAULT_NUM_UDAV);
+	mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS);
+
+	if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) {
+		printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n",
+		       hca_profile.fmr_reserved_mtts);
+		printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n",
+		       hca_profile.num_mtt);
+		hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2;
+		printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n",
+		       hca_profile.fmr_reserved_mtts);
+	}
+}
+
 static int __init mthca_init(void)
 {
 	int ret;
 
 	mutex_init(&mthca_device_mutex);
+
+	mthca_validate_profile();
+
 	ret = mthca_catas_init();
 	if (ret)
 		return ret;
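
The correction rule that __mthca_check_profile_val() applies can be tried
outside the kernel with a few lines of plain C. This is only a sketch:
round_up_pow2() and check_profile_val() are hypothetical userspace stand-ins
for the kernel's roundup_pow_of_two() and the helper in the patch, and the
printk() warnings are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's roundup_pow_of_two().
 * Assumes 0 < v <= 2^30 so that the result still fits in an int. */
static int round_up_pow2(int v)
{
	int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Same rule as in the patch: a non-positive value falls back to the
 * default, anything else is rounded up to the next power of two.
 * Returns 1 if the value had to be corrected, 0 if it was left alone. */
static int check_profile_val(int *pval, int pval_default)
{
	int old;

	assert(pval != NULL);
	old = *pval;

	if (old <= 0)
		*pval = pval_default;
	else
		*pval = round_up_pow2(old);

	return old != *pval;
}
```

With this rule, a request such as num_qp=1000 would end up as 1024, and
num_qp=0 would end up as the built-in default.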


From jsquyres at cisco.com  Fri Dec 15 21:30:55 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Sat, 16 Dec 2006 00:30:55 -0500
Subject: [openib-general] <new>.openfabrics.org names
In-Reply-To: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com>
References: <55CE0347B98FCA468923E5FBC25CB4DC4097DD@orsmsx413.amr.corp.intel.com>
Message-ID: <C65FEA0A-DC60-4D19-AA5B-78CADA76A9BE@cisco.com>

I think the question is -- who has the godaddy password?  If Michael  
is the only one who has it, can someone contact him to get it?  Jim?


On Dec 15, 2006, at 12:17 PM, Ryan, Jim wrote:

> Michael has done this in the past but he's on sabbatical and  
> unavailable
> for several weeks. Can someone else do this?
>
> Thanks, Jim
>
> -----Original Message-----
> From: Matt Leininger [mailto:mlleinin at hpcn.ca.sandia.gov]
> Sent: Friday, December 15, 2006 9:15 AM
> To: Jeff Squyres
> Cc: openib; Ryan, Jim; Oros, Michael
> Subject: Re: [openib-general] <new>.openfabrics.org names
>
> On Fri, 2006-12-15 at 08:17 -0500, Jeff Squyres wrote:
>> These names still don't appear to exist.  Do we know when they'll be
>> created?
>
>   Intel controls the openfabrics.org domain name.  I think Jim or
> Michael can make this happen.
>
>   - Matt
>
>>
>>
>> On Dec 4, 2006, at 2:00 PM, Jeff Squyres wrote:
>>
>>> Who controls the DNS for openfabrics.org?  Could we get these names
>>> created?  Or -- are there any objections to creating / using such
>>> names?
>>>
>>> Thanks!
>>>
>>>
>>> On Nov 28, 2006, at 10:54 AM, Jeff Squyres wrote:
>>>
>>>> The name "staging.openfabrics.org" was really intended to be
>>>> temporary until the old openfabrics.org was taken offline and
>>>> replaced with the new one.
>>>>
>>>> My $0.02 is that we should stop using staging.openfabrics.org as
>>>> soon as possible and create / start using some new names for the
>>>> server to allow for potential transparent service relocation
> someday.
>>>>
>>>> Here are some new name suggestions that could be done immediately
>>>> (with appropriate changes to DNS, apache config, ...and potentially
>>>> others):
>>>>
>>>>  * git.openfabrics.org: for all git activity
>>>>  * wiki.openfabrics.org: a top-level name for the wiki rather than
>>>> burying it under several layers of links on the web site
>>>>  * trac.openfabrics.org: if someone creates this name, I volunteer
>>>> to finally get off my butt and install trac to see if people like
> it
>>>>
>>>> These are the old names and would need to be changed in DNS only
>>>> when the old server is taken offline / we're ready to move to the
>>>> new server:
>>>>
>>>>  * openfabrics.org: redirect to www.openfabrics.org, and for mail
>>>> traffic
>>>>  * www.openfabrics.org: main web site
>>>>
>>>> -- 
>>>> Jeff Squyres
>>>> Server Virtualization Business Unit
>>>> Cisco Systems
>>>>
>>>>
>>>
>>>
>>> -- 
>>> Jeff Squyres
>>> Server Virtualization Business Unit
>>> Cisco Systems
>>>
>>>
>>> _______________________________________________
>>> openib-general mailing list
>>> openib-general at openib.org
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>> To unsubscribe, please visit http://openib.org/mailman/listinfo/
>>> openib-general
>>
>>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From dotanb at dev.mellanox.co.il  Fri Dec 15 23:29:36 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Sat, 16 Dec 2006 09:29:36 +0200 (IST)
Subject: [openib-general] can i use the multicast module in user level?
In-Reply-To: <4582DECD.70301@ichips.intel.com>
References: <3840.85.65.224.66.1166199920.squirrel@dev.mellanox.co.il>
	<4582DECD.70301@ichips.intel.com>
Message-ID: <1539.85.65.224.140.1166254176.squirrel@dev.mellanox.co.il>

>> I would like to use the multicast module in user level tests (in order
>> to
>> send a join message to the multicast groups that I'm using).
>>
>> Can I use the multicast module in user level?
>> (if the answer is yes, is there any code reference that I can use?)
>
> Multicast support has only been exposed to userspace through the
> librdmacm.
> There's a mckey test app that shows how this can be used.
>
> I will be working on a raw IB multicast / InformInfo userspace support
> through
> January.  There is an older userspace SA library that you might be able to
> play
> with as well, but you'd have to look back through the mail logs to find
> the patches.

Thank you very much.

I think I will wait until the raw IB multicast support is ready; you
have a waiting customer ..
;)

thanks
Dotan


From eitan at mellanox.co.il  Sat Dec 16 14:12:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 17 Dec 2006 00:12:28 +0200
Subject: [openib-general] [PATCH] osm: fix bugs related to not passing
 OSM_SIGNAL_DONE_PENDING
Message-ID: <45846F4C.4080501@mellanox.co.il>

Hi Hal

This set of patches fixes cases where OSM_SIGNAL_DONE_PENDING is not
passed back to the state manager, which breaks the state machine later
in the sweep.

Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

 osm/opensm/osm_pkey_mgr.c  |  112 ++++++++++++++++++++++++++++++++------------
 osm/opensm/osm_state_mgr.c |   11 +++--
 osm/opensm/osm_ucast_mgr.c |   96 ++++++++++++++++++++++++--------------
 4 files changed, 179 insertions(+), 88 deletions(-)

diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index 48837bc..a33aec7 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
 
 /**********************************************************************
  **********************************************************************/
-static ib_api_status_t
+static boolean_t
 pkey_mgr_enforce_partition(
+  IN osm_log_t *p_log,
   IN const osm_req_t *p_req,
   IN const osm_physp_t *p_physp,
   IN const boolean_t enforce)
@@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
   osm_madw_context_t context;
   uint8_t payload[IB_SMP_DATA_SIZE];
   ib_port_info_t *p_pi;
+  ib_api_status_t status;
 
   if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
-    return IB_ERROR;
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0507: "
+              "No port info for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
 
-  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
-    return IB_SUCCESS;
+  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "No need to update PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
   memset( payload, 0, IB_SMP_DATA_SIZE );
   memcpy( payload, p_pi, sizeof(ib_port_info_t) );
@@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
   context.pi_context.light_sweep = FALSE;
   context.pi_context.active_transition = FALSE;
 
-  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
-                      payload, sizeof(payload),
-                      IB_MAD_ATTR_PORT_INFO,
-                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
-                      CL_DISP_MSGID_NONE, &context );
+  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
+        payload, sizeof(payload),
+        IB_MAD_ATTR_PORT_INFO,
+        cl_hton32( osm_physp_get_port_num( p_physp ) ),
+        CL_DISP_MSGID_NONE, &context );
+  if (status != IB_SUCCESS)
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0520: "
+              "Failed to set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
+  else
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "Set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+   return TRUE;
+  }
 }
 
 /**********************************************************************
@@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
 
     status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index );
     if (status == IB_SUCCESS)
-      ret_val = TRUE;
+  {
+   osm_log( p_log, OSM_LOG_DEBUG,
+      "pkey_mgr_update_port: "
+      "Updated "
+      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+      block_index,
+      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+      osm_physp_get_port_num( p_physp ) );
+   ret_val = TRUE;
+  }
     else
-      osm_log( p_log, OSM_LOG_ERROR,
-        "pkey_mgr_update_port: ERR 0506: "
-        "pkey_mgr_update_pkey_entry() failed to update "
-        "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
-        block_index,
-        cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-        osm_physp_get_port_num( p_physp ) );
+  {
+   osm_log( p_log, OSM_LOG_ERROR,
+      "pkey_mgr_update_port: ERR 0506: "
+      "pkey_mgr_update_pkey_entry() failed to update "
+      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+      block_index,
+      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+      osm_physp_get_port_num( p_physp ) );
+  }
   }
 
   return ret_val;
@@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
   uint16_t peer_max_blocks;
   ib_api_status_t status = IB_SUCCESS;
   boolean_t ret_val = FALSE;
+  boolean_t port_info_set = FALSE;
   ib_pkey_table_t empty_block;
-
+ 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
   p_physp = osm_port_get_default_phys_ptr( p_port );
@@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
     enforce = FALSE;
   }
 
-  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-      "pkey_mgr_update_peer_port: ERR 0507: "
-      "pkey_mgr_enforce_partition() failed to update "
-      "node 0x%016" PRIx64 " port %u\n",
-      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-      osm_physp_get_port_num( peer ) );
-  }
+  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
+   port_info_set = TRUE;
 
   if (enforce == FALSE)
-    return FALSE;
+  return port_info_set;
 
   p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
   for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
@@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
              osm_physp_get_port_num( peer ) );
   }
 
+  if (port_info_set) return TRUE;
   return ret_val;
 }
 
@@ -541,10 +593,10 @@ osm_pkey_mgr_process(
       signal = OSM_SIGNAL_DONE_PENDING;
     p_node = osm_port_get_parent_node( p_port );
     if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
-  pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
+   pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
         &p_osm->subn, p_port,
         !p_osm->subn.opt.no_partition_enforcement ) )
-      signal = OSM_SIGNAL_DONE_PENDING;       
+      signal = OSM_SIGNAL_DONE_PENDING;
   }
 
  _err:
diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 9eac038..4e61259 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1853,6 +1853,7 @@ osm_state_mgr_process(
 {
    ib_api_status_t status;
    osm_remote_sm_t *p_remote_sm;
+ osm_signal_t tmp_signal;
 
    CL_ASSERT( p_mgr );
 
@@ -2075,11 +2076,10 @@ osm_state_mgr_process(
          case OSM_SIGNAL_CHANGE_DETECTED:
             /*
              * Nothing to do here.  One subnet change typcially
-             * begets another....
+             * begets another.... But needs to wait for all transactions
              */
             signal = OSM_SIGNAL_NONE;
-            break;
-
+    break;
          case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
             /*
              * A change was detected on the subnet.
@@ -2219,7 +2219,10 @@ osm_state_mgr_process(
             signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
 
             /* the returned signal is always DONE */
-            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+
+    if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
+     signal = OSM_SIGNAL_DONE_PENDING;
 
             /* try to restore SA DB (this should be before lid_mgr
                because we may want to disable clients reregistration
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index e977253..39973de 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table(
   ib_switch_info_t si;
   uint32_t block_id_ho = 0;
   uint8_t block[IB_SMP_DATA_SIZE];
+  boolean_t set_swinfo_require = FALSE;
+  uint16_t lin_top;
+  uint8_t life_state;
 
   CL_ASSERT( p_mgr );
 
@@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table(
     Set the top of the unicast forwarding table.
   */
   si = *osm_switch_get_si_ptr( p_sw );
-  si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  if (si.lin_top != lin_top)
+  {
+   set_swinfo_require = TRUE;
+      si.lin_top  = lin_top;
+  }
 
   /* check to see if the change state bit is on. If it is - then we
      need to clear it. */
-   if( ib_switch_info_get_state_change( &si ) )
-    si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
-                      | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
+  if ( ib_switch_info_get_state_change( &si ) )
+      life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
+                          | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
   else
-    si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
+      life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
 
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+  if (life_state != si.life_state)
   {
-    osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-             "osm_ucast_mgr_set_fwd_table: "
-             "Setting switch FT top to LID 0x%X\n",
-             osm_switch_get_max_lid_ho( p_sw ) );
+      set_swinfo_require = TRUE;
+      si.life_state = life_state;
   }
-
-  context.si_context.light_sweep = FALSE;
-  context.si_context.node_guid = osm_node_get_node_guid( p_node );
-  context.si_context.set_method = TRUE;
-
-  status = osm_req_set( p_mgr->p_req,
-                        p_path,
-                        (uint8_t*)&si,
-                        sizeof(si),
-                        IB_MAD_ATTR_SWITCH_INFO,
-                        0,
-                        CL_DISP_MSGID_NONE,
-                        &context );
-
-  if( status != IB_SUCCESS )
+ 
+  if ( set_swinfo_require )
   {
-    osm_log( p_mgr->p_log, OSM_LOG_ERROR,
-             "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
-             "Sending SwitchInfo attribute failed (%s)\n",
-             ib_get_err_str( status ) );
+      if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+      {
+          osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+                      "osm_ucast_mgr_set_fwd_table: "
+                      "Setting switch FT top to LID 0x%X\n",
+                      osm_switch_get_max_lid_ho( p_sw ) );
+      }
+     
+      context.si_context.light_sweep = FALSE;
+      context.si_context.node_guid = osm_node_get_node_guid( p_node );
+      context.si_context.set_method = TRUE;
+     
+      status = osm_req_set( p_mgr->p_req,
+                                    p_path,
+                                    (uint8_t*)&si,
+                                    sizeof(si),
+                                    IB_MAD_ATTR_SWITCH_INFO,
+                                    0,
+                                    CL_DISP_MSGID_NONE,
+                                    &context );
+     
+      if( status != IB_SUCCESS )
+      {
+          osm_log( p_mgr->p_log, OSM_LOG_ERROR,
+                      "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
+                      "Sending SwitchInfo attribute failed (%s)\n",
+                      ib_get_err_str( status ) );
+      }
+      else
+          p_mgr->any_change = TRUE;
   }
 
   /*
@@ -1215,13 +1234,14 @@ osm_ucast_mgr_process(
 
   CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
 
+  p_mgr->any_change = FALSE;
+
   /*
     If there are no switches in the subnet, we are done.
   */
   if (cl_qmap_count( p_sw_guid_tbl ) == 0)
     goto Exit;
 
-  p_mgr->any_change = FALSE;
   cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
 
   if (!p_routing_eng->build_lid_matrices ||
@@ -1248,14 +1268,20 @@ osm_ucast_mgr_process(
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
     __osm_ucast_mgr_dump_tables( p_mgr );
 
-  if (p_mgr->any_change)
+  if (p_mgr->any_change)
+  {
      signal = OSM_SIGNAL_DONE_PENDING;
+      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+                 "osm_ucast_mgr_process: "
+                 "LFT Tables configured on all switches\n");
+  }
   else
+  {
+      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+                 "osm_ucast_mgr_process: "
+                 "No need to set any LFT Tables on all switches\n");
      signal = OSM_SIGNAL_DONE;
-
-  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
-          "osm_ucast_mgr_process: "
-          "LFT Tables configured on all switches\n");
+  }
 
  Exit:
   CL_PLOCK_RELEASE( p_mgr->p_lock );
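
The osm_state_mgr.c hunk above comes down to one rule: a later setup step
must not overwrite a DONE_PENDING result from an earlier one. A small
standalone sketch of that rule follows. The enum and merge_signal() are
simplified stand-ins written for illustration; they are not the OpenSM
definitions.

```c
/* Simplified stand-ins for the two OpenSM signals involved. */
enum sweep_signal {
	SIGNAL_DONE,
	SIGNAL_DONE_PENDING
};

/* Combine the results of two setup steps: if either step still has
 * transactions in flight, the combined result has to say so. */
static enum sweep_signal merge_signal(enum sweep_signal a,
				      enum sweep_signal b)
{
	if (a == SIGNAL_DONE_PENDING || b == SIGNAL_DONE_PENDING)
		return SIGNAL_DONE_PENDING;
	return SIGNAL_DONE;
}
```

In terms of the patch, `a` would be the value returned by
osm_pkey_mgr_process() and `b` the value returned by osm_qos_setup().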


From eitan at mellanox.co.il  Sat Dec 16 10:56:39 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 16 Dec 2006 20:56:39 +0200
Subject: [openib-general] [PATCH] osm: fix a bug in ignroing pending
 transaction of Light Sweep
Message-ID: <45844167.9060302@mellanox.co.il>

Hi Hal

This patch fixes an issue discovered by the nightly regression.
The OpenSM state machine got stuck because a pending SwitchInfo
transaction was ignored after one of the SwitchInfo queries failed
(due to a bad link).
The patch below simply avoids aborting the wait for all SwitchInfo
requests to return.

I think this issue might have hurt us in other situations too, since it
aborted the wait on "CHANGE DETECTED" as well.
CHANGE_DETECTED is fired on the first switch that reports the "Change Bit".

It is possible that the issue is showing up now that we added incremental
support (e.g. for routing), since we get the "NO_PENDING_TRANSACTIONS"
signal caused by the SwitchInfo requests only if no other SMPs are sent
during the heavy sweep.

Eitan

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

 osm/opensm/osm_state_mgr.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 9eac038..91d9dbd 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -2075,11 +2075,10 @@ osm_state_mgr_process(
          case OSM_SIGNAL_CHANGE_DETECTED:
             /*
              * Nothing to do here.  One subnet change typcially
-             * begets another....
+             * begets another.... But needs to wait for all transactions to
+             * complete
              */
-            signal = OSM_SIGNAL_NONE;
             break;
-
          case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
             /*
              * A change was detected on the subnet.


From mst at mellanox.co.il  Sat Dec 16 09:03:28 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Dec 2006 19:03:28 +0200
Subject: [openib-general] Fw: openib-general Digest, Vol 30, Issue 135
In-Reply-To: <OF163DE11D.BB337468-ON85257245.0067968E-85257245.00688586@us.ibm.com>
References: <OF163DE11D.BB337468-ON85257245.0067968E-85257245.00688586@us.ibm.com>
Message-ID: <20061216170328.GB24716@mellanox.co.il>

First, please avoid answering the digest. This breaks threading in most mailers.

> > > >
> > > > Tried this patch, it didn't work on ehca. I couldn't change the mode from
> > > > datagram to connected from /sys/class.
> > >
> > > It's working as designed in that respect.  ehca does not implement
> > > srq - without
> > > srq, there is no way to prepost receive buffers for a reasonable number of
> > > connections without running out of memory.
> > >
> > > So it is falling back on datagram mode.
> > > Talk to ehca guys to implement srq and connected mode will be enabled.
> > Don't remember SRQ is a MUST for UC mode. Does this patch support
> > devices with SRQ in RC mode?
> 
> I don't think the IB HCA Spec requires SRQ support for RC but is an optional
> feature. There are two adapters right now that don't support SRQ which means to
> use IPoIB-CM on them you should make the use of SRQ an option setting.

No, adding such "drink up all memory on real clusters but run well on a back to back
benchmark platform" option does not seem like a good idea to me.
Rather, we should use UD mode to keep IPoIB scalable on all hardware.

> I agree
> that if it is available it should be used for scaling issues probably if
> available automatically set. But I would like to see us at least support the
> current hardware that meets the current SPEC.

SRQ support is clearly optional. But neither is IPoIB CM support a required
feature. Current code will fall back to datagram mode when SRQ is not
supported, and since UD support is not optional, all current hardware is still
supported with IPoIB - this patch does not break this.
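
The fallback rule described above fits in a few lines of C. This is a sketch
only: struct dev_caps, its has_srq field and pick_ipoib_mode() are invented
for illustration and do not correspond to the code in the patch.

```c
/* Hypothetical names, used only to illustrate the fallback rule. */
enum ipoib_mode {
	IPOIB_DATAGRAM,		/* UD: mandatory, always available */
	IPOIB_CONNECTED		/* RC: only used when SRQ is available */
};

struct dev_caps {
	int has_srq;		/* non-zero if the device/driver offers SRQ */
};

/* Connected mode is granted only when it was requested and the device
 * supports SRQ; every other combination stays in datagram mode. */
static enum ipoib_mode pick_ipoib_mode(const struct dev_caps *caps,
				       int want_connected)
{
	if (want_connected && caps->has_srq)
		return IPOIB_CONNECTED;
	return IPOIB_DATAGRAM;
}
```

A device without SRQ therefore keeps working in datagram mode even when
connected mode is requested, which matches the behaviour reported for ehca.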

-- 
MST


From mst at mellanox.co.il  Sat Dec 16 08:47:09 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sat, 16 Dec 2006 18:47:09 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <OFC0CE579C.BE8985BF-ON87257245.005D816F-88257245.005DEFD2@us.ibm.com>
References: <20061215051438.GH19449@mellanox.co.il>
	<OFC0CE579C.BE8985BF-ON87257245.005D816F-88257245.005DEFD2@us.ibm.com>
Message-ID: <20061216164709.GA24716@mellanox.co.il>

> > > Hi, Michael,
> > >
> > > Tried this patch, it didn't work on ehca. I couldn't change the mode from
> > > datagram to connected from /sys/class.
> >
> > It's working as designed in that respect.  ehca does not implement
> > srq - without
> > srq, there is no way to prepost receive buffers for a reasonable number of
> > connections without running out of memory.
> >
> > So it is falling back on datagram mode.
> > Talk to ehca guys to implement srq and connected mode will be enabled.
>
> Don't remember SRQ is a MUST for UC mode. Does this patch support devices with
> SRQ in RC mode?

Yes. Only RC mode is supported by this patch.
From what you say I am guessing that SRQ is supported by ehca HW but support
is currently lacking in the ehca driver?

> > > And when unloading ib_ipoib module, all the connections to that node gone,
> > > rmmod ib_ipoib hung. The kernel is 2.6.19.
> >
> > Probably a bug in error handling somewhere.
> > Post the sysrq t trace and I'll take a look.
> 
> I will recreate the problem and post stack trace later.
> 
> Thanks
> Shirley Ma


-- 
MST


From dotanb at dev.mellanox.co.il  Sun Dec 17 02:03:40 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 17 Dec 2006 12:03:40 +0200
Subject: [openib-general] what should happen in a completion event
 channel is being destroyed when there are several CQs associated to it?
In-Reply-To: <21986.194.90.237.34.1163686615.squirrel@dev.mellanox.co.il>
References: <4553480F.80000@dev.mellanox.co.il> <adalkmkcrr9.fsf@cisco.com>
	<21986.194.90.237.34.1163686615.squirrel@dev.mellanox.co.il>
Message-ID: <458515FC.5050900@dev.mellanox.co.il>

Hi Roland.

dotanb at dev.mellanox.co.il wrote:
> Hi roland.
>
>   
>>  > What should happen in a completion event channel is being destroyed
>>  > when there are several CQs associated to it?
>>  > Should this operation fail (return EBUSY)?
>>
>> I think that would be the most consistent thing, since we return EBUSY
>> for example if a CQ is destroyed with QPs still attached.
>>
>>  > When i tried to do it and later on try to wait for a completion on
>>  > this event channel i got seg fault...
>>
>> Does the destroy succeed?
>>
>> Anyway I'll look at this code to see if it seems OK.
>>
>>  - R.
>>
>>     
> I'm writing the man pages to this verb, so which behaviour should i write
> the current behaviour or the future behaviour?
>
> for now, i'm writing the current behaviour.
>
> thanks
> Dotan
>   

Is there any update on this issue?
(if the answer is no, do you plan to change this behavior?)

thanks
Dotan
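
For the man page discussion, the EBUSY behaviour suggested in the quoted
reply could look roughly like the following. This is a sketch under stated
assumptions: struct comp_channel, its reference count and the three helper
functions are hypothetical and are not the libibverbs implementation.

```c
#include <errno.h>

/* Hypothetical model of a completion event channel. */
struct comp_channel {
	int refcnt;	/* number of CQs currently attached */
};

static void channel_attach_cq(struct comp_channel *ch)
{
	ch->refcnt++;
}

static void channel_detach_cq(struct comp_channel *ch)
{
	ch->refcnt--;
}

/* Refuse to destroy the channel while CQs are still attached to it.
 * Returns 0 on success or EBUSY if the channel is still in use. */
static int channel_destroy(struct comp_channel *ch)
{
	if (ch->refcnt > 0)
		return EBUSY;
	/* a real implementation would release the channel's resources here */
	return 0;
}
```

With this scheme the destroy call fails cleanly while CQs are attached,
instead of leaving them pointing at a freed channel.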


From tziporet at dev.mellanox.co.il  Sun Dec 17 02:50:55 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Sun, 17 Dec 2006 12:50:55 +0200
Subject: [openib-general] [PATCH 5/5 v3] 2.6.20 rdma/cma: export rdma cm
 interface to userspace
In-Reply-To: <4582DD77.8090208@ichips.intel.com>
References: <000801c714e3$257450a0$92cc180a@amr.corp.intel.com>
	<45816355.4010801@voltaire.com> <45819093.3090405@ichips.intel.com>
	<15ddcffd0612141251k6c9bfdfdg9060bf0e95f0657e@mail.gmail.com>
	<4581C4B5.5020702@ichips.intel.com>
	<15ddcffd0612142157y4cbf0423m874547269f78e395@mail.gmail.com>
	<4582DD77.8090208@ichips.intel.com>
Message-ID: <4585210F.1050106@dev.mellanox.co.il>

Sean Hefty wrote:
>> cool, before sending the orig email i was looking on both Arlin git
>> tree at ofa staging and the svn and the code that uses this calls are
>> still there, so where are the updated udapl sources?
>>     
>
> Arlin's DAPL tree has an rdma_ucm branch that should match.
>
> - Sean
>
>
>   
Arlin and Sean,
Can you make sure the code that is going into OFED 1.2 is in the place
our daily build takes it from:
librdmacm_git="git://staging.openfabrics.org/~shefty/librdmacm.git"
dapl_git="git://staging.openfabrics.org/~ardavis/dapl.git"

Thanks,
Tziporet


From dotanb at dev.mellanox.co.il  Sun Dec 17 04:24:42 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Sun, 17 Dec 2006 14:24:42 +0200
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
Message-ID: <4585370A.5000707@dev.mellanox.co.il>

Hi.

I noticed that low-level drivers from different vendors don't behave the
same way when there is an error.
For example:
    when ibv_post_send fails, libmthca returns -1
    when ibv_post_send fails, libehca returns -(errno value), such as
    -EINVAL or -ENOMEM
    (I didn't check the ipath code)

I wrote the man pages for libibverbs (which I hope will be committed
soon) and tried to describe the return values of the verbs.

I don't think that the described behavior of a verb should depend on
the HW being used.

If we are going to change the return values to a common behavior, I
suggest a scheme that gives more information to the user of the verbs
(IB-oriented errno values, perhaps?), or another method that gives the
user a hint about the problem. For example, when the user tries to
modify a QP with a bad value, the return value is EINVAL no matter
which of the attributes being modified is bad.

What do you think?
Dotan
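
Until the drivers agree on one convention, a caller that wants to be
portable has to cope with both styles described above. The following is
only a sketch: verbs_ret_to_errno() is a hypothetical helper, not part of
libibverbs, and it assumes that a driver returning -1 leaves the detail in
errno, which the message above does not actually state.

```c
#include <errno.h>

/*
 * Hypothetical helper (not a libibverbs function): fold the two return
 * conventions into one positive errno-style code.
 *
 *   ret == 0   success, returns 0
 *   ret == -1  "-1" convention; assumed to leave the cause in errno,
 *              so saved_errno is returned (EIO if it is unset)
 *   ret <  -1  "-(errno value)" convention, returns -ret
 *   ret >   0  already a positive errno value, returned unchanged
 *
 * Note the built-in ambiguity: on Linux -EPERM is also -1, so the two
 * conventions cannot be told apart reliably. That ambiguity is one more
 * argument for agreeing on a single convention.
 */
int verbs_ret_to_errno(int ret, int saved_errno)
{
    if (ret == 0)
        return 0;
    if (ret == -1)
        return saved_errno ? saved_errno : EIO;
    if (ret < 0)
        return -ret;
    return ret;
}
```

A caller would save errno immediately after the failing verb and pass both
values in, for example verbs_ret_to_errno(ret, errno).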


From sashak at voltaire.com  Sun Dec 17 04:50:52 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 17 Dec 2006 14:50:52 +0200
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
Message-ID: <20061217125052.GA2521@sashak.voltaire.com>


Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_sa_mcmember_record.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 31d1fb5..3fec8b9 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1418,8 +1418,11 @@ __osm_mcmr_rcv_leave_mgrp(
 
   mcmember_rec = *p_recvd_mcmember_rec;
 
-  if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) ) {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_mcmr_rcv_leave_mgrp: Dump of record\n" );
     osm_dump_mc_record( p_rcv->p_log, &mcmember_rec, OSM_LOG_DEBUG );
+  }
 
   CL_PLOCK_EXCL_ACQUIRE(p_rcv->p_lock);
   status = __get_mgrp_by_mgid(p_rcv,p_recvd_mcmember_rec, &p_mgrp);
-- 
1.4.4.1.gbfd3


From sashak at voltaire.com  Sun Dec 17 04:52:30 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 17 Dec 2006 14:52:30 +0200
Subject: [openib-general] [PATCH] opensm: sa mcmember_rec leave locking
Message-ID: <20061217125230.GB2521@sashak.voltaire.com>


Keep the lock held while processing a multicast group leave request
(MCMember Record). This prevents a race with multicast group join
requests, in which those requests can be reordered during processing.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_sa_mcmember_record.c |    2 +-
 osm/opensm/osm_sm.c                 |    3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_sa_mcmember_record.c b/osm/opensm/osm_sa_mcmember_record.c
index 3fec8b9..382dcff 100644
--- a/osm/opensm/osm_sa_mcmember_record.c
+++ b/osm/opensm/osm_sa_mcmember_record.c
@@ -1471,7 +1471,7 @@ __osm_mcmr_rcv_leave_mgrp(
         mcmember_rec.scope_state = p_mcm_port->scope_state;
 
         /* OK we can leave */
-        CL_PLOCK_RELEASE( p_rcv->p_lock );
+	/* note: osm_sm_mcgrp_leave() will release p_rcv->p_lock */
 
         status = osm_sm_mcgrp_leave(p_rcv->p_sm, mlid, portguid);
         if(status != IB_SUCCESS)
diff --git a/osm/opensm/osm_sm.c b/osm/opensm/osm_sm.c
index 70c3584..71fd847 100644
--- a/osm/opensm/osm_sm.c
+++ b/osm/opensm/osm_sm.c
@@ -776,7 +776,8 @@ osm_sm_mcgrp_leave(
    /*
     * Acquire the port object for the port leaving this group.
     */
-   CL_PLOCK_EXCL_ACQUIRE( p_sm->p_lock );
+   /* note: p_sm->p_lock is locked by caller, but will be released later
+      this function */
    p_port = ( osm_port_t * ) cl_qmap_get( &p_sm->p_subn->port_guid_tbl,
                                           port_guid );
    if( p_port ==
-- 
1.4.4.1.gbfd3
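
The lock hand-off that this patch introduces (the caller acquires the
lock, the callee releases it) can be shown in isolation. This is a minimal
sketch and not OpenSM code: the sketch_* names are invented for
illustration, and a pthread mutex stands in for the exclusive cl_plock.

```c
#include <pthread.h>

/* Illustrative stand-in for the SM state; not an OpenSM type. */
struct sketch_sm {
    pthread_mutex_t lock;
    int members;            /* members left in the multicast group */
};

/*
 * Callee: must be entered with sm->lock held by the caller.
 * It releases the lock itself before returning.
 */
int sketch_mcgrp_leave(struct sketch_sm *sm)
{
    int status = 0;

    if (sm->members > 0)
        sm->members--;
    else
        status = -1;

    pthread_mutex_unlock(&sm->lock);    /* the hand-off ends here */
    return status;
}

/*
 * Caller: the validation and the leave itself run in one continuous
 * critical section, so a concurrent join cannot slip in between them.
 */
int sketch_rcv_leave(struct sketch_sm *sm)
{
    pthread_mutex_lock(&sm->lock);

    if (sm->members == 0) {
        /* error path: nothing was handed off, so unlock here */
        pthread_mutex_unlock(&sm->lock);
        return -1;
    }

    /* no unlock here: sketch_mcgrp_leave() releases sm->lock */
    return sketch_mcgrp_leave(sm);
}
```

The cost of this pattern is that every exit path of the callee has to
release the lock, which is why the patch documents the hand-off in comments
on both sides.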


From glebn at voltaire.com  Sun Dec 17 07:42:41 2006
From: glebn at voltaire.com (glebn at voltaire.com)
Date: Sun, 17 Dec 2006 17:42:41 +0200
Subject: [openib-general] [RFC/BUG] libibverbs: DMA vs. CQ race
In-Reply-To: <ada8xhaq5ze.fsf@cisco.com>
References: <Pine.LNX.4.61.0612131626250.24974@localhost.localdomain>
	<ada8xhaq5ze.fsf@cisco.com>
Message-ID: <20061217154241.GD11360@minantech.com>

On Wed, Dec 13, 2006 at 11:41:41PM -0800, Roland Dreier wrote:
> Are there other possible ordering problems involving user memory (not
> in a CQ or QP)?  Something like a CPU on node A writing to memory on
> node B and then posting a work request that makes the HCA DMA from
> that memory on node B, and having the work request doorbell reach the
> HCA before the write to node B actually happens, so the HCA DMAs the
> old contents of node B's memory?
> 
> I guess the only feasible solution to the problem you're pointing out
> is to have libmthca use some special mmap()-based allocator for queues
> so that the kernel can give it memory that has the special
> dma_map_consistent treatment.
Do you think this should be part of mthca, or of some general framework like uio
that allows writing drivers in userspace?
Another solution could be to do something similar to ehca: it
allocates QPs and CQs in the kernel and maps them into the process address
space.

--
			Gleb.


From kliteyn at dev.mellanox.co.il  Sun Dec 17 23:30:13 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Mon, 18 Dec 2006 09:30:13 +0200
Subject: [openib-general] [PATCH] osm: Adding FatTree routing engine
	[2/2]
In-Reply-To: <1166196836.28709.188922.camel@hal.voltaire.com>
References: <4581DDFF.2000903@dev.mellanox.co.il>
	<1166196836.28709.188922.camel@hal.voltaire.com>
Message-ID: <45864385.1040105@dev.mellanox.co.il>

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Thu, 2006-12-14 at 18:27, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> This patch (2/2) adds Fat Tree routing engine to OpenSM.
> 
> Thanks! Applied.
> 
> I played with it a little and will look more at it going forward.
> 
> A couple of questions:
> 
> Is this algorithm currently considered experimental ?

I wouldn't say that it's experimental.
It's not perfect - there are still things to improve to make
it more efficient, but the routing itself will remain intact.

> Are there any simulator tests/regressions for this ?

There is a bunch of simulation tests for this engine,
but they're not integrated into the nightly simulation regression yet.
It's on my to-do list. 
 
> Also, could you or Eitan update doc/current-routing.txt with a
> description of the fat tree algorithm and send that patch to me ?

Sure.

-- Yevgeny
 
> -- Hal
> 


From devesh28 at gmail.com  Mon Dec 18 00:17:23 2006
From: devesh28 at gmail.com (Devesh Sharma)
Date: Mon, 18 Dec 2006 13:47:23 +0530
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <1166104604.28709.126501.camel@hal.voltaire.com>
References: <2875.47466.qm@web8317.mail.in.yahoo.com>
	<1166104604.28709.126501.camel@hal.voltaire.com>
Message-ID: <309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com>

Along similar lines, I am confused about MAD agent creation:
there is a function in mad.c, ib_agent_port_open(), which creates
_send_only_ MAD agents for GSI and SMI per port.

There is also a function in mthca_mad.c, mthca_create_agents(), which
_again_ creates two send-only MAD agents for SMI and GSI.

Why is this driver-specific agent creation required?

On 14 Dec 2006 08:57:11 -0500, Hal Rosenstock <halr at voltaire.com> wrote:
> On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote:
> > thanks for your reply,
> >
> > >The driver is needed to obtain the information for the IB node to
> > fill
> > >in the MADs for response to the SMA query. It may also issue some
> > traps.
> > >Similarly for PMA as well.
> >
> > Do u mean to say that HCA driver is needed to pass the HCA related
> > information (like GID, GUID, port_info etc..) to the SMA so that it
> > can reply to query(or GET ) MADs.
>
> Yes.
>
> >  Isn't SMA capable of doing the same by using "query_(gid, pkey,
> > port)" verbs.
>
> One reason I can think of is that not all the needed information is
> available via verbs. I think there are some others as well.
>
> > And final  questions  if it is really required to implement
> > 'process_mad' in HCA driver then why it is not specified in the IB
> > specifications.
>
> IB spec is architecture not implementation.
>
> > Whose duty is this (replying to query MADs) according to the IB
> > psec.s(its duty of SMA right?)
>
> Depends on the MAD but if you are referring to the SMA queries, then yes
> it is the SMA's responsibility.
>
> > I have observed that process_mad is not implemented in the IBM's eHCA
> > driver. what is the case with it?
>
> With eHCA, QP0 is not exposed to the host (at least currently) and the
> SMA is totally implemented in firmware.
>
> > PS: I am considering only SMA in the host s/w here.
>
> This is a design choice.
>
> -- Hal
>
> > regards,
> > K.Mahesh.
> >
> >
> >
> >
> > Hal Rosenstock <halr at voltaire.com> wrote:
> >         On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
> >         > Hello all,
> >         >
> >         > I want to know from u people that isi it necessary to
> >         implement the
> >         > process_mad for a HCA.
> >         >
> >         > After looking into the implementations of process_mad in
> >         ipath and
> >         > mthca drivers i have fount that they are used to reply the
> >         MADs with
> >         > port_info,gid_info,sm_info etc..
> >         >
> >         > But isn't it handled by SMA in the host......
> >
> >         The SMA can either be in the host on in firmware (as is
> >         typical with the
> >         Mellanox silicon).
> >
> >         > i am little bit confused now .
> >         > please just whether it is required to implement process_mad
> >         (suppose)
> >         > for new HCA driver....
> >
> >         It is. For an example of a host (software SMA), see
> >         drivers/infiniband/hw/ipath/ipath_mad.c
> >
> >         > if it is required why?
> >
> >         The driver is needed to obtain the information for the IB node
> >         to fill
> >         in the MADs for response to the SMA query. It may also issue
> >         some traps.
> >         Similarly for PMA as well.
> >
> >         -- Hal
> >
> >         > Please CC your replies to me.
> >         >
> >         > regards,
> >         > K.Mahesh.
> >         >
> >         > _______________________________________________
> >         > openib-general mailing list
> >         > openib-general at openib.org
> >         > http://openib.org/mailman/listinfo/openib-general
> >         >
> >         > To unsubscribe, please visit
> >         http://openib.org/mailman/listinfo/openib-general


From HNGUYEN at de.ibm.com  Mon Dec 18 01:22:29 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 18 Dec 2006 10:22:29 +0100
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <4585370A.5000707@dev.mellanox.co.il>
Message-ID: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>

Hi Dotan!
> I noticed that low level drivers from different vendors don't act the
> same when there is an error.
> For example:
>     when ibv_post_send fails, libmthca returns -1
>     when ibv_post_send fails, libehca returns -(errno value), such as:
> -EINVAL, -ENOMEM
>     (i didn't check the code of ipath)
>
> I wrote the man pages to the libibverbs (that i hope, soon will be
> committed), and tried to describe the
> return values of the verbs.
>
> I don't think that the description(behavior) of the verb need to be
> according to the HW which is being used ..
>
> If we are going to change to change the return values to a common
> behavior i suggest to use a way
> which will give more information to the user that uses the verbs (create
> IB oriented errno values(?)), or
> another method which will give the user a hint of the problem. for
> example: when the user try to modify a QP
> with a bad value there is an EINVAL return value for all of the values
> that he tries to modify ...
>
> What do you think?
Good point. I can speak for ehca only. We prefer to reuse existing
errno values rather than define new ones, since it is also a question of
how much information we want to give the consumer in case of an error
and what the consumer can actually handle. To me the defined errno
values give the caller enough information. Anyway, we should use the
same error codes for both kernel and user space verbs.
Regards
Nam


From mst at mellanox.co.il  Mon Dec 18 01:41:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 11:41:22 +0200
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
References: <4585370A.5000707@dev.mellanox.co.il>
	<OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
Message-ID: <20061218094122.GA3169@mellanox.co.il>

> Anyway we should use same error
> codes for both kernel and user space verbs.

This actually does not sound like a good idea.
In particular, returning -<errno> values, or encoding them in pointers
by means of ERR_PTR/PTR_ERR, is the standard in the Linux kernel but seems
quite nonstandard for a userspace library.

-- 
MST
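
For readers who have not met the kernel idiom mentioned above, here is a
userspace re-creation of it next to the usual libc convention. It is a
sketch only: the sketch_* names are made up, the helpers imitate the
kernel's err.h rather than copy it, and the 4095 limit is an assumption
taken from the kernel's MAX_ERRNO.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095   /* assumed to mirror the kernel's MAX_ERRNO */

/* Kernel-style helpers: a small negative errno is stored in the pointer. */
static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static inline int sketch_is_err(const void *p)
{
    /* the top SKETCH_MAX_ERRNO addresses are reserved for error codes */
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Kernel convention: failure is reported as an encoded -errno pointer. */
void *sketch_alloc_kernel_style(size_t n)
{
    void *p;

    if (n == 0)
        return sketch_err_ptr(-EINVAL);
    p = malloc(n);
    return p ? p : sketch_err_ptr(-ENOMEM);
}

/* Usual userspace convention: failure is NULL, with the cause in errno. */
void *sketch_alloc_user_style(size_t n)
{
    if (n == 0) {
        errno = EINVAL;
        return NULL;
    }
    return malloc(n);   /* POSIX malloc sets errno to ENOMEM on failure */
}
```

A userspace caller that checks only for NULL would treat the encoded error
pointer from the first function as a valid object, which is the practical
risk of carrying the kernel convention into a library.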


From HNGUYEN at de.ibm.com  Mon Dec 18 01:49:41 2006
From: HNGUYEN at de.ibm.com (Hoang-Nam Nguyen)
Date: Mon, 18 Dec 2006 10:49:41 +0100
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <20061218094122.GA3169@mellanox.co.il>
Message-ID: <OF29D7F649.B23189EF-ONC1257248.0035CC5E-C1257248.0035FCDF@de.ibm.com>


Hi Michael!
> > Anyway we should use same error
> > codes for both kernel and user space verbs.
>
> This actually does not sound like a good idea.
> In particular returning -<errno> values, or incoding them in pointers
> by means of PTR_ERR is the standard in linux kernel but seems quite
> nonstandard for a userspace library.
Oops, you're right in that case. I overlooked it.
Thx
Nam


From philippe_bernadat at hp.com  Mon Dec 18 03:18:44 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Mon, 18 Dec 2006 12:18:44 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <1166210069.28709.196688.camel@hal.voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>

I think I am going to need more help here.

I did use both tricks, opensm enable_quirks TRUE & rdma_cm
tavor_quirk=1.
This seems to have no effect.

But I may be doing something wrong. So some questions I have:

1) Doc (sdp_release_notes.txt, see below) says we can use either of the
two tricks, is it really the case ?

2) I usually don't run opensm (not required for me till now) and I am
not too familiar with it. But I did run it, so that I could try the
enable_quirks TRUE option. Does opensm run in the background? When I
run it, it never returns; the last messages are:
    >>>> -------------------------------------------------
    >>>> OpenSM Rev:openib-2.0.5
    >>>> Based on OpenIB svn Exported revision
    >>>>  Using Cached Option:guid = 0x0008f10403961e4d
    >>>>  Using Cached Option:log_flags = 3
    >>>>  Using Cached Option:enable_quirks = TRUE
    >>>> Command Line Arguments:
    >>>>  Log File: /var/log/osm.log
    >>>> -------------------------------------------------
    >>>> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision
    >>>> 
    >>>> Entering STANDBY state

3) Is there a way to change the MTU from within the lustre LND kernel
module? I saw that the IB perf programs do this with the modify_qp()
APIs.

4) And by the way, I can confirm that the MTU is the issue. Forcing it
to 2K with the ib_witre_perf test also degrades performance.
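
The MTU conclusion rests on simple arithmetic over the port counters
quoted further down in this message. A minimal sketch of that calculation
follows, taking the thread's statement that the data counters are in
32-bit units as given; avg_packet_bytes() is just an illustrative name.

```c
#include <stdint.h>

/*
 * Average bytes per packet, assuming the data counter counts 32-bit
 * words (as stated in the thread) and the packet counter counts packets.
 */
double avg_packet_bytes(uint64_t data_words, uint64_t packets)
{
    if (packets == 0)
        return 0.0;
    return (double)data_words * 4.0 / (double)packets;
}

/*
 * With the transmit counters quoted in this thread:
 *   VIB:  avg_packet_bytes(2684490382, 10280007) is about 1045 bytes
 *   OFED: avg_packet_bytes(2653730483, 5160009)  is about 2057 bytes
 * i.e. roughly 1K packets with VIB and 2K packets with OFED.
 */
```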


Extract from sdp_release_notes.txt

- By default, SDP utilizes a 2 Kbyte MTU size.  This may cause PCI-X cards
  using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth.
  Workaround:  reset the MTU size to 1K in this situation, using either of
  the two methods below:

  1. Activate the "tavor quirk" workaround in opensm:
     a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
          > opensm --cache-options -o
     b. Add the following line to /var/cache/osm/opensm.opts:
          enable_quirks TRUE
     c. Rerun opensm using your usual command line options to activate
        the opensm quirk option.

  2. Activate the "tavor quirk" workaround in cma:
       set the tavor_quirk module parameter of the rdma_cm module to value 1
       (default: 0).

Philippe

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Friday, December 15, 2006 8:15 PM
> To: Eitan Zahavi
> Cc: Matt L. Leininger; Roland Dreier; Bernadat, Philippe; 
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote:
> > Matt Leininger wrote:
> > > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
> > >   
> > >> I also looked at the HCA counters, and I indeed think 
> > >> there is something wrong about the MTU:
> > >>
> > >> For the same test
> > >>
> > >> With VIB
> > >>
> > >> PortXmitData:                  2684490382
> > >> PortRcvData:                      1750145
> > >> PortXmitPkts:                    10280007
> > >> PortRcvPkts:                        49962
> > >>
> > >> With OFED
> > >>
> > >> XmtBytes:........................2653730483
> > >> RcvBytes:........................1710541
> > >> XmtPkts:.........................5160009
> > >> RcvPkts:.........................50012
> > >>
> > >> Which means we sent half less packets with OFED 
> > >> and if you do the math it is 2K packets with OFED 
> (counters are 32bit
> > >> units)
> > >> and 1K packets with VIB.
> > >>
> > >> So fo some reason the tavor_quirk param is ignored/overwriten.
> > >> Is there an interface to control this ?
> > >>     
> > >
> > >   Michael said you have to turn on this feature in 
> OpenSM.  From the
> > > release notes I'm not sure how you turn it on in OpenSM.  
> You did turn
> > > on the tavor mtu work around in the rdma_cm, but did you 
> turn it on in
> > > OpenSM?  Also what version of OpenSM are you running?
> > >   
> > To turn this option on in opensm you need to:
> > 1. Run: opensm -c -o
> 
> If you already have an opensm.opts file then you can skip this step.
> 
> -- Hal
> 
> > 2. Modify the file /var/cache/osm/opensm.opts by changing 
> the line below
> > enable_quirks FALSE
> > to
> > enable_quirks TRUE
> > 
> > 3. Run: opensm
> > >   Thanks,
> > >
> > > 	- Matt
> > >
> > >   
> > >> Philippe
> > >>
> > >>     
> > >>> -----Original Message-----
> > >>> From: Bernadat, Philippe 
> > >>> Sent: Friday, December 15, 2006 8:59 AM
> > >>> To: Michael S. Tsirkin; Roland Dreier
> > >>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org
> > >>> Subject: RE: Performance Degradation with OFED v. 
> Voltaire (lustre)
> > >>>
> > >>> I have set tavor_quirk to 1 with no effect.
> > >>> Another thing I have tried is the same lustre 
> > >>> LNET echo test with a single thread (vs 8)
> > >>>
> > >>> VIB:      400 MB/s
> > >>> OFED-1.1: 333 MB/s
> > >>>
> > >>> I am posting the live param values for all infiniband 
> > >>> modules in case someone could identify some wrong setting:
> > >>>
> > >>> infiniband/core/ib_cm
> > >>>
> > >>> mra_timeout_limit              30000
> > >>>
> > >>> infiniband/core/rdma_cm
> > >>>
> > >>> max_cm_retries                    15
> > >>> tavor_quirk                        1
> > >>>
> > >>> infiniband/hw/ipath/ib_ipath
> > >>>
> > >>> cfgports                           0
> > >>> debug                              1
> > >>> disable_sma                        0
> > >>> kpiobufs                           0
> > >>> lkey_table_size                   12
> > >>> max_ahs                        65535
> > >>> max_cqes                      196607
> > >>> max_cqs                       131071
> > >>> max_mcast_grps                 16384
> > >>> max_mcast_qp_attached             16
> > >>> max_pds                        65535
> > >>> max_qps                        16384
> > >>> max_qp_wrs                     16383
> > >>> max_sges                          96
> > >>> max_srqs                        1024
> > >>> max_srq_sges                     128
> > >>> max_srq_wrs                   131071
> > >>> qp_table_size                    251
> > >>>
> > >>> infiniband/hw/mthca/ib_mthca
> > >>>
> > >>> catas_reset_disable                0
> > >>> debug_level                        0
> > >>> fmr_reserved_mtts             262144
> > >>> fw_cmd_doorbell                    0
> > >>> msi                                0
> > >>> msi_x                              1
> > >>> num_cq                         65536
> > >>> num_mcg                         8192
> > >>> num_mpt                       131072
> > >>> num_mtt                      1048576
> > >>> num_qp                         65536
> > >>> num_udav                       32768
> > >>> rdb_per_qp                         4
> > >>> tune_pci                           1
> > >>>
> > >>> infiniband/ulp/ipoib/ib_ipoib
> > >>>
> > >>> debug_level                        0
> > >>> mcast_debug_level                  0
> > >>> recv_queue_size                  128
> > >>> send_queue_size                   64
> > >>>
> > >>> Philippe
> > >>>
> > >>>       
> > >>>> -----Original Message-----
> > >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> > >>>> Sent: Thursday, December 14, 2006 6:32 PM
> > >>>> To: Roland Dreier
> > >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; 
> > >>>> openib-general at openib.org
> > >>>> Subject: Re: Performance Degradation with OFED v. Voltaire
> > >>>>
> > >>>>         
> > >>>>>  > I think Eric described the major differences earlier on, 
> > >>>>>           
> > >>>> here it is, see
> > >>>>         
> > >>>>>  > second half:
> > >>>>>
> > >>>>> OK, I forgot about that.
> > >>>>>
> > >>>>> I guess one last thing to check would be the MTU being used 
> > >>>>>           
> > >>>> for the RC
> > >>>>         
> > >>>>> connections.  Since this is PCI-X HW then the MTU should 
> > >>>>>           
> > >>> be 1024 for
> > >>>       
> > >>>>> best throughput (instead of the max MTU of 2048).
> > >>>>>           
> > >>>> The MTU issue is described in the OFED release notes.
> > >>>> You must turn the Tavor work-around for it on in opensm.
> > >>>> This was introduced late in release cycle to it was 
> deemed safer
> > >>>> to make it off by default.
> > >>>>
> > >>>> By the way, Eitan, Hal, can we turn this on by default now?
> > >>>> This was we'll get more feedback from people, and 
> we'll still have
> > >>>> time to turn it off before release if this unexpectedly 
> > >>>> creates issues.
> > >>>>
> > >>>> -- 
> > >>>> MST
> > >>>>
> > >>>>         


From ogerlitz at voltaire.com  Mon Dec 18 03:19:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 13:19:05 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061214173145.GC12781@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055380F3@idaexc03.emea.cpqcorp.net>
	<ada3b7io2w3.fsf@cisco.com> <20061214173145.GC12781@mellanox.co.il>
Message-ID: <45867929.4080300@voltaire.com>

Michael S. Tsirkin wrote:
>> I guess one last thing to check would be the MTU being used for the RC
>> connections.  Since this is PCI-X HW then the MTU should be 1024 for
>> best throughput (instead of the max MTU of 2048).

> The MTU issue is described in the OFED release notes.
> You must turn the Tavor work-around for it on in opensm.
> This was introduced late in release cycle to it was deemed safer
> to make it off by default.

Michael,

Let me see if I follow you correctly: a user must enable the tavor quirk in
**openSM**? What about the cma tavor_quirk? And what about users
who want to use OFED with commercial/3rd party SMs? The
OFED 1.1 docs mention that either way should work.

Looking at kernel_patches/fixes/cma_tavor_quirk.patch of OFED 1.1
(below), the patch seems incomplete to me, as the
IB_SA_PATH_REC_MTU_SELECTOR and IB_SA_PATH_REC_MTU bits are not set in
the component mask of the path record query done by the cma. Am I
missing something?

Or.

> Tavor systems get better performance with 1K MTU. Since there does
> not seem to be any way to find out whether the remote system uses Tavor,
> add an option to limit the MTU globally.
> 
> Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
> 
> Index: linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c
> ===================================================================
> --- linux-2.6.18-rc2-devel.orig/drivers/infiniband/core/cma.c   2006-09-11 16:01:37.000000000 +0300
> +++ linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c        2006-09-13 18:51:45.000000000 +0300
> @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
>  MODULE_DESCRIPTION("Generic RDMA CM Agent");
>  MODULE_LICENSE("Dual BSD/GPL");
> 
> +static int tavor_quirk = 0;
> +module_param_named(tavor_quirk, tavor_quirk, int, 0644);
> +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0");
> +
>  #define CMA_CM_RESPONSE_TIMEOUT 20
>  #define CMA_MAX_CM_RETRIES 3
> 
> @@ -1123,6 +1127,11 @@ static int cma_query_ib_route(struct rdm
>         path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
>         path_rec.numb_path = 1;
> 
> +       if (tavor_quirk) {
> +               path_rec.mtu_selector = IB_SA_LT;
> +               path_rec.mtu = IB_MTU_2048;
> +       }
> +
>         id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
>                                 id_priv->id.port_num, &path_rec,
>                                 IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |


From eitan at sw053.yok.mtl.com  Mon Dec 18 03:19:31 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Mon, 18 Dec 2006 13:19:31 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal completion
Message-ID: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 
ibutils rev = Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1
Total=221 Pass=219 Fail=2

Pass:
31 LidMgr IS1-16.topo
30 Stability IS1-16.topo
30 Pkey IS1-16.topo
30 Multicast IS1-16.topo
29 OsmStress IS1-16.topo
10 Stability IS3-loop.topo
10 Stability IS3-128.topo
10 Pkey IS3-128.topo
10 Multicast IS3-loop.topo
10 Multicast IS3-128.topo
10 LidMgr IS3-128.topo
9 OsmStress IS3-128.topo

Failures:
1 OsmStress IS3-128.topo
1 OsmStress IS1-16.topo


From mst at mellanox.co.il  Mon Dec 18 03:35:02 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 13:35:02 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <45867929.4080300@voltaire.com>
References: <45867929.4080300@voltaire.com>
Message-ID: <20061218113502.GB3169@mellanox.co.il>

> >> I guess one last thing to check would be the MTU being used for the RC
> >> connections.  Since this is PCI-X HW then the MTU should be 1024 for
> >> best throughput (instead of the max MTU of 2048).
> 
> > The MTU issue is described in the OFED release notes.
> > You must turn the Tavor work-around for it on in opensm.
> > This was introduced late in release cycle to it was deemed safer
> > to make it off by default.
> 
> Michael,
> 
> Let me see i follow you correct: a user must enable the tavor quirk in 
> the **openSM** ? what about the cma_tavor_quirk? and what about users 
> who want to use OFED with commercial/3rd party SMs ??? looking in the 
> OFED 1.1 docs it is mentioned that either way should work.

Right. But CMA quirk can only work if OFED CMA is initiating the connection
from Tavor (i.e. it can't handle Arbel->Tavor case).
Enabling this in Opensm solves the problem for all ULPs, and in all cases -
whether Tavor is active or passive side in the connection.
So fixing this in the SM is clearly the best solution.

Further, as you point out the cma quirk patch in OFED looks broken :(.

> Looking on kernel_patches/fixes/cma_tavor_quirk.patch of OFED 1.1 
> (below) the thing seems to me uncompleted as the 
> IB_SA_PATH_REC_MTU_SELECTOR and IB_SA_PATH_REC_MTU bits are not set in 
> the component mask of the path record query done by the cma, am i 
> missing something?
> 
> Or.

Correct, looks like that bit is missing.

> Tavor systems get better performance with 1K MTU. Since there does
> not seem to be any way to find out whether the remote system uses Tavor,
> add an option to limit the MTU globally.
> 
> Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
> 
> Index: linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c
> ===================================================================
> --- linux-2.6.18-rc2-devel.orig/drivers/infiniband/core/cma.c   2006-09-11 16:01:37.000000000 +0300
> +++ linux-2.6.18-rc2-devel/drivers/infiniband/core/cma.c        2006-09-13 18:51:45.000000000 +0300
> @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
>  MODULE_DESCRIPTION("Generic RDMA CM Agent");
>  MODULE_LICENSE("Dual BSD/GPL");
> 
> +static int tavor_quirk = 0;
> +module_param_named(tavor_quirk, tavor_quirk, int, 0644);
> +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0");
> +
>  #define CMA_CM_RESPONSE_TIMEOUT 20
>  #define CMA_MAX_CM_RETRIES 3
> 
> @@ -1123,6 +1127,11 @@ static int cma_query_ib_route(struct rdm
>         path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
>         path_rec.numb_path = 1;
> 
> +       if (tavor_quirk) {
> +               path_rec.mtu_selector = IB_SA_LT;
> +               path_rec.mtu = IB_MTU_2048;
> +       }
> +
>         id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
>                                 id_priv->id.port_num, &path_rec,
>                                 IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |

-- 
MST


From mst at mellanox.co.il  Mon Dec 18 03:37:50 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 13:37:50 +0200
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
References: <1166210069.28709.196688.camel@hal.voltaire.com>
	<3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
Message-ID: <20061218113750.GC3169@mellanox.co.il>

The cma quirk does not seem to work.
Enabling the opensm quirk should work, and should be sufficient.
However, you seem to be running another SM on your fabric (on your switch?),
which is why opensm enters STANDBY. Disable the other SM and try again.


Quoting r. Bernadat, Philippe <philippe_bernadat at hp.com>:
Subject: Re: Performance Degradation with OFED v. Voltaire(lustre)

I think I am going to need more help here.

I did use both tricks, opensm enable_quirks TRUE & rdma_cm
tavor_quirk=1.
This seems to have no effect.

But I may be doing something wrong. So some questions I have:

1) Doc (sdp_release_notes.txt, see below) says we can use either of the
two tricks, is it really the case ?

2) I usually don't run opensm (I have not needed it until now) and I am
not too familiar with it. But I did run it, so that I could try the
enable_quirks TRUE option. Is opensm supposed to run in the background?
When I run it, it never returns; the last messages are:
    >>>> -------------------------------------------------
    >>>> OpenSM Rev:openib-2.0.5
    >>>> Based on OpenIB svn Exported revision
    >>>>  Using Cached Option:guid = 0x0008f10403961e4d
    >>>>  Using Cached Option:log_flags = 3
    >>>>  Using Cached Option:enable_quirks = TRUE
    >>>> Command Line Arguments:
    >>>>  Log File: /var/log/osm.log
    >>>> -------------------------------------------------
    >>>> OpenSM Rev:openib-2.0.5 OpenIB svn Exported revision
    >>>> 
    >>>> Entering STANDBY state

3) Is there a way to change the MTU from within the lustre LND kernel
module? I saw that the IB perf programs do this with the modify_qp()
APIs.

4) And by the way, I can confirm that the MTU is the issue. Forcing it
to 2K with the ib_witre_perf test also degrades performance.


Extract from sdp_release_notes.txt

- By default, SDP utilizes a 2 Kbyte MTU size.  This may cause PCI-X cards
  using Mellanox Technologies "Infinihost" HCAs to experience low bandwidth.
  Workaround:  reset the MTU size to 1K in this situation, using either of
  the two methods below:

  1. Activate the "tavor quirk" workaround in opensm:
     a. Create an opensm options cache file (/var/cache/osm/opensm.opts):
          > opensm --cache-options -o
     b. Add the following line to /var/cache/osm/opensm.opts:
          enable_quirks TRUE
     c. Rerun opensm using your usual command line options to activate
        the opensm quirk option.

  2. Activate the "tavor quirk" workaround in cma:
       set the tavor_quirk module parameter of the rdma_cm module to value 1
       (default: 0).

Philippe

> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Friday, December 15, 2006 8:15 PM
> To: Eitan Zahavi
> Cc: Matt L. Leininger; Roland Dreier; Bernadat, Philippe; 
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> On Fri, 2006-12-15 at 12:20, Eitan Zahavi wrote:
> > Matt Leininger wrote:
> > > On Fri, 2006-12-15 at 09:44 +0100, Bernadat, Philippe wrote:
> > >   
> > >> I also looked at the HCA counters, and I indeed think 
> > >> there is something wrong about the MTU:
> > >>
> > >> For the same test
> > >>
> > >> With VIB
> > >>
> > >> PortXmitData:                  2684490382
> > >> PortRcvData:                      1750145
> > >> PortXmitPkts:                    10280007
> > >> PortRcvPkts:                        49962
> > >>
> > >> With OFED
> > >>
> > >> XmtBytes:........................2653730483
> > >> RcvBytes:........................1710541
> > >> XmtPkts:.........................5160009
> > >> RcvPkts:.........................50012
> > >>
> > >> Which means we sent half less packets with OFED 
> > >> and if you do the math it is 2K packets with OFED 
> (counters are 32bit
> > >> units)
> > >> and 1K packets with VIB.
> > >>
> > >> So fo some reason the tavor_quirk param is ignored/overwriten.
> > >> Is there an interface to control this ?
> > >>     
> > >
> > >   Michael said you have to turn on this feature in 
> OpenSM.  From the
> > > release notes I'm not sure how you turn it on in OpenSM.  
> You did turn
> > > on the tavor mtu work around in the rdma_cm, but did you 
> turn it on in
> > > OpenSM?  Also what version of OpenSM are you running?
> > >   
> > To turn this option on in opensm you need to:
> > 1. Run: opensm -c -o
> 
> If you already have an opensm.opts file then you can skip this step.
> 
> -- Hal
> 
> > 2. Modify the file /var/cache/osm/opensm.opts by changing 
> the line below
> > enable_quirks FALSE
> > to
> > enable_quirks TRUE
> > 
> > 3. Run: opensm
> > >   Thanks,
> > >
> > > 	- Matt
> > >
> > >   
> > >> Philippe
> > >>
> > >>     
> > >>> -----Original Message-----
> > >>> From: Bernadat, Philippe 
> > >>> Sent: Friday, December 15, 2006 8:59 AM
> > >>> To: Michael S. Tsirkin; Roland Dreier
> > >>> Cc: Eitan Zahavi; Hal Rosenstock; openib-general at openib.org
> > >>> Subject: RE: Performance Degradation with OFED v. 
> Voltaire (lustre)
> > >>>
> > >>> I have set tavor_quirk to 1 with no effect.
> > >>> Another thing I have tried is the same lustre 
> > >>> LNET echo test with a single thread (vs 8)
> > >>>
> > >>> VIB:      400 MB/s
> > >>> OFED-1.1: 333 MB/s
> > >>>
> > >>> I am posting the live param values for all infiniband 
> > >>> modules in case someone could identify some wrong setting:
> > >>>
> > >>> infiniband/core/ib_cm
> > >>>
> > >>> mra_timeout_limit              30000
> > >>>
> > >>> infiniband/core/rdma_cm
> > >>>
> > >>> max_cm_retries                    15
> > >>> tavor_quirk                        1
> > >>>
> > >>> infiniband/hw/ipath/ib_ipath
> > >>>
> > >>> cfgports                           0
> > >>> debug                              1
> > >>> disable_sma                        0
> > >>> kpiobufs                           0
> > >>> lkey_table_size                   12
> > >>> max_ahs                        65535
> > >>> max_cqes                      196607
> > >>> max_cqs                       131071
> > >>> max_mcast_grps                 16384
> > >>> max_mcast_qp_attached             16
> > >>> max_pds                        65535
> > >>> max_qps                        16384
> > >>> max_qp_wrs                     16383
> > >>> max_sges                          96
> > >>> max_srqs                        1024
> > >>> max_srq_sges                     128
> > >>> max_srq_wrs                   131071
> > >>> qp_table_size                    251
> > >>>
> > >>> infiniband/hw/mthca/ib_mthca
> > >>>
> > >>> catas_reset_disable                0
> > >>> debug_level                        0
> > >>> fmr_reserved_mtts             262144
> > >>> fw_cmd_doorbell                    0
> > >>> msi                                0
> > >>> msi_x                              1
> > >>> num_cq                         65536
> > >>> num_mcg                         8192
> > >>> num_mpt                       131072
> > >>> num_mtt                      1048576
> > >>> num_qp                         65536
> > >>> num_udav                       32768
> > >>> rdb_per_qp                         4
> > >>> tune_pci                           1
> > >>>
> > >>> infiniband/ulp/ipoib/ib_ipoib
> > >>>
> > >>> debug_level                        0
> > >>> mcast_debug_level                  0
> > >>> recv_queue_size                  128
> > >>> send_queue_size                   64
> > >>>
> > >>> Philippe
> > >>>
> > >>>       
> > >>>> -----Original Message-----
> > >>>> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> > >>>> Sent: Thursday, December 14, 2006 6:32 PM
> > >>>> To: Roland Dreier
> > >>>> Cc: Bernadat, Philippe; Eitan Zahavi; Hal Rosenstock; 
> > >>>> openib-general at openib.org
> > >>>> Subject: Re: Performance Degradation with OFED v. Voltaire
> > >>>>
> > >>>>         
> > >>>>>  > I think Eric described the major differences earlier on, 
> > >>>>>           
> > >>>> here it is, see
> > >>>>         
> > >>>>>  > second half:
> > >>>>>
> > >>>>> OK, I forgot about that.
> > >>>>>
> > >>>>> I guess one last thing to check would be the MTU being used 
> > >>>>>           
> > >>>> for the RC
> > >>>>         
> > >>>>> connections.  Since this is PCI-X HW then the MTU should 
> > >>>>>           
> > >>> be 1024 for
> > >>>       
> > >>>>> best throughput (instead of the max MTU of 2048).
> > >>>>>           
> > >>>> The MTU issue is described in the OFED release notes.
> > >>>> You must turn the Tavor work-around for it on in opensm.
> > >>>> This was introduced late in release cycle to it was 
> deemed safer
> > >>>> to make it off by default.
> > >>>>
> > >>>> By the way, Eitan, Hal, can we turn this on by default now?
> > >>>> This was we'll get more feedback from people, and 
> we'll still have
> > >>>> time to turn it off before release if this unexpectedly 
> > >>>> creates issues.
> > >>>>
> > >>>> -- 
> > >>>> MST
> > >>>>
> > >>>>         
> > >> _______________________________________________
> > >> openib-general mailing list
> > >> openib-general at openib.org
> > >> http://openib.org/mailman/listinfo/openib-general
> > >>
> > >> To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> > >>
> > >>     
> > >
> > > _______________________________________________
> > > openib-general mailing list
> > > openib-general at openib.org
> > > http://openib.org/mailman/listinfo/openib-general
> > >
> > > To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> > >   
> > 
> > 
> > _______________________________________________
> > openib-general mailing list
> > openib-general at openib.org
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > To unsubscribe, please visit 
> http://openib.org/mailman/listinfo/openib-general
> > 
> 
> 


-- 
MST


From dotanb at dev.mellanox.co.il  Mon Dec 18 03:47:09 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 18 Dec 2006 13:47:09 +0200
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
References: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
Message-ID: <45867FBD.9040300@dev.mellanox.co.il>

Hi Hoang-Nam.

Hoang-Nam Nguyen wrote:
> Hi Dotan!
>   
> Good point. I can speak for ehca only. We prefer to reuse existing
> errno values and not to define new ones as it's also a question of
> how much information we want to tell the consumer in case of error
> and what it can handle for. To me the defined errno values give
> enough information to caller. Anyway we should use same error
> codes for both kernel and user space verbs.
> Regards
> Nam
>   
I think the drivers should have 2 modes:
mode 1 (release mode): return "standard" errno values
mode 2 (debug mode): return "IB oriented" values
This can be done at compile time, for example:
#ifdef IB_DEBUG
#define IB_EINVAL_MTU 1000
#define IB_EINVAL_LID    1001
#else
#define IB_EINVAL_MTU EINVAL
#define IB_EINVAL_LID    EINVAL
#endif
 
This way, a debug driver can help developers find out what the problem
is when an error occurs.

Anyway, we need to decide on a common behavior for all low level drivers.
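
As a rough sketch of how a driver could use such macros: the helper
check_path_attr() and the checks inside it are hypothetical and not taken
from any existing driver. The point is only that the caller sees a plain
-EINVAL in a release build and a more specific code in a debug build.

```c
#include <assert.h>
#include <errno.h>

#ifdef IB_DEBUG
#define IB_EINVAL_MTU 1000
#define IB_EINVAL_LID 1001
#else
#define IB_EINVAL_MTU EINVAL
#define IB_EINVAL_LID EINVAL
#endif

/* Hypothetical validation helper: returns 0 or a negative error code. */
static int check_path_attr(int mtu_enum, unsigned int dlid)
{
	/* IB MTU encodings 1..5 stand for 256..4096 bytes. */
	if (mtu_enum < 1 || mtu_enum > 5)
		return -IB_EINVAL_MTU;
	/* LID 0 is reserved and never a valid destination. */
	if (dlid == 0)
		return -IB_EINVAL_LID;
	return 0;
}

/* Holds in both build modes; only the numeric values differ. */
static void example_usage(void)
{
	assert(check_path_attr(3, 1) == 0);
	assert(check_path_attr(9, 1) == -IB_EINVAL_MTU);
	assert(check_path_attr(3, 0) == -IB_EINVAL_LID);
}
```

Building with -DIB_DEBUG makes the two failures distinguishable (1000 vs
1001); without it both collapse to EINVAL.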

thanks
Dotan


From ogerlitz at voltaire.com  Mon Dec 18 04:03:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 14:03:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061218113502.GB3169@mellanox.co.il>
References: <45867929.4080300@voltaire.com>
	<20061218113502.GB3169@mellanox.co.il>
Message-ID: <4586838A.3040500@voltaire.com>

Philippe, can you try this patch? I have problems setting up a
compilation environment right now, but it should work.

Unpack OFED 1.1, copy the patch to
OFED-1.1/openib-1.1/kernel_patches/fixes/xxx_cma_tavor_quirk.txt,
then repack OFED 1.1 and rebuild.

Or.

> Index: openib-1.1/drivers/infiniband/core/cma.c
> ===================================================================
> --- openib-1.1.orig/drivers/infiniband/core/cma.c       2006-12-18 13:27:45.213587734 +0200
> +++ openib-1.1/drivers/infiniband/core/cma.c    2006-12-18 13:34:24.921455159 +0200
> @@ -1117,6 +1117,9 @@ static void cma_query_handler(int status
>         route = &work->id->id.route;
> 
>         if (!status) {
> +               /* XXX - if returned path MTU is 2K force it to be 1K */
> +               if(path_rec->mtu == IB_MTU_2048)
> +                       path_rec->mtu = IB_MTU_1024;
>                 route->num_paths = 1;
>                 *route->path_rec = *path_rec;
>         } else {


From halr at voltaire.com  Mon Dec 18 04:09:02 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:09:02 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
In-Reply-To: <20061217125052.GA2521@sashak.voltaire.com>
References: <20061217125052.GA2521@sashak.voltaire.com>
Message-ID: <1166443684.32666.178303.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:50, Sasha Khapyorsky wrote:
> Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>


From mst at mellanox.co.il  Mon Dec 18 04:23:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 14:23:22 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <4586838A.3040500@voltaire.com>
References: <4586838A.3040500@voltaire.com>
Message-ID: <20061218122322.GD3169@mellanox.co.il>

Setting selectors for path query would be cleaner, no?

Quoting r. Or Gerlitz <ogerlitz at voltaire.com>:
Subject: Re: [openib-general] Performance Degradation with OFED v. Voltaire

Philippe, can you try this patch, i have problems with setting a 
compilation env now but it should work.

unpack OFED 1.1, copy this to 
OFED-1.1/openib-1.1/kernel_patches/fixes/xxx_cma_tavor_quirk.txt
and then pack OFED 1.1 and rebuild

Or.

> Index: openib-1.1/drivers/infiniband/core/cma.c
> ===================================================================
> --- openib-1.1.orig/drivers/infiniband/core/cma.c       2006-12-18 13:27:45.213587734 +0200
> +++ openib-1.1/drivers/infiniband/core/cma.c    2006-12-18 13:34:24.921455159 +0200
> @@ -1117,6 +1117,9 @@ static void cma_query_handler(int status
>         route = &work->id->id.route;
> 
>         if (!status) {
> +               /* XXX - if returned path MTU is 2K force it to be 1K */
> +               if(path_rec->mtu == IB_MTU_2048)
> +                       path_rec->mtu = IB_MTU_1024;
>                 route->num_paths = 1;
>                 *route->path_rec = *path_rec;
>         } else {

-- 
MST


From halr at voltaire.com  Mon Dec 18 04:20:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:20:51 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: better log message.
In-Reply-To: <20061217125052.GA2521@sashak.voltaire.com>
References: <20061217125052.GA2521@sashak.voltaire.com>
Message-ID: <1166444405.32666.178789.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:50, Sasha Khapyorsky wrote:
> Better log message for mcrecord dumping in __osm_mcmr_rcv_leave_mgrp().
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Mon Dec 18 04:43:50 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 07:43:50 -0500
Subject: [openib-general] [PATCH] opensm: sa mcmember_rec leave locking
In-Reply-To: <20061217125230.GB2521@sashak.voltaire.com>
References: <20061217125230.GB2521@sashak.voltaire.com>
Message-ID: <1166444463.32666.178845.camel@hal.voltaire.com>

On Sun, 2006-12-17 at 07:52, Sasha Khapyorsky wrote:
> Hold locked multicast group leave request (MCMember Record) processing.
> This prevents kind of race with multicast group join request where
> those requests can be reordered during processing.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From ogerlitz at voltaire.com  Mon Dec 18 04:49:19 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 14:49:19 +0200
Subject: [openib-general] Performance Degradation with OFED v. Voltaire
In-Reply-To: <20061218122322.GD3169@mellanox.co.il>
References: <4586838A.3040500@voltaire.com>
	<20061218122322.GD3169@mellanox.co.il>
Message-ID: <45868E4F.9020708@voltaire.com>

Michael S. Tsirkin wrote:
> Setting selectors for path query would be cleaner, no?

Yes, but first I want to hard-code it and see whether the performance
difference goes away, and only then productize it...

Or.


From ogerlitz at voltaire.com  Mon Dec 18 05:03:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 18 Dec 2006 15:03:22 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E0557100D@idaexc03.emea.cpqcorp.net>
Message-ID: <4586919A.7060000@voltaire.com>

Bernadat, Philippe wrote:
> 3) Is there a way to change the MTU from within the lustre LND kernel
> module. I saw that the IB perf programs did this with the modify_qp()
> APIs.

Yes. Go to the place where the active side of the lustre LND gets the
RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id, and there set

	lustre_id->route.path_rec->mtu = IB_MTU_1024;
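
A small userspace mock of that idea, in case it helps. All types and names
below (mock_cm_id, cap_path_mtu and so on) are made up for illustration and
only imitate the shape of the rdma cm structures; this is not the lustre LND
code. It caps the MTU only when the route has actually been resolved and
leaves smaller MTUs alone.

```c
#include <assert.h>
#include <stddef.h>

/* Mock stand-ins for the kernel types; values follow the IB MTU encoding. */
enum mock_mtu {
	MOCK_MTU_256 = 1, MOCK_MTU_512, MOCK_MTU_1024,
	MOCK_MTU_2048, MOCK_MTU_4096
};
enum mock_event { MOCK_ADDR_RESOLVED, MOCK_ROUTE_RESOLVED };

struct mock_path_rec { int mtu; };
struct mock_route    { struct mock_path_rec *path_rec; int num_paths; };
struct mock_cm_id    { struct mock_route route; };

/*
 * Cap the resolved path MTU at 1K. Intended to be called from the event
 * handler on the active side, before the QP is created.
 */
static void cap_path_mtu(struct mock_cm_id *id, enum mock_event ev)
{
	assert(id != NULL);

	/* Nothing to do until a route (and so a path record) exists. */
	if (ev != MOCK_ROUTE_RESOLVED)
		return;
	if (id->route.num_paths < 1 || id->route.path_rec == NULL)
		return;

	if (id->route.path_rec->mtu > MOCK_MTU_1024)
		id->route.path_rec->mtu = MOCK_MTU_1024;
}
```

Comparing with ">" rather than testing for 2048 only is a choice made for
this sketch, so that a 4K path would be capped as well.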

Or.


From sashak at voltaire.com  Mon Dec 18 05:30:32 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 18 Dec 2006 15:30:32 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal
	completion
In-Reply-To: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com>
References: <200612181119.kBIBJVLN029482@sw053.yok.mtl.com>
Message-ID: <20061218133032.GC4808@sashak.voltaire.com>

Hi Eitan,

On 13:19 Mon 18 Dec     , Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 
> ibutils rev = Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1
> Total=221 Pass=219 Fail=2
> 
> Pass:
> 31 LidMgr IS1-16.topo
> 30 Stability IS1-16.topo
> 30 Pkey IS1-16.topo
> 30 Multicast IS1-16.topo
> 29 OsmStress IS1-16.topo
> 10 Stability IS3-loop.topo
> 10 Stability IS3-128.topo
> 10 Pkey IS3-128.topo
> 10 Multicast IS3-loop.topo
> 10 Multicast IS3-128.topo
> 10 LidMgr IS3-128.topo
> 9 OsmStress IS3-128.topo
> 
> Failures:
> 1 OsmStress IS3-128.topo
> 1 OsmStress IS1-16.topo

Is it possible to get more details about the failures (in case they are
real failures)? Perhaps the logs could be uploaded somewhere?

Sasha


From eitan at mellanox.co.il  Mon Dec 18 05:33:50 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 15:33:50 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal
	completion
Message-ID: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>

Hi Sasha,

The failure analysis takes time and is manual...
The logs and related files are pretty big and will take space to upload.

Today I simulated with OpenSM that was compiled on the side (my bad -
should have incorporated my patches on the clone but I was not sure this
is not going to "contaminate" that git tree forever) with the fixes for
DONE/DONE_PENDING. 

The tests that failed today are actually false violations:
1. The IS1-16 test failed due to a lack of free sockets to connect to the
server. It is still not clear why. I will increase the number of sockets the
client/server try to connect on.
2. The IS3-128 test failed due to the temporary replacement of opensm with
the one that has my fixes for DONE/DONE_PENDING. This was a manual mistake I
made when compiling the "clone". As I was watching the log I
noticed that the same wrong signal was happening.

BTW: The DONE/DONE_PENDING bug was discovered through a change I made in the
simulator dispatcher. The change introduced a bug that overloaded the
machine with a busy loop in the simulator dispatcher.
Apparently this produced different timing and exposed these bugs.

EZ

> -----Original Message-----
> From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> Sent: Monday, December 18, 2006 3:31 PM
> To: Eitan Zahavi
> Cc: Eitan Zahavi; Yevgeny Kliteynik; halr at voltaire.com; openib-
> general at openib.org
> Subject: Re: nightly osm_sim report 2006-12-18:normal completion
> 
> Hi Eitan,
> 
> On 13:19 Mon 18 Dec     , Eitan Zahavi wrote:
> > OSM Simulation Regression Summary
> > OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 ibutils rev =
> > Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1
> > Total=221 Pass=219 Fail=2
> >
> > Pass:
> > 31 LidMgr IS1-16.topo
> > 30 Stability IS1-16.topo
> > 30 Pkey IS1-16.topo
> > 30 Multicast IS1-16.topo
> > 29 OsmStress IS1-16.topo
> > 10 Stability IS3-loop.topo
> > 10 Stability IS3-128.topo
> > 10 Pkey IS3-128.topo
> > 10 Multicast IS3-loop.topo
> > 10 Multicast IS3-128.topo
> > 10 LidMgr IS3-128.topo
> > 9 OsmStress IS3-128.topo
> >
> > Failures:
> > 1 OsmStress IS3-128.topo
> > 1 OsmStress IS1-16.topo
> 
> Is it possible to have more details about failures (in case when it is
real
> failures)? Probably to upload the logs to somewhere?
> 
> Sasha


From mst at mellanox.co.il  Mon Dec 18 05:40:10 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 15:40:10 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal
	completion
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
Message-ID: <20061218134010.GE3169@mellanox.co.il>

> should have incorporated my patches on the clone but I was not sure this
> is not going to "contaminate" that git tree forever

No, you can always git rebase to move your patches to the top of the pile,
or just git reset to revert to upstream version.
Just don't do this for a tree someone else might have cloned and based his
development on.

-- 
MST


From philippe_bernadat at hp.com  Mon Dec 18 05:42:21 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Mon, 18 Dec 2006 14:42:21 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <4586919A.7060000@voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E0557115F@idaexc03.emea.cpqcorp.net>

I have tried both fixes; neither of them improves performance...
Let me check the number of packets again.
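
For reference, the check is the same arithmetic as earlier in the thread:
taking the data counters as 32-bit units, the average packet size is
4 * data / packets. A tiny sketch (the function name is just for
illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Average bytes per packet, assuming the data counter is in 32-bit words. */
static double avg_bytes_per_packet(uint64_t data_words, uint64_t packets)
{
	assert(packets > 0);
	return (double)data_words * 4.0 / (double)packets;
}
```

With the transmit counters posted earlier (2684490382 / 10280007 for VIB,
2653730483 / 5160009 for OFED) this gives roughly 1045 and 2057 bytes per
packet, which is consistent with a 1K MTU on VIB and a 2K MTU on OFED.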

Philippe 

> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> Sent: Monday, December 18, 2006 2:03 PM
> To: Bernadat, Philippe
> Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> Bernadat, Philippe wrote:
> > 3) Is there a way to change the MTU from within the lustre 
> LND kernel
> > module. I saw that the IB perf programs did this with the 
> modify_qp()
> > APIs.
> 
> yes, go to the place where the lustre NLD active side gets 
> RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set
> 
> 	lustre_id->route->path_rec->mtu = IB_MTU_1024;
> 
> Or.
> 
> 
> 
> 


From philippe_bernadat at hp.com  Mon Dec 18 06:09:19 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Mon, 18 Dec 2006 15:09:19 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net>

Or,

I did manage to fix it my way, by inserting the same assignment

route->path_rec->mtu = IB_MTU_1024;

in each of these places:

Before/after qp creation
Before/after accept
Before/after connect

So I am not sure which one really fixes it.

Philippe

 
> -----Original Message-----
> From: Bernadat, Philippe 
> Sent: Monday, December 18, 2006 2:42 PM
> To: Or Gerlitz
> Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> openib-general at openib.org
> Subject: RE: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> I have tried both fiexs, none of these  improve performance ... 
> Let me check the number of packets again.
> 
> Philippe 
> 
> > -----Original Message-----
> > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> > Sent: Monday, December 18, 2006 2:03 PM
> > To: Bernadat, Philippe
> > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> > openib-general at openib.org
> > Subject: Re: [openib-general] Performance Degradation with 
> > OFED v. Voltaire(lustre)
> > 
> > Bernadat, Philippe wrote:
> > > 3) Is there a way to change the MTU from within the lustre 
> > LND kernel
> > > module. I saw that the IB perf programs did this with the 
> > modify_qp()
> > > APIs.
> > 
> > yes, go to the place where the lustre NLD active side gets 
> > RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set
> > 
> > 	lustre_id->route->path_rec->mtu = IB_MTU_1024;
> > 
> > Or.
> > 
> > 
> > 
> > 


From wombat2 at us.ibm.com  Mon Dec 18 06:09:56 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Mon, 18 Dec 2006 09:09:56 -0500
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <20061216170328.GB24716@mellanox.co.il>
Message-ID: <OFB669ED7C.48EFDCA0-ON85257248.004C3948-85257248.004DD21F@us.ibm.com>

"Michael S. Tsirkin" <mst at mellanox.co.il> wrote on 12/16/2006 12:03:28 PM:

> > > > >
> > > > > Tried this patch, it didn't work on ehca. I couldn't change 
> the mode from
> > > > > datagram to connected from /sys/class.
> > > >
> > > > It's wroking as designed in that respect.  ehca does not implement
> > > > srq - without
> > > > srq, there is no way to prepost receive buffers for a 
> resonable number of
> > > > connections without running out of memory.
> > > >
> > > > So it is falling back on datagram mode.
> > > > Talk to ehca guys to implement srq and connected mode will be 
enabled.
> > > Don't remember SRQ is a MUST for UC mode. Does this patch support
> > > devices with SRQ in RC mode?
> > 
> > I don't think the IB HCA Spec requires SRQ support for RC but is an 
optional
> > feature. There are two adapters right now that don't support SRQ 
> which means to
> > use IPoIB-CM on them you should make the use of SRQ an option setting.
> 
> No, adding such "drink up all memory on real clusters but run well 
> on a back to back
> benchmark platform" option does not seem like a good idea to me.
> Rather, we should use UD mode to keep IPoIB scalable on all hardware.

I agree that adapters without SRQ can consume larger amounts of memory
than those with SRQ; however, that is not a good reason to prevent the
use of RC or UC on those adapters. The memory consumption problem of
any protocol running over RC or UC without SRQ is well documented.
At the OpenFabrics meeting in Tampa, one of several themes was that we need
better IP performance to move into commercial customers and also to help our
current, primarily HPC, customers, some of which do not have configurations
with large numbers of endpoints. Even though other ULPs are available, good
IP performance is still the opportunity to get more customers onto IB.

Not all of our IB customers have deployments with a large number of
endpoints, so letting non-SRQ adapters use IPoIB-CM is still important for
expanding the customer base for IB. You have to let customers decide how
they want to tune their systems based on the available functions and
features. Otherwise you don't have equality in potential performance across
all HCAs. Some guidance on memory consumption would be good, to help users
decide whether they want to run IPoIB-CM without SRQ, just as IPoIB-CM
itself will be selectable.

> 
> > I agree
> > that if it is available it should be used for scaling issues probably 
if
> > available automatically set. But I would like to see us at least 
support the
> > current hardware that meets the current SPEC.
> 
> SRQ support is clearly optional. But neither is IPoIB CM support a 
required
> feature. Current code will fall back to datagram mode when SRQ is not
> supported, and since UD support in not optional, all current hardware is 
still
> supported with IPoIB - this patch does not break this.
> 
> -- 
> MST


Bernie King-Smith 
IBM Corporation
Server Group
Cluster System Performance 
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES 

"We are not responsible for the world we are born into, only for the world 
we leave when we die.
So we have to accept what has gone before us and work to change the only 
thing we can,
-- The Future." William Shatner

From philippe_bernadat at hp.com  Mon Dec 18 06:19:36 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Mon, 18 Dec 2006 15:19:36 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>

So after a bit more testing, setting the route path mtu to 1024 before
the qp creation (rdma_create_qp()) seems sufficient.
 

Philippe

> -----Original Message-----
> From: Bernadat, Philippe 
> Sent: Monday, December 18, 2006 3:09 PM
> To: Bernadat, Philippe; Or Gerlitz
> Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> openib-general at openib.org
> Subject: RE: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> Or,
> 
> I did manage to fix it my way, by inserting this same
> 
> route->path_rec->mtu = IB_MTU_1024;
> 
> Before/after qp creation
> Before/after accept
> Before/after connect
> 
> So not sure which one really fixes it.
> 
> Philippe
> 
>  
> 
> > -----Original Message-----
> > From: Bernadat, Philippe 
> > Sent: Monday, December 18, 2006 2:42 PM
> > To: Or Gerlitz
> > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> > openib-general at openib.org
> > Subject: RE: [openib-general] Performance Degradation with 
> > OFED v. Voltaire(lustre)
> > 
> > I have tried both fiexs, none of these  improve performance ... 
> > Let me check the number of packets again.
> > 
> > Philippe 
> > 
> > > -----Original Message-----
> > > From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> > > Sent: Monday, December 18, 2006 2:03 PM
> > > To: Bernadat, Philippe
> > > Cc: Hal Rosenstock; Eitan Zahavi; Roland Dreier; 
> > > openib-general at openib.org
> > > Subject: Re: [openib-general] Performance Degradation with 
> > > OFED v. Voltaire(lustre)
> > > 
> > > Bernadat, Philippe wrote:
> > > > 3) Is there a way to change the MTU from within the lustre 
> > > LND kernel
> > > > module. I saw that the IB perf programs did this with the 
> > > modify_qp()
> > > > APIs.
> > > 
> > > yes, go to the place where the lustre NLD active side gets 
> > > RDMA_CM_EVENT_ROUTE_RESOLVED event on its rdma cm id and then set
> > > 
> > > 	lustre_id->route->path_rec->mtu = IB_MTU_1024;
> > > 
> > > Or.
> > > 
> > > 
> > > 
> > > 


From wombat2 at us.ibm.com  Mon Dec 18 06:33:04 2006
From: wombat2 at us.ibm.com (Bernard King-Smith)
Date: Mon, 18 Dec 2006 09:33:04 -0500
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <mailman.470.1166310726.18259.openib-general@openib.org>
Message-ID: <OF186626EF.294EBD42-ON85257248.004F2190-85257248.004FF05B@us.ibm.com>

> ----- Message from "Michael S. Tsirkin" <mst at mellanox.co.il> on Sat,
> 16 Dec 2006 18:47:09 +0200 -----
> 
> To:
> 
> "Shirley Ma" <xma at us.ibm.com>
> 
> cc:
> 
> openib-general at openib.org
> 
> Subject:
> 
> Re: [openib-general] [PATCHv2] IPoIB CM Experimental support
> 
> > > > Hi, Michael,
> > > >
> > > > Tried this patch, it didn't work on ehca. I couldn't change the
> > > > mode from datagram to connected from /sys/class.
> > >
> > > It's working as designed in that respect.  ehca does not implement
> > > srq - without srq, there is no way to prepost receive buffers for a
> > > reasonable number of connections without running out of memory.
> > >
> > > So it is falling back on datagram mode.
> > > Talk to ehca guys to implement srq and connected mode will be enabled.
> >
> > Don't remember SRQ is a MUST for UC mode. Does this patch support
> > devices with SRQ in RC mode?
> 
> Yes. Only RC mode is supported by this patch.
> From what you say I am guessing that SRQ is supported by ehca HW but
> support is currently lacking in the ehca driver?

The current EHCA hardware does NOT support SRQ.

> 
> -- 
> MST
> 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general


Bernie King-Smith 
IBM Corporation
Server Group
Cluster System Performance 
wombat2 at us.ibm.com    (845)433-8483
Tie. 293-8483 or wombat2 on NOTES 

"We are not responsible for the world we are born into, only for the world 
we leave when we die.
So we have to accept what has gone before us and work to change the only 
thing we can,
-- The Future." William Shatner

From mst at mellanox.co.il  Mon Dec 18 06:46:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 16:46:23 +0200
Subject: [openib-general] [PATCHv2] IPoIB CM Experimental support
In-Reply-To: <OFB669ED7C.48EFDCA0-ON85257248.004C3948-85257248.004DD21F@us.ibm.com>
References: <OFB669ED7C.48EFDCA0-ON85257248.004C3948-85257248.004DD21F@us.ibm.com>
Message-ID: <20061218144623.GF3169@mellanox.co.il>

> I agree that adapters that don't have SRQ can consume larger amounts of memory than those with SRQ; however, that is not a good reason to prevent usage of RC or UC on those adapters. The memory consumption problem with any protocol not using SRQ and
> running over RC or UC is well documented.

But not solved.

> At the OpenFabrics meeting in Tampa one of several themes was that we need better IP performance to move into commercial customers and also help our current, primarily HPC, customers, some of which do not have configurations with large numbers
> of endpoints. Even though other ULPs are available, good IP is still the opportunity for getting more customers on IB. 

That's why you need zero configuration setup that works well on anything
from back-to-back to 1000s of nodes. And this means code that's scalable by
design.

> Not all of our IB customers have deployments with a large number of endpoints, so having
> non-SRQ adapters use IPoIB-CM is still important to expanding the customer base
> for IB. You have to let the customer decide how they want to tune their system
> based on the available functions/features.

This just sounds too ugly. I do not *want* to special-case small clusters
precisely because this way big iron flows get no testing.
And people should not "tune" their systems just to
have them basically not run out of memory and crash.

> If not you don't have equality in
> potential performance across all HCA's.

???
It's not *practical* to require equivalent performance on all HCAs.
I just try to do the best I can, and I don't think each trade-off
needs to be turned into a configuration option.

> Some guidance on memory consumption
> would be good, to guide users whether they want to run IPoIB-CM without SRQ just
> like IPoIB-CM will be selectable. 

I still think falling back to UD mode is the right solution if HCA does not support
SRQ. I just don't see an "ignore scalability issues" option in IPoIB as being
anything but a support nightmare, and having any right to existence outside a
lab.
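
To put rough numbers on the memory argument, here is a back-of-the-envelope sketch. The function names and the figures in the note below are assumptions chosen for illustration only; they are not the ring sizes or buffer sizes used by the IPoIB CM patch.

```c
#include <stdint.h>

/* Receive memory when every connection preposts its own ring (no SRQ). */
uint64_t rx_bytes_per_conn_rings(uint64_t conns, uint64_t ring_entries,
                                 uint64_t buf_bytes)
{
    return conns * ring_entries * buf_bytes;
}

/* Receive memory when all connections draw from one shared ring (SRQ). */
uint64_t rx_bytes_shared_ring(uint64_t ring_entries, uint64_t buf_bytes)
{
    return ring_entries * buf_bytes;
}
```

With an assumed 1000 connections, 128 receive entries per ring and 64 KiB buffers, the per-connection model needs 1000 * 128 * 65536 = 8,388,608,000 bytes (about 7.8 GiB), while one shared ring of the same depth needs 8,388,608 bytes (8 MiB). A real shared queue would likely be sized deeper than a single connection's ring, but it does not grow linearly with the number of peers.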

But - let's see this code land upstream, then code up a patch that is not ugly,
and post it. But IMO time might be better spent adding srq support in ehca.

-- 
MST


From sashak at voltaire.com  Mon Dec 18 07:10:10 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 18 Dec 2006 17:10:10 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-18:normal
	completion
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BEFC@mtlexch01.mtl.com>
Message-ID: <20061218151010.GG4808@sashak.voltaire.com>

On 15:33 Mon 18 Dec     , Eitan Zahavi wrote:
> Hi Sasha,
> 
> The failure analysis takes time and is manual...
> The logs and related files are pretty big and will take space to upload.
> 
> Today I simulated with OpenSM that was compiled on the side (my bad -
> should have incorporated my patches on the clone but I was not sure this
> is not going to "contaminate" that git tree forever) with the fixes for
> DONE/DONE_PENDING.

You can commit your changes to the branch, and later rebase this branch
on top of the new master, something like 'git-rebase master my-branch'.

> The tests that failed today are actually false violations:
> 1. The IS1-16 failed due to lack of free sockets to connect to the
> server. Still not clear why. I will increase the number of sockets the
> client/server try to connect on.
> 2. The IS3-128 failed due to a temporary replacement of opensm with the
> one that has my fixes for DONE/DONE_PENDING. This was a mistake I made
> manually by compiling the "clone". As I was watching the log I
> noticed that the same wrong signal was happening.

Understood.

> BTW: The DONE/DONE_PENDING bug was discovered by a change in simulator
> dispatcher that I did. The change introduced a BUG that caused the
> machine to be overloaded with busy loop in the simulator dispatcher.
> Apparently this brought up some different timing and found these bugs.

So it was helpful simulator shakes. :)

Thanks for catching this.

BTW, 


> 
> EZ
> 
> > -----Original Message-----
> > From: Sasha Khapyorsky [mailto:sashak at voltaire.com]
> > Sent: Monday, December 18, 2006 3:31 PM
> > To: Eitan Zahavi
> > Cc: Eitan Zahavi; Yevgeny Kliteynik; halr at voltaire.com;
> > openib-general at openib.org
> > Subject: Re: nightly osm_sim report 2006-12-18:normal completion
> > 
> > Hi Eitan,
> > 
> > On 13:19 Mon 18 Dec     , Eitan Zahavi wrote:
> > > OSM Simulation Regression Summary
> > > OpenSM rev = Fri_Dec_15_20:29:07_2006 d5e724 ibutils rev =
> > > Thu_Dec_14_21:48:18_2006 fd82d4 MOD_FILES=1
> > > Total=221 Pass=219 Fail=2
> > >
> > > Pass:
> > > 31 LidMgr IS1-16.topo
> > > 30 Stability IS1-16.topo
> > > 30 Pkey IS1-16.topo
> > > 30 Multicast IS1-16.topo
> > > 29 OsmStress IS1-16.topo
> > > 10 Stability IS3-loop.topo
> > > 10 Stability IS3-128.topo
> > > 10 Pkey IS3-128.topo
> > > 10 Multicast IS3-loop.topo
> > > 10 Multicast IS3-128.topo
> > > 10 LidMgr IS3-128.topo
> > > 9 OsmStress IS3-128.topo
> > >
> > > Failures:
> > > 1 OsmStress IS3-128.topo
> > > 1 OsmStress IS1-16.topo
> > 
> > Is it possible to have more details about the failures (in case they are
> > real failures)? Perhaps upload the logs somewhere?
> > 
> > Sasha


From halr at voltaire.com  Mon Dec 18 07:11:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 10:11:58 -0500
Subject: [openib-general] [PATCH] osm: fix a bug in ignroing pending
 transaction of Light Sweep
In-Reply-To: <45844167.9060302@mellanox.co.il>
References: <45844167.9060302@mellanox.co.il>
Message-ID: <1166454599.32666.185925.camel@hal.voltaire.com>

Hi Eitan,

On Sat, 2006-12-16 at 13:56, Eitan Zahavi wrote:
> Hi Hal
> 
> This patch fixes an issue discovered by the nightly regression.
> The OpenSM state machine got stuck due to a pending SwitchInfo transaction 
> being ignored since one of the queries for SwitchInfo
> failed (due to a bad link).
> The patch below simply avoids aborting the wait for all SwitchInfo 
> requests to return.
> 
> I think this issue might have hurt us in other situations too, since it 
> aborted the wait on "CHANGE DETECTED" too.
> CHANGE_DETECTED is fired on the first switch that reported "Change Bit".
> 
> It is possible that the issue is showing up as we added incremental 
> support (e.g. for routing),
> since only if there are no other SMPs sent during the heavy sweep will we 
> get the
> "NO_PENDING_TRANSACTIONS" signal caused by the SwitchInfo requests

So is the same issue applicable to OFED 1.1 ?

> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il >
> 
>  osm/opensm/osm_state_mgr.c |    5 ++---
>  1 files changed, 2 insertions(+), 3 deletions(-)

Thanks. Applied.

-- Hal


From bos at pathscale.com  Mon Dec 18 07:22:17 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 18 Dec 2006 07:22:17 -0800
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
References: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
Message-ID: <4586B229.5050300@pathscale.com>

Hoang-Nam Nguyen wrote:

> Good point. I can speak for ehca only. We prefer to reuse existing
> errno values and not to define new ones as it's also a question of
> how much information we want to tell the consumer in case of error
> and what it can handle for.

This is independent of the question of whether to return -1 or -errno to 
indicate an error in userspace.  The standard in userspace has long been 
to return -1, with the error code propagated through the errno 
pseudo-variable.  It's a crummy convention, but it's at least consistent 
with the rest of userspace.
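
For readers less familiar with the two conventions being contrasted, a minimal sketch follows. The function names and the "port < 1 is invalid" rule are made up for the example and are not taken from libibverbs or any driver.

```c
#include <errno.h>

/* Userspace convention: return -1 and report the cause through errno. */
int sketch_query_userspace_style(int port)
{
    if (port < 1) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

/* Kernel-style convention: return the negated error code directly. */
int sketch_query_kernel_style(int port)
{
    if (port < 1)
        return -EINVAL;
    return 0;
}
```

A caller of the first form checks for -1 and then reads errno; a caller of the second form inspects the return value itself. Mixing the two in one library forces callers to know which style each entry point uses.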

(By the way, libipathverbs just propagates error codes up from 
libibverbs.  It doesn't generate any new numeric return values of its own.)

	<b


From bos at pathscale.com  Mon Dec 18 07:24:33 2006
From: bos at pathscale.com (Bryan O'Sullivan)
Date: Mon, 18 Dec 2006 07:24:33 -0800
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <45867FBD.9040300@dev.mellanox.co.il>
References: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
	<45867FBD.9040300@dev.mellanox.co.il>
Message-ID: <4586B2B1.2060908@pathscale.com>

Dotan Barak wrote:

> I think that there should be 2 modes to the drivers:
> mode 1 (release mode): return "standard" errno values
> mode 2 (debug mode) : return "IB oriented" values

No way, that's a guaranteed route to broken code.  If you want to 
propagate IB-specific error values, define an ib_errno variable, make it 
use the same TLS mechanism as errno, give it well-defined values, and 
make it part of the ABI.  Some mechanism that you can't rely on unless 
you know you need to tweak it is worse than useless.
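
A minimal sketch of what such a per-thread error variable could look like, assuming a C11 compiler. The names (sketch_ib_errno, sketch_post_send) and the error values are hypothetical; nothing like this is defined by libibverbs in this thread.

```c
/* Hypothetical IB-specific error codes with fixed, documented values. */
enum sketch_ib_err {
    SKETCH_IB_OK = 0,
    SKETCH_IB_ERR_QP_STATE = 1,
    SKETCH_IB_ERR_CQ_OVERRUN = 2
};

/* One copy per thread, in the spirit of errno; _Thread_local is C11. */
_Thread_local int sketch_ib_errno;

/* Fails with -1 and records the IB-specific reason for the calling thread. */
int sketch_post_send(int qp_ready)
{
    if (!qp_ready) {
        sketch_ib_errno = SKETCH_IB_ERR_QP_STATE;
        return -1;
    }
    return 0;
}
```

The point of fixing the values in an enum is that they become part of the ABI: callers can rely on them in every build, rather than only in a debug mode.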

	<b


From dotanb at dev.mellanox.co.il  Mon Dec 18 07:52:09 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Mon, 18 Dec 2006 17:52:09 +0200
Subject: [openib-general] Different low level drivers returns different
 return values incase of an error
In-Reply-To: <4586B2B1.2060908@pathscale.com>
References: <OF790BC847.6E4E20CC-ONC1257248.00323739-C1257248.00337FA2@de.ibm.com>
	<45867FBD.9040300@dev.mellanox.co.il> <4586B2B1.2060908@pathscale.com>
Message-ID: <4586B929.9080306@dev.mellanox.co.il>

Bryan O'Sullivan wrote:
> Dotan Barak wrote:
>
>> I think that there should be 2 modes to the drivers:
>> mode 1 (release mode): return "standard" errno values
>> mode 2 (debug mode) : return "IB oriented" values
>
> No way, that's a guaranteed route to broken code.  If you want to 
> propagate IB-specific error values, define an ib_errno variable, make 
> it use the same TLS mechanism as errno, give it well-defined values, 
> and make it part of the ABI.  Some mechanism that you can't rely on 
> unless you know you need to tweak it is worse than useless.
>
>     <b
This was an example of a possible way to give the user more 
info when there is a failure
(I'm sure that we can come up with a better solution that will be 
accepted by everyone ..)

Dotan


From halr at voltaire.com  Mon Dec 18 08:07:43 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 11:07:43 -0500
Subject: [openib-general] [PATCH] osm: fix bugs related to not passing
 OSM_SIGNAL_DONE_PENDING
In-Reply-To: <45846F4C.4080501@mellanox.co.il>
References: <45846F4C.4080501@mellanox.co.il>
Message-ID: <1166458043.32666.188439.camel@hal.voltaire.com>

Hi Eitan,

On Sat, 2006-12-16 at 17:12, Eitan Zahavi wrote:
> Hi Hal
> 
> This set of patches fixes cases where OSM_SIGNAL_DONE_PENDING is not 
> passed back to the state manager,
> which breaks the state machine later in the sweep.
> 
> Eitan
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> 
>  osm/opensm/osm_pkey_mgr.c  |  112 
> ++++++++++++++++++++++++++++++++------------

This patch (here and in other places) appears to be line wrapped.

> osm/opensm/osm_state_mgr.c |   11 +++--
>  osm/opensm/osm_ucast_mgr.c |   96 ++++++++++++++++++++++++--------------
>  4 files changed, 179 insertions(+), 88 deletions(-)

Is this patch 4 files or 3 ? (How was this patch generated ?)

Is this one patch or should it be 2 or 3 ? It looks to me there is an
incremental change to osm_state_mgr.c and perhaps 2 other ones which can
be separate (pkey and ucast_mgr).

Also, see below in osm_state_mgr.c for another minor comment.

> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
> index 48837bc..a33aec7 100644
> --- a/osm/opensm/osm_pkey_mgr.c
> +++ b/osm/opensm/osm_pkey_mgr.c
> @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
>  
>  /**********************************************************************
>   **********************************************************************/
> -static ib_api_status_t
> +static boolean_t
>  pkey_mgr_enforce_partition(
> +  IN osm_log_t *p_log,
>    IN const osm_req_t *p_req,
>    IN const osm_physp_t *p_physp,
>    IN const boolean_t enforce)
> @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
>    osm_madw_context_t context;
>    uint8_t payload[IB_SMP_DATA_SIZE];
>    ib_port_info_t *p_pi;
> +  ib_api_status_t status;
>  
>    if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
> -    return IB_ERROR;
> +  {
> +     osm_log( p_log, OSM_LOG_ERROR,
> +              "pkey_mgr_enforce_partition: ERR 0507: "
> +              "No port info for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +     return FALSE;
> +  }
>  
> -  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
> -    return IB_SUCCESS;
> +  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
> +  {
> +     osm_log( p_log, OSM_LOG_DEBUG,
> +              "pkey_mgr_enforce_partition: "
> +              "No need to update PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +    return FALSE;
> +  }
>  
>    memset( payload, 0, IB_SMP_DATA_SIZE );
>    memcpy( payload, p_pi, sizeof(ib_port_info_t) );
> @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
>    context.pi_context.light_sweep = FALSE;
>    context.pi_context.active_transition = FALSE;
>  
> -  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
> -                      payload, sizeof(payload),
> -                      IB_MAD_ATTR_PORT_INFO,
> -                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
> -                      CL_DISP_MSGID_NONE, &context );
> +  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
> +        payload, sizeof(payload),
> +        IB_MAD_ATTR_PORT_INFO,
> +        cl_hton32( osm_physp_get_port_num( p_physp ) ),
> +        CL_DISP_MSGID_NONE, &context );
> +  if (status != IB_SUCCESS)
> +  {
> +     osm_log( p_log, OSM_LOG_ERROR,
> +              "pkey_mgr_enforce_partition: ERR 0520: "
> +              "Failed to set PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +     return FALSE;
> +  }
> +  else
> +  {
> +     osm_log( p_log, OSM_LOG_DEBUG,
> +              "pkey_mgr_enforce_partition: "
> +              "Set PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +   return TRUE;
> +  }
>  }
>  
>  /**********************************************************************
> @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
>  
>      status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, 
> block_index );
>      if (status == IB_SUCCESS)
> -      ret_val = TRUE;
> +  {
> +   osm_log( p_log, OSM_LOG_DEBUG,
> +      "pkey_mgr_update_port: "
> +      "Updated "
> +      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
> +      block_index,
> +      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> +      osm_physp_get_port_num( p_physp ) );
> +   ret_val = TRUE;
> +  }
>      else
> -      osm_log( p_log, OSM_LOG_ERROR,
> -        "pkey_mgr_update_port: ERR 0506: "
> -        "pkey_mgr_update_pkey_entry() failed to update "
> -        "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
> -        block_index,
> -        cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> -        osm_physp_get_port_num( p_physp ) );
> +  {
> +   osm_log( p_log, OSM_LOG_ERROR,
> +      "pkey_mgr_update_port: ERR 0506: "
> +      "pkey_mgr_update_pkey_entry() failed to update "
> +      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
> +      block_index,
> +      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> +      osm_physp_get_port_num( p_physp ) );
> +  }
>    }
>  
>    return ret_val;
> @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
>    uint16_t peer_max_blocks;
>    ib_api_status_t status = IB_SUCCESS;
>    boolean_t ret_val = FALSE;
> +  boolean_t port_info_set = FALSE;
>    ib_pkey_table_t empty_block;
> -
> + 
>    memset(&empty_block, 0, sizeof(ib_pkey_table_t));
>  
>    p_physp = osm_port_get_default_phys_ptr( p_port );
> @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
>      enforce = FALSE;
>    }
>  
> -  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
> -  {
> -    osm_log( p_log, OSM_LOG_ERROR,
> -      "pkey_mgr_update_peer_port: ERR 0507: "
> -      "pkey_mgr_enforce_partition() failed to update "
> -      "node 0x%016" PRIx64 " port %u\n",
> -      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> -      osm_physp_get_port_num( peer ) );
> -  }
> +  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
> +   port_info_set = TRUE;
>  
>    if (enforce == FALSE)
> -    return FALSE;
> +  return port_info_set;
>  
>    p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
>    for (block_index = 0; block_index < p_pkey_tbl->used_blocks; 
> block_index++)
> @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
>               osm_physp_get_port_num( peer ) );
>    }
>  
> +  if (port_info_set) return TRUE;
>    return ret_val;
>  }
>  
> @@ -541,10 +593,10 @@ osm_pkey_mgr_process(
>        signal = OSM_SIGNAL_DONE_PENDING;
>      p_node = osm_port_get_parent_node( p_port );
>      if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
> -  pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
> +   pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
>          &p_osm->subn, p_port,
>          !p_osm->subn.opt.no_partition_enforcement ) )
> -      signal = OSM_SIGNAL_DONE_PENDING;       
> +      signal = OSM_SIGNAL_DONE_PENDING;
>    }
>  
>   _err:
> diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
> index 9eac038..4e61259 100644
> --- a/osm/opensm/osm_state_mgr.c
> +++ b/osm/opensm/osm_state_mgr.c
> @@ -1853,6 +1853,7 @@ osm_state_mgr_process(
>  {
>     ib_api_status_t status;
>     osm_remote_sm_t *p_remote_sm;
> + osm_signal_t tmp_signal;
>  
>     CL_ASSERT( p_mgr );
>  
> @@ -2075,11 +2076,10 @@ osm_state_mgr_process(
>           case OSM_SIGNAL_CHANGE_DETECTED:
>              /*
>               * Nothing to do here.  One subnet change typcially
> -             * begets another....
> +             * begets another.... But needs to wait for all transactions
>               */
>              signal = OSM_SIGNAL_NONE;
> -            break;
> -

This is a repeat of your previous submitted patch to this file so isn't
needed.

-- Hal

> +    break;
>           case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
>              /*
>               * A change was detected on the subnet.
> @@ -2219,7 +2219,10 @@ osm_state_mgr_process(
>              signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
>  
>              /* the returned signal is always DONE */
> -            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> +            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> +
> +    if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
> +     signal = OSM_SIGNAL_DONE_PENDING;
>  
>              /* try to restore SA DB (this should be before lid_mgr
>                 because we may want to disable clients reregistration
> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index e977253..39973de 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c
> @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table(
>    ib_switch_info_t si;
>    uint32_t block_id_ho = 0;
>    uint8_t block[IB_SMP_DATA_SIZE];
> +  boolean_t set_swinfo_require = FALSE;
> +  uint16_t lin_top;
> +  uint8_t life_state;
>  
>    CL_ASSERT( p_mgr );
>  
> @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table(
>      Set the top of the unicast forwarding table.
>    */
>    si = *osm_switch_get_si_ptr( p_sw );
> -  si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
> +  lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
> +  if (si.lin_top != lin_top)
> +  {
> +   set_swinfo_require = TRUE;
> +      si.lin_top  = lin_top;
> +  }
>  
>    /* check to see if the change state bit is on. If it is - then we
>       need to clear it. */
> -   if( ib_switch_info_get_state_change( &si ) )
> -    si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
> -                      | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
> +  if ( ib_switch_info_get_state_change( &si ) )
> +      life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
> +                          | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
>    else
> -    si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
> +      life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
>  
> -  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
> +  if (life_state != si.life_state)
>    {
> -    osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
> -             "osm_ucast_mgr_set_fwd_table: "
> -             "Setting switch FT top to LID 0x%X\n",
> -             osm_switch_get_max_lid_ho( p_sw ) );
> +      set_swinfo_require = TRUE;
> +      si.life_state = life_state;
>    }
> -
> -  context.si_context.light_sweep = FALSE;
> -  context.si_context.node_guid = osm_node_get_node_guid( p_node );
> -  context.si_context.set_method = TRUE;
> -
> -  status = osm_req_set( p_mgr->p_req,
> -                        p_path,
> -                        (uint8_t*)&si,
> -                        sizeof(si),
> -                        IB_MAD_ATTR_SWITCH_INFO,
> -                        0,
> -                        CL_DISP_MSGID_NONE,
> -                        &context );
> -
> -  if( status != IB_SUCCESS )
> + 
> +  if ( set_swinfo_require )
>    {
> -    osm_log( p_mgr->p_log, OSM_LOG_ERROR,
> -             "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
> -             "Sending SwitchInfo attribute failed (%s)\n",
> -             ib_get_err_str( status ) );
> +      if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
> +      {
> +          osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
> +                      "osm_ucast_mgr_set_fwd_table: "
> +                      "Setting switch FT top to LID 0x%X\n",
> +                      osm_switch_get_max_lid_ho( p_sw ) );
> +      }
> +     
> +      context.si_context.light_sweep = FALSE;
> +      context.si_context.node_guid = osm_node_get_node_guid( p_node );
> +      context.si_context.set_method = TRUE;
> +     
> +      status = osm_req_set( p_mgr->p_req,
> +                                    p_path,
> +                                    (uint8_t*)&si,
> +                                    sizeof(si),
> +                                    IB_MAD_ATTR_SWITCH_INFO,
> +                                    0,
> +                                    CL_DISP_MSGID_NONE,
> +                                    &context );
> +     
> +      if( status != IB_SUCCESS )
> +      {
> +          osm_log( p_mgr->p_log, OSM_LOG_ERROR,
> +                      "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
> +                      "Sending SwitchInfo attribute failed (%s)\n",
> +                      ib_get_err_str( status ) );
> +      }
> +      else
> +          p_mgr->any_change = TRUE;
>    }
>  
>    /*
> @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process(
>  
>    CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
>  
> +  p_mgr->any_change = FALSE;
> +
>    /*
>      If there are no switches in the subnet, we are done.
>    */
>    if (cl_qmap_count( p_sw_guid_tbl ) == 0)
>      goto Exit;
>  
> -  p_mgr->any_change = FALSE;
>    cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
>  
>    if (!p_routing_eng->build_lid_matrices ||
> @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process(
>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
>      __osm_ucast_mgr_dump_tables( p_mgr );
>  
> -  if (p_mgr->any_change)
> +  if (p_mgr->any_change)
> +  {
>       signal = OSM_SIGNAL_DONE_PENDING;
> +      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
> +                 "osm_ucast_mgr_process: "
> +                 "LFT Tables configured on all switches\n");
> +  }
>    else
> +  {
> +      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
> +                 "osm_ucast_mgr_process: "
> +                 "No need to set any LFT Tables on all switches\n");
>       signal = OSM_SIGNAL_DONE;
> -
> -  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
> -          "osm_ucast_mgr_process: "
> -          "LFT Tables configured on all switches\n");
> +  }
>  
>   Exit:
>    CL_PLOCK_RELEASE( p_mgr->p_lock );
> 
> 
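
The osm_state_mgr.c hunk quoted above stops a later step from overwriting an earlier DONE_PENDING result. Reduced to its core, the idea can be sketched as follows; the enum and function names are stand-ins for this example, not OpenSM's osm_signal_t API.

```c
/* Stand-ins for the two completion signals discussed in this thread. */
enum sketch_signal {
    SKETCH_DONE,
    SKETCH_DONE_PENDING
};

/*
 * Combine the result of a later step with an earlier one so that a
 * DONE_PENDING reported by either step survives.
 */
enum sketch_signal sketch_merge_signal(enum sketch_signal earlier,
                                       enum sketch_signal later)
{
    if (earlier == SKETCH_DONE_PENDING || later == SKETCH_DONE_PENDING)
        return SKETCH_DONE_PENDING;
    return SKETCH_DONE;
}
```

Assigning the later result directly to the same variable, as the pre-patch code did, loses the information that an earlier step still has transactions outstanding.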


From yosefe at voltaire.com  Mon Dec 18 08:21:21 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 18 Dec 2006 18:21:21 +0200
Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa
	unload
Message-ID: <1166458881.9289.17.camel@muscida>

This is a fix to Sean's multicast patches for ofed 1.2.

The issue is described in: 
http://www.mail-archive.com/openib-general at openib.org/msg27097.html

The Oops happened because the multicast work handler was called
after the multicast device structure was released. This happened because
the multicast cleanup function 'mcast_remove_one' waited for work queue
completion on only N-1 of the device's N ports before releasing the device.
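
The off-by-one is easy to see in isolation. The helpers below are illustrative only (the names are made up); they count how many ports each loop bound visits when the range from start_port to end_port is inclusive.

```c
/* Number of ports in the inclusive range [start_port, end_port]. */
unsigned int sketch_port_count(unsigned int start_port, unsigned int end_port)
{
    return end_port - start_port + 1;
}

/* Iterations made with the exclusive bound "i < end - start". */
unsigned int sketch_visits_exclusive(unsigned int start_port,
                                     unsigned int end_port)
{
    unsigned int i, visited = 0;

    for (i = 0; i < end_port - start_port; i++)
        visited++;
    return visited;
}

/* Iterations made with the inclusive bound "i <= end - start". */
unsigned int sketch_visits_inclusive(unsigned int start_port,
                                     unsigned int end_port)
{
    unsigned int i, visited = 0;

    for (i = 0; i <= end_port - start_port; i++)
        visited++;
    return visited;
}
```

For a two-port device with ports 1 and 2, the exclusive bound visits one port and the inclusive bound visits both, which matches the one-character change in the diff.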

The patch applies after Sean's multicast patch series.

---
 multicast.c |    2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)
 
diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index a8ff6fa..4e15fd3 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -822,7 +822,7 @@ static void mcast_remove_one(struct ib_d
 	ib_unregister_event_handler(&event_handler);
 	flush_workqueue(mcast_wq);
 
-	for (i = 0; i < dev->end_port - dev->start_port; i++) {
+	for (i = 0; i <= dev->end_port - dev->start_port; i++) {
 		port = &dev->port[i];
 		deref_port(port);
 		wait_for_completion(&port->comp);
--

Yosef Etigin
yosefe at voltaire.com


From mst at mellanox.co.il  Mon Dec 18 08:41:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 18:41:00 +0200
Subject: [openib-general] open-iscsi update for OFED 1.2
In-Reply-To: <457D2569.2000805@voltaire.com>
References: <457D2569.2000805@voltaire.com>
Message-ID: <20061218164100.GB24076@mellanox.co.il>

> In order to create backport patches to a
> specific distro, I need to know where I start from (i.e which kernel
> version).

I have just pulled 2.6.20-rc1 into OFED kernel tree.
Most likely, there won't be major API changes before 2.6.21.
So if that's what you were waiting for, go ahead, clone ~vlad/ofed_1_2/.git
and start working on the backports.
Try to use the kernel_addons infrastructure as much as possible (it's much
easier to maintain); where that is not possible, you can still use
kernel_patches/backports as in OFED 1.1.

At your request, Vlad added checking out iscsi to ~vlad/ofabuild.git,
and I expect it does not build on any kernel older than 2.6.20-rc1.

Vlad here shall be able to help with any questions on OFED build scripts.

-- 
MST


From jriotto at cisco.com  Mon Dec 18 08:53:10 2006
From: jriotto at cisco.com (Jamie Riotto (jriotto))
Date: Mon, 18 Dec 2006 08:53:10 -0800
Subject: [openib-general] EWG Call Info for Dec 18, 2006
Message-ID: <944AD9DA9232E346ADF590C41BFFEC410325BF9E@xmb-sjc-232.amer.cisco.com>

Date/Time:               DEC 4, 2006 at 12:00PM America/New_York 
Length:                  60 
Frequency:               10 
Meeting ID:              2106670 
Meeting Password:        

Global Access Numbers: 
http://cisco.com/en/US/about/doing_business/conferencing/index.html

    US/Canada:  +1.866.432.9903    United Kingdom:   +44.20.8824.0117 
    India:      +91.80.4103.3979   Germany:          +49.619.6773.9002 
    Japan:      +81.3.5763.9394    China:            +86.10.8515.5666 


From mst at mellanox.co.il  Mon Dec 18 08:59:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 18 Dec 2006 18:59:48 +0200
Subject: [openib-general] OFED 1.2 updated to 2.6.20-rc1, UCMA in
Message-ID: <20061218165948.GC24076@mellanox.co.il>

OK, I pulled v2.6.20-rc1 from linus (actually, the immediately following
d1998ef38 which fixes a compilation issue in ib_verbs.h) into ofed 1.2 tree.
I'll continue to pull with each RC but I expect no more major API changes.

Main things that had to be backported:
- ilog API f0d1b0b30d250a07627ad8b9fbbb5c7cc08422e8
- kmemdup API (actually introduced in 2.6.19: bed8bdfddd851657cf9e5fd16bb44abb02ae7f42)
- work_struct related changes c4028958b6ecad064b1a6303a6a5906d4fe48d73
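
To give a feel for what backporting a small helper such as kmemdup involves, here is a userspace analogue: allocate and copy. This is a sketch only, with malloc standing in for the kernel allocator and a made-up name (sketch_memdup); it is not the OFED backport code, and the kernel helper additionally takes allocation flags.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Duplicate len bytes starting at src into freshly allocated memory.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
void *sketch_memdup(const void *src, size_t len)
{
    void *p = malloc(len);

    if (p != NULL)
        memcpy(p, src, len);
    return p;
}
```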

All backports have been updated to work with that kernel version,
so if you have based your development on that, please fetch and
rebase as appropriate.

This brings in the UCMA module (sans the multicast patches),
so it should be now possible to run userspace that relies
on UCMA on OFED 1.2 kernel code.

Wrt multicast:
Sean, could you please prepare multicast tree based on 2.6.20-rc1
(+hopefully recent fixes) so I can test that with OFED?

Wrt iser:
please note that due to recent decision to include the iscsi module
in OFED 1.2, iser can't be built on older kernels until someone
(presumably from Voltaire) clones ofed_1.2 and looks into backporting
iscsi too.

I'll be off Wed/Thursday, so please Cc Vlad on any questions/issues.

Thanks,

-- 
MST


From dabeisein at konzept06.net  Sat Dec 16 06:55:46 2006
From: dabeisein at konzept06.net (Konzept 2006)
Date: Sat, 16 Dec 2006 15:55:46 +0100
Subject: [openib-general] =?iso-8859-1?q?attraktiver_Gesch=E4ftsplan?=
Message-ID: <2006121615554618AA2F95CD$A470CEAB08@PC>

Good day,
 
Before you set this e-mail aside, you should know one thing: this is not spam or any other nonsense!
 
I am writing you this e-mail to present an attractive business plan.
 
Why? you are surely asking yourself at this moment. Because with this new business method we can earn a large sum of cash together. I am already one of those who run this business plan successfully, and you will be one of them too. This concept has never existed in this form before, and ANYONE can take part. There is no faster, safer, or simpler way to earn cash in an absolutely legal manner. All you need is 30 minutes to put into practice what is written in this plan. I assure you that this plan is entirely without obligation and has won several awards for its results-oriented execution.
 
I am completely sure that after reading this concept you will be just as enthusiastic as many others are, because it has already worked.
 
So take 30 minutes, make yourself comfortable in your armchair or on your sofa, get yourself something to nibble on, and then start putting the concept in front of you into practice.
 
If this concept does not appeal to you, I apologize for any inconvenience. I respect your decision and wish you every success in the future, but at least think it over; otherwise you are about to throw away a lot of cash.
 
Thank you, and all the best for the future

From eitan at mellanox.co.il  Mon Dec 18 11:35:13 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 21:35:13 +0200
Subject: [openib-general] [PATCH] osm: fix bugs related to not passing
 OSM_SIGNAL_DONE_PENDING
In-Reply-To: <1166458043.32666.188439.camel@hal.voltaire.com>
References: <45846F4C.4080501@mellanox.co.il>
	<1166458043.32666.188439.camel@hal.voltaire.com>
Message-ID: <4586ED71.6000801@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-16 at 17:12, Eitan Zahavi wrote:
>   
>> Hi Hal
>>
>> This set of patches fixes cases where OSM_SIGNAL_DONE_PENDING is not
>> returned to the state manager, which breaks the state machine later in
>> the sweep.
>>
>> Eitan
>>
>> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
>>
>>  osm/opensm/osm_pkey_mgr.c  |  112 
>> ++++++++++++++++++++++++++++++++------------
>>     
>
> This patch (here and other places) appears to be line wrapped.
>   
Sorry about that - I did cut and paste. I will never do that again.

>   
>> osm/opensm/osm_state_mgr.c |   11 +++--
>>  osm/opensm/osm_ucast_mgr.c |   96 ++++++++++++++++++++++++--------------
>>  4 files changed, 179 insertions(+), 88 deletions(-)
>>     
>
> Is this patch 4 files or 3 ? (How was this patch generated ?)
>   
I did remove part of the patch since I already sent it separately.
> Is this one patch or should it be 2 or 3 ? It looks to me there is an
> incremental change to osm_state_mgr.c and perhaps 2 other ones which can
> be separate (pkey and ucast_mgr).
>   
I probably messed up the patch and can not tell. I will resend this 
patch after pulling from trunk again.
> Also, see below in osm_state_mgr.c for another minor comment.
>
>   
>> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
>> index 48837bc..a33aec7 100644
>> --- a/osm/opensm/osm_pkey_mgr.c
>> +++ b/osm/opensm/osm_pkey_mgr.c
>> @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
>>  
>>  /**********************************************************************
>>   **********************************************************************/
>> -static ib_api_status_t
>> +static boolean_t
>>  pkey_mgr_enforce_partition(
>> +  IN osm_log_t *p_log,
>>    IN const osm_req_t *p_req,
>>    IN const osm_physp_t *p_physp,
>>    IN const boolean_t enforce)
>> @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
>>    osm_madw_context_t context;
>>    uint8_t payload[IB_SMP_DATA_SIZE];
>>    ib_port_info_t *p_pi;
>> +  ib_api_status_t status;
>>  
>>    if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
>> -    return IB_ERROR;
>> +  {
>> +     osm_log( p_log, OSM_LOG_ERROR,
>> +              "pkey_mgr_enforce_partition: ERR 0507: "
>> +              "No port info for "
>> +              "node 0x%016" PRIx64 " port %u\n",
>> +              cl_ntoh64(
>> +                 osm_node_get_node_guid(
>> +                    osm_physp_get_node_ptr( p_physp ))),
>> +              osm_physp_get_port_num( p_physp ) );
>> +     return FALSE;
>> +  }
>>  
>> -  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
>> -    return IB_SUCCESS;
>> +  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
>> +  {
>> +     osm_log( p_log, OSM_LOG_DEBUG,
>> +              "pkey_mgr_enforce_partition: "
>> +              "No need to update PortInfo for "
>> +              "node 0x%016" PRIx64 " port %u\n",
>> +              cl_ntoh64(
>> +                 osm_node_get_node_guid(
>> +                    osm_physp_get_node_ptr( p_physp ))),
>> +              osm_physp_get_port_num( p_physp ) );
>> +    return FALSE;
>> +  }
>>  
>>    memset( payload, 0, IB_SMP_DATA_SIZE );
>>    memcpy( payload, p_pi, sizeof(ib_port_info_t) );
>> @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
>>    context.pi_context.light_sweep = FALSE;
>>    context.pi_context.active_transition = FALSE;
>>  
>> -  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
>> -                      payload, sizeof(payload),
>> -                      IB_MAD_ATTR_PORT_INFO,
>> -                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
>> -                      CL_DISP_MSGID_NONE, &context );
>> +  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
>> +        payload, sizeof(payload),
>> +        IB_MAD_ATTR_PORT_INFO,
>> +        cl_hton32( osm_physp_get_port_num( p_physp ) ),
>> +        CL_DISP_MSGID_NONE, &context );
>> +  if (status != IB_SUCCESS)
>> +  {
>> +     osm_log( p_log, OSM_LOG_ERROR,
>> +              "pkey_mgr_enforce_partition: ERR 0520: "
>> +              "Failed to set PortInfo for "
>> +              "node 0x%016" PRIx64 " port %u\n",
>> +              cl_ntoh64(
>> +                 osm_node_get_node_guid(
>> +                    osm_physp_get_node_ptr( p_physp ))),
>> +              osm_physp_get_port_num( p_physp ) );
>> +     return FALSE;
>> +  }
>> +  else
>> +  {
>> +     osm_log( p_log, OSM_LOG_DEBUG,
>> +              "pkey_mgr_enforce_partition: "
>> +              "Set PortInfo for "
>> +              "node 0x%016" PRIx64 " port %u\n",
>> +              cl_ntoh64(
>> +                 osm_node_get_node_guid(
>> +                    osm_physp_get_node_ptr( p_physp ))),
>> +              osm_physp_get_port_num( p_physp ) );
>> +   return TRUE;
>> +  }
>>  }
>>  
>>  /**********************************************************************
>> @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
>>  
>>      status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, 
>> block_index );
>>      if (status == IB_SUCCESS)
>> -      ret_val = TRUE;
>> +  {
>> +   osm_log( p_log, OSM_LOG_DEBUG,
>> +      "pkey_mgr_update_port: "
>> +      "Updated "
>> +      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
>> +      block_index,
>> +      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
>> +      osm_physp_get_port_num( p_physp ) );
>> +   ret_val = TRUE;
>> +  }
>>      else
>> -      osm_log( p_log, OSM_LOG_ERROR,
>> -        "pkey_mgr_update_port: ERR 0506: "
>> -        "pkey_mgr_update_pkey_entry() failed to update "
>> -        "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
>> -        block_index,
>> -        cl_ntoh64( osm_node_get_node_guid( p_node ) ),
>> -        osm_physp_get_port_num( p_physp ) );
>> +  {
>> +   osm_log( p_log, OSM_LOG_ERROR,
>> +      "pkey_mgr_update_port: ERR 0506: "
>> +      "pkey_mgr_update_pkey_entry() failed to update "
>> +      "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
>> +      block_index,
>> +      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
>> +      osm_physp_get_port_num( p_physp ) );
>> +  }
>>    }
>>  
>>    return ret_val;
>> @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
>>    uint16_t peer_max_blocks;
>>    ib_api_status_t status = IB_SUCCESS;
>>    boolean_t ret_val = FALSE;
>> +  boolean_t port_info_set = FALSE;
>>    ib_pkey_table_t empty_block;
>> -
>> + 
>>    memset(&empty_block, 0, sizeof(ib_pkey_table_t));
>>  
>>    p_physp = osm_port_get_default_phys_ptr( p_port );
>> @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
>>      enforce = FALSE;
>>    }
>>  
>> -  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
>> -  {
>> -    osm_log( p_log, OSM_LOG_ERROR,
>> -      "pkey_mgr_update_peer_port: ERR 0507: "
>> -      "pkey_mgr_enforce_partition() failed to update "
>> -      "node 0x%016" PRIx64 " port %u\n",
>> -      cl_ntoh64( osm_node_get_node_guid( p_node ) ),
>> -      osm_physp_get_port_num( peer ) );
>> -  }
>> +  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
>> +   port_info_set = TRUE;
>>  
>>    if (enforce == FALSE)
>> -    return FALSE;
>> +  return port_info_set;
>>  
>>    p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
>>    for (block_index = 0; block_index < p_pkey_tbl->used_blocks; 
>> block_index++)
>> @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
>>               osm_physp_get_port_num( peer ) );
>>    }
>>  
>> +  if (port_info_set) return TRUE;
>>    return ret_val;
>>  }
>>  
>> @@ -541,10 +593,10 @@ osm_pkey_mgr_process(
>>        signal = OSM_SIGNAL_DONE_PENDING;
>>      p_node = osm_port_get_parent_node( p_port );
>>      if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
>> -  pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
>> +   pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
>>          &p_osm->subn, p_port,
>>          !p_osm->subn.opt.no_partition_enforcement ) )
>> -      signal = OSM_SIGNAL_DONE_PENDING;       
>> +      signal = OSM_SIGNAL_DONE_PENDING;
>>    }
>>  
>>   _err:
>> diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
>> index 9eac038..4e61259 100644
>> --- a/osm/opensm/osm_state_mgr.c
>> +++ b/osm/opensm/osm_state_mgr.c
>> @@ -1853,6 +1853,7 @@ osm_state_mgr_process(
>>  {
>>     ib_api_status_t status;
>>     osm_remote_sm_t *p_remote_sm;
>> + osm_signal_t tmp_signal;
>>  
>>     CL_ASSERT( p_mgr );
>>  
>> @@ -2075,11 +2076,10 @@ osm_state_mgr_process(
>>           case OSM_SIGNAL_CHANGE_DETECTED:
>>              /*
>>               * Nothing to do here.  One subnet change typcially
>> -             * begets another....
>> +             * begets another.... But needs to wait for all transactions
>>               */
>>              signal = OSM_SIGNAL_NONE;
>> -            break;
>> -
>>     
>
> This is a repeat of your previous submitted patch to this file so isn't
> needed.
>
>   
Yes I will resend.
> -- Hal
>
>   
>> +    break;
>>           case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
>>              /*
>>               * A change was detected on the subnet.
>> @@ -2219,7 +2219,10 @@ osm_state_mgr_process(
>>              signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
>>  
>>              /* the returned signal is always DONE */
>> -            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
>> +            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
>> +
>> +    if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
>> +     signal = OSM_SIGNAL_DONE_PENDING;
>>  
>>              /* try to restore SA DB (this should be before lid_mgr
>>                 because we may want to disable clients reregistration
>> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
>> index e977253..39973de 100644
>> --- a/osm/opensm/osm_ucast_mgr.c
>> +++ b/osm/opensm/osm_ucast_mgr.c
>> @@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table(
>>    ib_switch_info_t si;
>>    uint32_t block_id_ho = 0;
>>    uint8_t block[IB_SMP_DATA_SIZE];
>> +  boolean_t set_swinfo_require = FALSE;
>> +  uint16_t lin_top;
>> +  uint8_t life_state;
>>  
>>    CL_ASSERT( p_mgr );
>>  
>> @@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table(
>>      Set the top of the unicast forwarding table.
>>    */
>>    si = *osm_switch_get_si_ptr( p_sw );
>> -  si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
>> +  lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
>> +  if (si.lin_top != lin_top)
>> +  {
>> +   set_swinfo_require = TRUE;
>> +      si.lin_top  = lin_top;
>> +  }
>>  
>>    /* check to see if the change state bit is on. If it is - then we
>>       need to clear it. */
>> -   if( ib_switch_info_get_state_change( &si ) )
>> -    si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
>> -                      | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
>> +  if ( ib_switch_info_get_state_change( &si ) )
>> +      life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
>> +                          | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
>>    else
>> -    si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
>> +      life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
>>  
>> -  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
>> +  if (life_state != si.life_state)
>>    {
>> -    osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
>> -             "osm_ucast_mgr_set_fwd_table: "
>> -             "Setting switch FT top to LID 0x%X\n",
>> -             osm_switch_get_max_lid_ho( p_sw ) );
>> +      set_swinfo_require = TRUE;
>> +      si.life_state = life_state;
>>    }
>> -
>> -  context.si_context.light_sweep = FALSE;
>> -  context.si_context.node_guid = osm_node_get_node_guid( p_node );
>> -  context.si_context.set_method = TRUE;
>> -
>> -  status = osm_req_set( p_mgr->p_req,
>> -                        p_path,
>> -                        (uint8_t*)&si,
>> -                        sizeof(si),
>> -                        IB_MAD_ATTR_SWITCH_INFO,
>> -                        0,
>> -                        CL_DISP_MSGID_NONE,
>> -                        &context );
>> -
>> -  if( status != IB_SUCCESS )
>> + 
>> +  if ( set_swinfo_require )
>>    {
>> -    osm_log( p_mgr->p_log, OSM_LOG_ERROR,
>> -             "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
>> -             "Sending SwitchInfo attribute failed (%s)\n",
>> -             ib_get_err_str( status ) );
>> +      if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
>> +      {
>> +          osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
>> +                      "osm_ucast_mgr_set_fwd_table: "
>> +                      "Setting switch FT top to LID 0x%X\n",
>> +                      osm_switch_get_max_lid_ho( p_sw ) );
>> +      }
>> +     
>> +      context.si_context.light_sweep = FALSE;
>> +      context.si_context.node_guid = osm_node_get_node_guid( p_node );
>> +      context.si_context.set_method = TRUE;
>> +     
>> +      status = osm_req_set( p_mgr->p_req,
>> +                                    p_path,
>> +                                    (uint8_t*)&si,
>> +                                    sizeof(si),
>> +                                    IB_MAD_ATTR_SWITCH_INFO,
>> +                                    0,
>> +                                    CL_DISP_MSGID_NONE,
>> +                                    &context );
>> +     
>> +      if( status != IB_SUCCESS )
>> +      {
>> +          osm_log( p_mgr->p_log, OSM_LOG_ERROR,
>> +                      "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
>> +                      "Sending SwitchInfo attribute failed (%s)\n",
>> +                      ib_get_err_str( status ) );
>> +      }
>> +      else
>> +          p_mgr->any_change = TRUE;
>>    }
>>  
>>    /*
>> @@ -1215,13 +1234,14 @@ osm_ucast_mgr_process(
>>  
>>    CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
>>  
>> +  p_mgr->any_change = FALSE;
>> +
>>    /*
>>      If there are no switches in the subnet, we are done.
>>    */
>>    if (cl_qmap_count( p_sw_guid_tbl ) == 0)
>>      goto Exit;
>>  
>> -  p_mgr->any_change = FALSE;
>>    cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
>>  
>>    if (!p_routing_eng->build_lid_matrices ||
>> @@ -1248,14 +1268,20 @@ osm_ucast_mgr_process(
>>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
>>      __osm_ucast_mgr_dump_tables( p_mgr );
>>  
>> -  if (p_mgr->any_change)
>> +  if (p_mgr->any_change)
>> +  {
>>       signal = OSM_SIGNAL_DONE_PENDING;
>> +      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
>> +                 "osm_ucast_mgr_process: "
>> +                 "LFT Tables configured on all switches\n");
>> +  }
>>    else
>> +  {
>> +      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
>> +                 "osm_ucast_mgr_process: "
>> +                 "No need to set any LFT Tables on all switches\n");
>>       signal = OSM_SIGNAL_DONE;
>> -
>> -  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
>> -          "osm_ucast_mgr_process: "
>> -          "LFT Tables configured on all switches\n");
>> +  }
>>  
>>   Exit:
>>    CL_PLOCK_RELEASE( p_mgr->p_lock );
>>
>>
>>     
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   
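
The osm_ucast_mgr.c part of the patch quoted above sends SwitchInfo only when lin_top or life_state actually differ from what the switch already reports, so that a pending transaction is recorded only when a request really went out. A minimal sketch of that compare-before-set pattern follows. The struct and function names are hypothetical stand-ins and are not OpenSM types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the SwitchInfo fields the patch compares. */
struct sketch_switch_info {
    uint16_t lin_top;
    uint8_t life_state;
};

/*
 * Store the wanted values and report whether anything changed. A caller
 * would send the Set request, and later report pending transactions,
 * only when this returns true.
 */
static bool switch_info_update(struct sketch_switch_info *si,
                               uint16_t lin_top, uint8_t life_state)
{
    bool changed = false;

    if (si->lin_top != lin_top) {
        si->lin_top = lin_top;
        changed = true;
    }
    if (si->life_state != life_state) {
        si->life_state = life_state;
        changed = true;
    }
    return changed;
}
```

In the patch itself the equivalent flag is set_swinfo_require, and p_mgr->any_change is set only after osm_req_set() succeeds.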


From eitan at mellanox.co.il  Mon Dec 18 11:35:53 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 21:35:53 +0200
Subject: [openib-general] [PATCH] osm: fix a bug in ignroing pending
 transaction of Light Sweep
In-Reply-To: <1166454599.32666.185925.camel@hal.voltaire.com>
References: <45844167.9060302@mellanox.co.il>
	<1166454599.32666.185925.camel@hal.voltaire.com>
Message-ID: <4586ED99.4070908@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-16 at 13:56, Eitan Zahavi wrote:
>   
>> Hi Hal
>>
>> This patch fixes an issue discovered by the nightly regression.
>> The OpenSM state machine got stuck because a pending SwitchInfo
>> transaction was ignored after one of the SwitchInfo queries
>> failed (due to a bad link).
>> The patch below simply avoids aborting the wait for all SwitchInfo 
>> requests to return.
>>
>> I think this issue might have hurt us in other situations too, since it 
>> aborted the wait on "CHANGE DETECTED" too.
>> CHANGE_DETECTED is fired on the first switch that reported "Change Bit".
>>
>> It is possible that the issue is showing up as we added incremental 
>> support (e.g. for routing),
>> since only if there are no other SMPs sent during the heavy sweep will 
>> we get the
>> "NO_PENDING_TRANSACTIONS" signal caused by the SwitchInfo requests.
>>     
>
> So is the same issue applicable to OFED 1.1 ?
>   
Yes it is.
>   
>> Eitan
>>
>> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il >
>>
>>  osm/opensm/osm_state_mgr.c |    5 ++---
>>  1 files changed, 2 insertions(+), 3 deletions(-)
>>     
>
> Thanks. Applied.
>
> -- Hal
>


From sashak at voltaire.com  Mon Dec 18 12:07:06 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 18 Dec 2006 22:07:06 +0200
Subject: [openib-general] [PATCH TRIVIAL] opensm/autogen.sh: error message
	fix
Message-ID: <20061218200706.GA12834@sashak.voltaire.com>


Trivial error message fixes in osm/autogen.sh

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/autogen.sh |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/osm/autogen.sh b/osm/autogen.sh
index 6570426..e463c0e 100755
--- a/osm/autogen.sh
+++ b/osm/autogen.sh
@@ -40,10 +40,10 @@ if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
     exit 1
 elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
     exit 1
 elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
     exit 1
 fi
 
-- 
1.4.4.2.gfc82d
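
The three branches touched by the hunk above amount to a field-by-field comparison of the detected libtool version against the minimum 1.4.2. As a rough illustration of the same logic outside the shell script, here is a small C sketch. The function name libtool_version_ok is hypothetical and is not part of autogen.sh.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: true when maj.min.sub is at least 1.4.2,
 * comparing major first, then minor, then the sub-version.
 */
static bool libtool_version_ok(int maj, int min, int sub)
{
    if (maj != 1)
        return maj > 1;
    if (min != 4)
        return min > 4;
    return sub >= 2;
}
```

Each early return corresponds to one of the script's branches: a lower major version fails outright, a lower minor fails when the major is 1, and the sub-version only matters for 1.4.x.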


From eitan at mellanox.co.il  Mon Dec 18 12:15:19 2006
From: eitan at mellanox.co.il (eitan at mellanox.co.il)
Date: Mon, 18 Dec 2006 22:15:19 +0200
Subject: [openib-general] [PATCH] osm: state manager return wrong signal
Message-ID: <1166472919660-git-send-email-eitan@mellanox.co.il>

From: Eitan Zahavi <eitan at sw053.yok.mtl.com>

diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 9eac038..94cc095 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1853,6 +1853,7 @@ osm_state_mgr_process(
 {
    ib_api_status_t status;
    osm_remote_sm_t *p_remote_sm;
+   osm_signal_t tmp_signal;
 
    CL_ASSERT( p_mgr );
 
@@ -2075,11 +2076,10 @@ osm_state_mgr_process(
          case OSM_SIGNAL_CHANGE_DETECTED:
             /*
              * Nothing to do here.  One subnet change typcially
-             * begets another....
+             * begets another.... But needs to wait for all transactions
              */
             signal = OSM_SIGNAL_NONE;
             break;
-
          case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
             /*
              * A change was detected on the subnet.
@@ -2219,7 +2219,10 @@ osm_state_mgr_process(
             signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
 
             /* the returned signal is always DONE */
-            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+
+            if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
+               signal = OSM_SIGNAL_DONE_PENDING;
 
             /* try to restore SA DB (this should be before lid_mgr
                because we may want to disable clients reregistration
-- 
1.4.4.1.GIT
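
The last hunk of the patch above stops osm_qos_setup() from overwriting a DONE_PENDING result that osm_pkey_mgr_process() may already have returned. The rule can be sketched as a small merge function. The enum and function names below are hypothetical stand-ins for illustration, reduced to two values, and are not OpenSM's osm_signal_t.

```c
/* Hypothetical stand-ins for the two OpenSM signal values used in the patch. */
enum sketch_signal {
    SKETCH_SIGNAL_DONE,
    SKETCH_SIGNAL_DONE_PENDING
};

/*
 * Combine the results of two consecutive setup stages. If either stage
 * left transactions outstanding, the combined result must stay
 * DONE_PENDING so the state machine waits for them.
 */
static enum sketch_signal merge_signals(enum sketch_signal first,
                                        enum sketch_signal second)
{
    if (first == SKETCH_SIGNAL_DONE_PENDING ||
        second == SKETCH_SIGNAL_DONE_PENDING)
        return SKETCH_SIGNAL_DONE_PENDING;
    return SKETCH_SIGNAL_DONE;
}
```

Before the patch, the second assignment simply replaced the first result, which is the case where first is DONE_PENDING and second is DONE.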


From eitan at mellanox.co.il  Mon Dec 18 12:17:54 2006
From: eitan at mellanox.co.il (eitan at mellanox.co.il)
Date: Mon, 18 Dec 2006 22:17:54 +0200
Subject: [openib-general] [PATCH] osm: pkey manager returns wrong signal
Message-ID: <11664730741410-git-send-email-eitan@mellanox.co.il>

Fix cases where the pkey manager returned OSM_SIGNAL_DONE instead of
OSM_SIGNAL_DONE_PENDING because it missed some of the packets it had sent
---
 osm/opensm/osm_pkey_mgr.c |  112 +++++++++++++++++++++++++++++++++------------
 1 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index 48837bc..a33aec7 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
 
 /**********************************************************************
  **********************************************************************/
-static ib_api_status_t
+static boolean_t
 pkey_mgr_enforce_partition(
+  IN osm_log_t *p_log,
   IN const osm_req_t *p_req,
   IN const osm_physp_t *p_physp,
   IN const boolean_t enforce)
@@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
   osm_madw_context_t context;
   uint8_t payload[IB_SMP_DATA_SIZE];
   ib_port_info_t *p_pi;
+  ib_api_status_t status;
 
   if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
-    return IB_ERROR;
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0507: "
+              "No port info for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid( 
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
 
-  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
-    return IB_SUCCESS;
+  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE)) 
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "No need to update PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid( 
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
   memset( payload, 0, IB_SMP_DATA_SIZE );
   memcpy( payload, p_pi, sizeof(ib_port_info_t) );
@@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
   context.pi_context.light_sweep = FALSE;
   context.pi_context.active_transition = FALSE;
 
-  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
-                      payload, sizeof(payload),
-                      IB_MAD_ATTR_PORT_INFO,
-                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
-                      CL_DISP_MSGID_NONE, &context );
+  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
+								payload, sizeof(payload),
+								IB_MAD_ATTR_PORT_INFO,
+								cl_hton32( osm_physp_get_port_num( p_physp ) ),
+								CL_DISP_MSGID_NONE, &context );
+  if (status != IB_SUCCESS) 
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0520: "
+              "Failed to set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid( 
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
+  else
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "Set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid( 
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+	  return TRUE;
+  }
 }
 
 /**********************************************************************
@@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
 
     status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index );
     if (status == IB_SUCCESS)
-      ret_val = TRUE;
+	 {
+		 osm_log( p_log, OSM_LOG_DEBUG,
+					 "pkey_mgr_update_port: "
+					 "Updated "
+					 "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+					 block_index,
+					 cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+					 osm_physp_get_port_num( p_physp ) );
+		 ret_val = TRUE;
+	 }
     else
-      osm_log( p_log, OSM_LOG_ERROR,
-	       "pkey_mgr_update_port: ERR 0506: "
-	       "pkey_mgr_update_pkey_entry() failed to update "
-	       "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
-	       block_index,
-	       cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-	       osm_physp_get_port_num( p_physp ) );
+	 {
+		 osm_log( p_log, OSM_LOG_ERROR,
+					 "pkey_mgr_update_port: ERR 0506: "
+					 "pkey_mgr_update_pkey_entry() failed to update "
+					 "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+					 block_index,
+					 cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+					 osm_physp_get_port_num( p_physp ) );
+	 }
   }
 
   return ret_val;
@@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
   uint16_t peer_max_blocks;
   ib_api_status_t status = IB_SUCCESS;
   boolean_t ret_val = FALSE;
+  boolean_t port_info_set = FALSE;
   ib_pkey_table_t empty_block;
-
+  
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
   p_physp = osm_port_get_default_phys_ptr( p_port );
@@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
     enforce = FALSE;
   }
 
-  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-	     "pkey_mgr_update_peer_port: ERR 0507: "
-	     "pkey_mgr_enforce_partition() failed to update "
-	     "node 0x%016" PRIx64 " port %u\n",
-	     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-	     osm_physp_get_port_num( peer ) );
-  }
+  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
+	  port_info_set = TRUE;
 
   if (enforce == FALSE)
-    return FALSE;
+	 return port_info_set;
 
   p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
   for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
@@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
              osm_physp_get_port_num( peer ) );
   }
 
+  if (port_info_set) return TRUE;
   return ret_val;
 }
 
@@ -541,10 +593,10 @@ osm_pkey_mgr_process(
       signal = OSM_SIGNAL_DONE_PENDING;
     p_node = osm_port_get_parent_node( p_port );
     if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
-	 pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, 
+			pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req, 
 				    &p_osm->subn, p_port,
 				    !p_osm->subn.opt.no_partition_enforcement ) )
-      signal = OSM_SIGNAL_DONE_PENDING;        
+      signal = OSM_SIGNAL_DONE_PENDING;
   }
 
  _err:
-- 
1.4.4.1.GIT
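
With the patch above, pkey_mgr_enforce_partition() returns TRUE only when a PortInfo Set was actually sent, and its early-return test decides whether a Set is needed at all. That test can be written more explicitly as below. This is a sketch only: the helper name is hypothetical, and the 0x0c mask is taken from the patch and assumed here to cover the two partition enforcement bits of the vl_enforce field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask tested by the patch; assumed to cover the two partition enforcement bits. */
#define ENFORCE_MASK 0x0cu

/*
 * Hypothetical helper: true when the enforcement bits in vl_enforce do
 * not already match the requested setting, i.e. when a PortInfo Set
 * would have to be sent.
 */
static bool enforcement_needs_update(uint8_t vl_enforce, bool enforce)
{
    uint8_t wanted = enforce ? ENFORCE_MASK : 0;

    return (vl_enforce & ENFORCE_MASK) != wanted;
}
```

The patch expresses the same comparison as (0xc)*(enforce == TRUE), returning FALSE when the bits already match.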


From eitan at mellanox.co.il  Mon Dec 18 12:19:34 2006
From: eitan at mellanox.co.il (eitan at mellanox.co.il)
Date: Mon, 18 Dec 2006 22:19:34 +0200
Subject: [openib-general] [PATCH] osm: ucast manager return wrong signal
Message-ID: <1166473174486-git-send-email-eitan@mellanox.co.il>

Fix an issue where OSM_SIGNAL_DONE_PENDING was not returned when
SwitchInfo was sent
---
 osm/opensm/osm_ucast_mgr.c |   96 ++++++++++++++++++++++++++++----------------
 1 files changed, 61 insertions(+), 35 deletions(-)

diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index e977253..8cfe09e 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table(
   ib_switch_info_t si;
   uint32_t block_id_ho = 0;
   uint8_t block[IB_SMP_DATA_SIZE];
+  boolean_t set_swinfo_require = FALSE;
+  uint16_t lin_top;
+  uint8_t life_state;
 
   CL_ASSERT( p_mgr );
 
@@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table(
     Set the top of the unicast forwarding table.
   */
   si = *osm_switch_get_si_ptr( p_sw );
-  si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  if (si.lin_top != lin_top) 
+  {
+     set_swinfo_require = TRUE;
+     si.lin_top  = lin_top;
+  }
 
   /* check to see if the change state bit is on. If it is - then we
      need to clear it. */
-   if( ib_switch_info_get_state_change( &si ) )
-    si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
-                      | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
+  if ( ib_switch_info_get_state_change( &si ) )
+     life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
+                    | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
   else
-    si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
+     life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
 
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+  if (life_state != si.life_state)
   {
-    osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-             "osm_ucast_mgr_set_fwd_table: "
-             "Setting switch FT top to LID 0x%X\n",
-             osm_switch_get_max_lid_ho( p_sw ) );
+     set_swinfo_require = TRUE;
+     si.life_state = life_state;
   }
-
-  context.si_context.light_sweep = FALSE;
-  context.si_context.node_guid = osm_node_get_node_guid( p_node );
-  context.si_context.set_method = TRUE;
-
-  status = osm_req_set( p_mgr->p_req,
-                        p_path,
-                        (uint8_t*)&si,
-                        sizeof(si),
-                        IB_MAD_ATTR_SWITCH_INFO,
-                        0,
-                        CL_DISP_MSGID_NONE,
-                        &context );
-
-  if( status != IB_SUCCESS )
+  
+  if ( set_swinfo_require )
   {
-    osm_log( p_mgr->p_log, OSM_LOG_ERROR,
-             "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
-             "Sending SwitchInfo attribute failed (%s)\n",
-             ib_get_err_str( status ) );
+     if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+     {
+        osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+                 "osm_ucast_mgr_set_fwd_table: "
+                 "Setting switch FT top to LID 0x%X\n",
+                 osm_switch_get_max_lid_ho( p_sw ) );
+     }
+     
+     context.si_context.light_sweep = FALSE;
+     context.si_context.node_guid = osm_node_get_node_guid( p_node );
+     context.si_context.set_method = TRUE;
+     
+     status = osm_req_set( p_mgr->p_req,
+                           p_path,
+                           (uint8_t*)&si,
+                           sizeof(si),
+                           IB_MAD_ATTR_SWITCH_INFO,
+                           0,
+                           CL_DISP_MSGID_NONE,
+                           &context );
+     
+     if( status != IB_SUCCESS )
+     {
+        osm_log( p_mgr->p_log, OSM_LOG_ERROR,
+                 "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
+                 "Sending SwitchInfo attribute failed (%s)\n",
+                 ib_get_err_str( status ) );
+     } 
+     else 
+        p_mgr->any_change = TRUE;
   }
 
   /*
@@ -1215,13 +1234,14 @@ osm_ucast_mgr_process(
 
   CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
 
+  p_mgr->any_change = FALSE;
+
   /*
     If there are no switches in the subnet, we are done.
   */
   if (cl_qmap_count( p_sw_guid_tbl ) == 0)
     goto Exit;
 
-  p_mgr->any_change = FALSE;
   cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
 
   if (!p_routing_eng->build_lid_matrices ||
@@ -1248,14 +1268,20 @@ osm_ucast_mgr_process(
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
     __osm_ucast_mgr_dump_tables( p_mgr );
 
-  if (p_mgr->any_change)
+  if (p_mgr->any_change) 
+  {
      signal = OSM_SIGNAL_DONE_PENDING;
+	  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+				 "osm_ucast_mgr_process: "
+				 "LFT Tables configured on all switches\n");
+  }
   else
+  {
+	  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+				 "osm_ucast_mgr_process: "
+				 "No need to set any LFT Tables on all switches\n");
      signal = OSM_SIGNAL_DONE;
-
-  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
-          "osm_ucast_mgr_process: "
-          "LFT Tables configured on all switches\n");
+  }
 
  Exit:
   CL_PLOCK_RELEASE( p_mgr->p_lock );
-- 
1.4.4.1.GIT


From sashak at voltaire.com  Mon Dec 18 13:18:14 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Mon, 18 Dec 2006 23:18:14 +0200
Subject: [openib-general] [PATCH] ibutils: autogen.sh(s) fixes
Message-ID: <20061218211814.GC12834@sashak.voltaire.com>


A couple of fixes to tool version detection and verification
(similar to r9976):
- regular expression fix - proper version string separation
- numeric comparison for extracted version elements
- non-zero exit status when old tools are detected
- slightly improved condition statements
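
To illustrate the first item, here is a minimal sketch of how the new sed
expressions split a version string. split_version is a hypothetical helper
used only for this illustration (it is not part of autogen.sh), and printing
0 for a missing sub-version is an assumption of the sketch.

```shell
#!/bin/bash
# Illustration only: split_version is a hypothetical helper, not part of autogen.sh.
split_version() {
    local ver=$1 maj min sub
    maj=$(echo "$ver" | sed 's/\..*//')
    min=$(echo "$ver" | sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/')
    sub=$(echo "$ver" | sed 's/[^\.]*\.[^\.]*\.*//')
    # A two-part version leaves sub empty; printing 0 is an assumption of this sketch.
    echo "$maj $min ${sub:-0}"
}

split_version 1.9.2   # prints "1 9 2"
split_version 1.10    # prints "1 10 0"

# The old minor-version expression needs two dots, so it leaves "1.10" unchanged.
old_min=$(echo 1.10 | sed 's/.*\.\([^\.]*\)\..*/\1/')
echo "old expression on 1.10: $old_min"   # prints "old expression on 1.10: 1.10"
```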

Originally autogen.sh was claiming that automake-1.10 is older than
automake-1.9.2
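
A small sketch of why the comparison operator matters, assuming bash: inside
[[ ]] the '<' operator compares strings, while -lt compares integers. The
variable names string_cmp and numeric_cmp exist only in this example.

```shell
#!/bin/bash
# Minor version of a hypothetical automake 1.10, checked against the required minor 9.
am_min=10

# '<' inside [[ ]] is a string comparison: "10" sorts before "9".
if [[ $am_min < 9 ]]; then
    string_cmp="older"
else
    string_cmp="ok"
fi

# -lt is an integer comparison: 10 is not less than 9.
if [[ $am_min -lt 9 ]]; then
    numeric_cmp="older"
else
    numeric_cmp="ok"
fi

echo "string comparison:  $string_cmp"    # prints "string comparison:  older"
echo "numeric comparison: $numeric_cmp"   # prints "numeric comparison: ok"
```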

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 autogen.sh          |   57 +++++++++++++++++++++--------------------------
 ibdiag/autogen.sh   |   61 +++++++++++++++++++++++---------------------------
 ibdm/autogen.sh     |   61 +++++++++++++++++++++++---------------------------
 ibis/autogen.sh     |   55 +++++++++++++++++++++-------------------------
 ibmgtsim/autogen.sh |   55 +++++++++++++++++++++-------------------------
 5 files changed, 132 insertions(+), 157 deletions(-)

diff --git a/autogen.sh b/autogen.sh
index 30727a8..3a560b5 100755
--- a/autogen.sh
+++ b/autogen.sh
@@ -1,53 +1,48 @@
-#!/bin/bash 
+#!/bin/bash
 cd ${0%*/*}
 
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then 
+if [[ $ac_maj -lt 2 ]]; then
     echo Min autoconf version is 2.59
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo Min autoconf version is 2.59
-    exit
+    exit 1
 fi
 
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then 
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then 
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
 fi
 
 # cleanup
diff --git a/ibdiag/autogen.sh b/ibdiag/autogen.sh
index 60732a8..0ce2866 100755
--- a/ibdiag/autogen.sh
+++ b/ibdiag/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/bash 
+#!/bin/bash
 
 # We change dir since the later utilities assume to work in the project dir
 cd ${0%*/*}
 # remove previous
-\rm -rf autom4te.cache 
+\rm -rf autom4te.cache
 \rm -rf aclocal.m4
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then 
+if [[ $ac_maj -lt 2 ]]; then
     echo Min autoconf version is 2.59
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo Min autoconf version is 2.59
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then 
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then 
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
 fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-    
+
 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
 libtoolize --automake
 automake --add-missing --gnu
diff --git a/ibdm/autogen.sh b/ibdm/autogen.sh
index d8f08d8..51163c9 100755
--- a/ibdm/autogen.sh
+++ b/ibdm/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/bash 
+#!/bin/bash
 
 # We change dir since the later utilities assume to work in the project dir
 cd ${0%*/*}
 # remove previous
-\rm -rf autom4te.cache 
+\rm -rf autom4te.cache
 \rm -rf aclocal.m4
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then 
+if [[ $ac_maj -lt 2 ]]; then
     echo Min autoconf version is 2.59
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo Min autoconf version is 2.59
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then 
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then 
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
+    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
+    exit 1
 fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
-    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-    
+
 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
 libtoolize --automake --copy
 automake --add-missing --gnu --copy
diff --git a/ibis/autogen.sh b/ibis/autogen.sh
index f3ed611..ae545b5 100755
--- a/ibis/autogen.sh
+++ b/ibis/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/sh 
+#!/bin/sh
 
 cd ${0%*/*}
 \rm -rf autom4te.cache
 \rm -rf aclocal.m4
 \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then 
+if [[ $ac_maj -lt 2 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then 
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then 
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
+    exit 1
 fi
 
 aclocal -I config 2>&1 |  grep -v "arning: underquoted definition of"
-libtoolize --automake --copy 
+libtoolize --automake --copy
 automake --add-missing --gnu --copy --force
 autoconf
diff --git a/ibmgtsim/autogen.sh b/ibmgtsim/autogen.sh
index 456c203..e48b0ac 100755
--- a/ibmgtsim/autogen.sh
+++ b/ibmgtsim/autogen.sh
@@ -1,57 +1,52 @@
-#!/bin/sh 
+#!/bin/sh
 
 cd ${0%*/*}
 \rm -rf autom4te.cache
 \rm -rf aclocal.m4
 \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
 # make sure autoconf is up-to-date
-ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
+ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
 ac_maj=`echo $ac_ver|sed 's/\..*//'`
 ac_min=`echo $ac_ver|sed 's/.*\.//'`
-if [[ $ac_maj < 2 ]]; then 
+if [[ $ac_maj -lt 2 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
-fi
-if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
+    exit 1
+elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
     echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
-    exit
+    exit 1
 fi
 # make sure automake is up-to-date
-am_ver=`automake --version | head -1 | awk '{print $NF}'`
+am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
 am_maj=`echo $am_ver|sed 's/\..*//'`
-am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-am_sub=`echo $am_ver|sed 's/.*\.//'`
-if [[ $am_maj < 1 ]]; then 
+am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $am_maj -lt 1 ]]; then
     echo Min automake version is 1.9.2
-    exit
-fi
-if [[ $am_maj = 1 && $am_min < 9 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
-fi
-if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
+    exit 1
+elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
     echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
-    exit
+    exit 1
 fi
 # make sure libtool is up-to-date
-lt_ver=`libtool --version | head -1 | awk '{print $4}'`
+lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
 lt_maj=`echo $lt_ver|sed 's/\..*//'`
-lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
-lt_sub=`echo $lt_ver|sed 's/.*\.//'`
-if [[ $lt_maj < 1 ]]; then 
+lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
+lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
+if [[ $lt_maj -lt 1 ]]; then
     echo Min libtool version is 1.4.2
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
-fi
-if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
+    exit 1
+elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
     echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
-    exit
+    exit 1
 fi
 
 aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
-libtoolize --automake --copy --force 
+libtoolize --automake --copy --force
 automake --add-missing --copy --gnu --force
 autoconf
-- 
1.4.4.2.gfc82d


From eitan at mellanox.co.il  Mon Dec 18 13:24:20 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 23:24:20 +0200
Subject: [openib-general] [PATCH] ibutils: autogen.sh(s) fixes
In-Reply-To: <20061218211814.GC12834@sashak.voltaire.com>
References: <20061218211814.GC12834@sashak.voltaire.com>
Message-ID: <45870704.2000003@mellanox.co.il>

Thanks, applied.

Sasha Khapyorsky wrote:
> A couple of fixes to tool version detection and verification
> (similar to r9976):
> - regular expression fix - proper version string separation
> - numeric comparison for extracted version elements
> - non-zero exit status when old tools are detected
> - slightly improved condition statements
>
> Originally autogen.sh was claiming that automake-1.10 is older than
> automake-1.9.2
>
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> ---
>  autogen.sh          |   57 +++++++++++++++++++++--------------------------
>  ibdiag/autogen.sh   |   61 +++++++++++++++++++++++---------------------------
>  ibdm/autogen.sh     |   61 +++++++++++++++++++++++---------------------------
>  ibis/autogen.sh     |   55 +++++++++++++++++++++-------------------------
>  ibmgtsim/autogen.sh |   55 +++++++++++++++++++++-------------------------
>  5 files changed, 132 insertions(+), 157 deletions(-)
>
> diff --git a/autogen.sh b/autogen.sh
> index 30727a8..3a560b5 100755
> --- a/autogen.sh
> +++ b/autogen.sh
> @@ -1,53 +1,48 @@
> -#!/bin/bash 
> +#!/bin/bash
>  cd ${0%*/*}
>  
>  # make sure autoconf is up-to-date
> -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
> +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
>  ac_maj=`echo $ac_ver|sed 's/\..*//'`
>  ac_min=`echo $ac_ver|sed 's/.*\.//'`
> -if [[ $ac_maj < 2 ]]; then 
> +if [[ $ac_maj -lt 2 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> -fi
> -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
> +    exit 1
> +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> +    exit 1
>  fi
>  
>  # make sure automake is up-to-date
> -am_ver=`automake --version | head -1 | awk '{print $NF}'`
> +am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
>  am_maj=`echo $am_ver|sed 's/\..*//'`
> -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -am_sub=`echo $am_ver|sed 's/.*\.//'`
> -if [[ $am_maj < 1 ]]; then 
> +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $am_maj -lt 1 ]]; then
>      echo Min automake version is 1.9.2
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min < 9 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> +    exit 1
>  fi
>  
>  # make sure libtool is up-to-date
> -lt_ver=`libtool --version | head -1 | awk '{print $4}'`
> +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
>  lt_maj=`echo $lt_ver|sed 's/\..*//'`
> -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -lt_sub=`echo $lt_ver|sed 's/.*\.//'`
> -if [[ $lt_maj < 1 ]]; then 
> +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $lt_maj -lt 1 ]]; then
>      echo Min libtool version is 1.4.2
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
>  fi
>  
>  # cleanup
> diff --git a/ibdiag/autogen.sh b/ibdiag/autogen.sh
> index 60732a8..0ce2866 100755
> --- a/ibdiag/autogen.sh
> +++ b/ibdiag/autogen.sh
> @@ -1,57 +1,52 @@
> -#!/bin/bash 
> +#!/bin/bash
>  
>  # We change dir since the later utilities assume to work in the project dir
>  cd ${0%*/*}
>  # remove previous
> -\rm -rf autom4te.cache 
> +\rm -rf autom4te.cache
>  \rm -rf aclocal.m4
>  # make sure autoconf is up-to-date
> -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
> +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
>  ac_maj=`echo $ac_ver|sed 's/\..*//'`
>  ac_min=`echo $ac_ver|sed 's/.*\.//'`
> -if [[ $ac_maj < 2 ]]; then 
> +if [[ $ac_maj -lt 2 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> -fi
> -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
> +    exit 1
> +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> +    exit 1
>  fi
>  # make sure automake is up-to-date
> -am_ver=`automake --version | head -1 | awk '{print $NF}'`
> +am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
>  am_maj=`echo $am_ver|sed 's/\..*//'`
> -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -am_sub=`echo $am_ver|sed 's/.*\.//'`
> -if [[ $am_maj < 1 ]]; then 
> +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $am_maj -lt 1 ]]; then
>      echo Min automake version is 1.9.2
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min < 9 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> +    exit 1
>  fi
>  # make sure libtool is up-to-date
> -lt_ver=`libtool --version | head -1 | awk '{print $4}'`
> +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
>  lt_maj=`echo $lt_ver|sed 's/\..*//'`
> -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -lt_sub=`echo $lt_ver|sed 's/.*\.//'`
> -if [[ $lt_maj < 1 ]]; then 
> +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $lt_maj -lt 1 ]]; then
>      echo Min libtool version is 1.4.2
> -    exit
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
>  fi
> -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -    
> +
>  aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
>  libtoolize --automake
>  automake --add-missing --gnu
> diff --git a/ibdm/autogen.sh b/ibdm/autogen.sh
> index d8f08d8..51163c9 100755
> --- a/ibdm/autogen.sh
> +++ b/ibdm/autogen.sh
> @@ -1,57 +1,52 @@
> -#!/bin/bash 
> +#!/bin/bash
>  
>  # We change dir since the later utilities assume to work in the project dir
>  cd ${0%*/*}
>  # remove previous
> -\rm -rf autom4te.cache 
> +\rm -rf autom4te.cache
>  \rm -rf aclocal.m4
>  # make sure autoconf is up-to-date
> -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
> +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
>  ac_maj=`echo $ac_ver|sed 's/\..*//'`
>  ac_min=`echo $ac_ver|sed 's/.*\.//'`
> -if [[ $ac_maj < 2 ]]; then 
> +if [[ $ac_maj -lt 2 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> -fi
> -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
> +    exit 1
> +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
>      echo Min autoconf version is 2.59
> -    exit
> +    exit 1
>  fi
>  # make sure automake is up-to-date
> -am_ver=`automake --version | head -1 | awk '{print $NF}'`
> +am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
>  am_maj=`echo $am_ver|sed 's/\..*//'`
> -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -am_sub=`echo $am_ver|sed 's/.*\.//'`
> -if [[ $am_maj < 1 ]]; then 
> +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $am_maj -lt 1 ]]; then
>      echo Min automake version is 1.9.2
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min < 9 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> +    exit 1
>  fi
>  # make sure libtool is up-to-date
> -lt_ver=`libtool --version | head -1 | awk '{print $4}'`
> +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
>  lt_maj=`echo $lt_ver|sed 's/\..*//'`
> -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -lt_sub=`echo $lt_ver|sed 's/.*\.//'`
> -if [[ $lt_maj < 1 ]]; then 
> +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $lt_maj -lt 1 ]]; then
>      echo Min libtool version is 1.4.2
> -    exit
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
> +    echo "libtool version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> +    exit 1
>  fi
> -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
> -    echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -    
> +
>  aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
>  libtoolize --automake --copy
>  automake --add-missing --gnu --copy
> diff --git a/ibis/autogen.sh b/ibis/autogen.sh
> index f3ed611..ae545b5 100755
> --- a/ibis/autogen.sh
> +++ b/ibis/autogen.sh
> @@ -1,57 +1,52 @@
> -#!/bin/sh 
> +#!/bin/sh
>  
>  cd ${0%*/*}
>  \rm -rf autom4te.cache
>  \rm -rf aclocal.m4
>  \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
>  # make sure autoconf is up-to-date
> -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
> +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
>  ac_maj=`echo $ac_ver|sed 's/\..*//'`
>  ac_min=`echo $ac_ver|sed 's/.*\.//'`
> -if [[ $ac_maj < 2 ]]; then 
> +if [[ $ac_maj -lt 2 ]]; then
>      echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
> -    exit
> -fi
> -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
> +    exit 1
> +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
>      echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
> -    exit
> +    exit 1
>  fi
>  # make sure automake is up-to-date
> -am_ver=`automake --version | head -1 | awk '{print $NF}'`
> +am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
>  am_maj=`echo $am_ver|sed 's/\..*//'`
> -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -am_sub=`echo $am_ver|sed 's/.*\.//'`
> -if [[ $am_maj < 1 ]]; then 
> +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $am_maj -lt 1 ]]; then
>      echo Min automake version is 1.9.2
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min < 9 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> +    exit 1
>  fi
>  # make sure libtool is up-to-date
> -lt_ver=`libtool --version | head -1 | awk '{print $4}'`
> +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
>  lt_maj=`echo $lt_ver|sed 's/\..*//'`
> -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -lt_sub=`echo $lt_ver|sed 's/.*\.//'`
> -if [[ $lt_maj < 1 ]]; then 
> +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $lt_maj -lt 1 ]]; then
>      echo Min libtool version is 1.4.2
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
>      echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
>      echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> +    exit 1
>  fi
>  
>  aclocal -I config 2>&1 |  grep -v "arning: underquoted definition of"
> -libtoolize --automake --copy 
> +libtoolize --automake --copy
>  automake --add-missing --gnu --copy --force
>  autoconf
> diff --git a/ibmgtsim/autogen.sh b/ibmgtsim/autogen.sh
> index 456c203..e48b0ac 100755
> --- a/ibmgtsim/autogen.sh
> +++ b/ibmgtsim/autogen.sh
> @@ -1,57 +1,52 @@
> -#!/bin/sh 
> +#!/bin/sh
>  
>  cd ${0%*/*}
>  \rm -rf autom4te.cache
>  \rm -rf aclocal.m4
>  \rm -f config/missing config/install-sh config/depcomp config/mkinstalldirs config/ltmain.sh config/config.sub config/config.guess
>  # make sure autoconf is up-to-date
> -ac_ver=`autoconf --version | head -1 | awk '{print $NF}'`
> +ac_ver=`autoconf --version | head -n 1 | awk '{print $NF}'`
>  ac_maj=`echo $ac_ver|sed 's/\..*//'`
>  ac_min=`echo $ac_ver|sed 's/.*\.//'`
> -if [[ $ac_maj < 2 ]]; then 
> +if [[ $ac_maj -lt 2 ]]; then
>      echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
> -    exit
> -fi
> -if [[ $ac_maj = 2 && $ac_min < 59 ]]; then 
> +    exit 1
> +elif [[ $ac_maj -eq 2 && $ac_min -lt 59 ]]; then
>      echo "autoconf version is too old:$ac_maj.$ac_min < required 2.59"
> -    exit
> +    exit 1
>  fi
>  # make sure automake is up-to-date
> -am_ver=`automake --version | head -1 | awk '{print $NF}'`
> +am_ver=`automake --version | head -n 1 | awk '{print $NF}'`
>  am_maj=`echo $am_ver|sed 's/\..*//'`
> -am_min=`echo $am_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -am_sub=`echo $am_ver|sed 's/.*\.//'`
> -if [[ $am_maj < 1 ]]; then 
> +am_min=`echo $am_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +am_sub=`echo $am_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $am_maj -lt 1 ]]; then
>      echo Min automake version is 1.9.2
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min < 9 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -lt 9 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> -fi
> -if [[ $am_maj = 1 && $am_min = 9 && $am_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $am_maj -eq 1 && $am_min -eq 9 && $am_sub -lt 2 ]]; then
>      echo "automake version is too old:$am_maj.$am_min.$am_sub < required 1.9.2"
> -    exit
> +    exit 1
>  fi
>  # make sure libtool is up-to-date
> -lt_ver=`libtool --version | head -1 | awk '{print $4}'`
> +lt_ver=`libtool --version | head -n 1 | awk '{print $4}'`
>  lt_maj=`echo $lt_ver|sed 's/\..*//'`
> -lt_min=`echo $lt_ver|sed 's/.*\.\([^\.]*\)\..*/\1/'`
> -lt_sub=`echo $lt_ver|sed 's/.*\.//'`
> -if [[ $lt_maj < 1 ]]; then 
> +lt_min=`echo $lt_ver|sed 's/[^\.]*\.\([^\.]*\)\.*.*/\1/'`
> +lt_sub=`echo $lt_ver|sed 's/[^\.]*\.[^\.]*\.*//'`
> +if [[ $lt_maj -lt 1 ]]; then
>      echo Min libtool version is 1.4.2
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min < 4 ]]; then 
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -lt 4 ]]; then
>      echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> -fi
> -if [[ $lt_maj = 1 && $lt_min = 4 && $lt_sub < 2 ]]; then 
> +    exit 1
> +elif [[ $lt_maj -eq 1 && $lt_min -eq 4 && $lt_sub -lt 2 ]]; then
>      echo "automake version is too old:$lt_maj.$lt_min.$lt_sub < required 1.4.2"
> -    exit
> +    exit 1
>  fi
>  
>  aclocal -I config 2>&1 | grep -v "warning: underquoted definition "
> -libtoolize --automake --copy --force 
> +libtoolize --automake --copy --force
>  automake --add-missing --copy --gnu --force
>  autoconf
>   


From eitan at mellanox.co.il  Mon Dec 18 13:38:17 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 23:38:17 +0200
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager fail
 to report back correct signal
Message-ID: <45870A49.1070205@mellanox.co.il>

Hi Hal,

This is a resend, as I did not see the list echo back my previous
posting, which I sent using git-send-email (probably due to misuse).
The following patch fixes bugs in the ucast manager and pkey manager
that caused them to report an incorrect signal back.
In both cases some outstanding SubnSet requests were ignored.

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

--------------------------------------------------------------------------------------------
diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index 48837bc..a33aec7 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
 
 /**********************************************************************
  **********************************************************************/
-static ib_api_status_t
+static boolean_t
 pkey_mgr_enforce_partition(
+  IN osm_log_t *p_log,
   IN const osm_req_t *p_req,
   IN const osm_physp_t *p_physp,
   IN const boolean_t enforce)
@@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
   osm_madw_context_t context;
   uint8_t payload[IB_SMP_DATA_SIZE];
   ib_port_info_t *p_pi;
+  ib_api_status_t status;
 
   if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
-    return IB_ERROR;
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0507: "
+              "No port info for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
 
-  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
-    return IB_SUCCESS;
+  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "No need to update PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
   memset( payload, 0, IB_SMP_DATA_SIZE );
   memcpy( payload, p_pi, sizeof(ib_port_info_t) );
@@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
   context.pi_context.light_sweep = FALSE;
   context.pi_context.active_transition = FALSE;
 
-  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
-                      payload, sizeof(payload),
-                      IB_MAD_ATTR_PORT_INFO,
-                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
-                      CL_DISP_MSGID_NONE, &context );
+  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
+                                payload, sizeof(payload),
+                                IB_MAD_ATTR_PORT_INFO,
+                                cl_hton32( osm_physp_get_port_num( p_physp ) ),
+                                CL_DISP_MSGID_NONE, &context );
+  if (status != IB_SUCCESS)
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0520: "
+              "Failed to set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
+  else
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "Set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+      return TRUE;
+  }
 }
 
 /**********************************************************************
@@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
 
     status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, block_index );
     if (status == IB_SUCCESS)
-      ret_val = TRUE;
+     {
+         osm_log( p_log, OSM_LOG_DEBUG,
+                     "pkey_mgr_update_port: "
+                     "Updated "
+                     "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+                     block_index,
+                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+                     osm_physp_get_port_num( p_physp ) );
+         ret_val = TRUE;
+     }
     else
-      osm_log( p_log, OSM_LOG_ERROR,
-           "pkey_mgr_update_port: ERR 0506: "
-           "pkey_mgr_update_pkey_entry() failed to update "
-           "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
-           block_index,
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-           osm_physp_get_port_num( p_physp ) );
+     {
+         osm_log( p_log, OSM_LOG_ERROR,
+                     "pkey_mgr_update_port: ERR 0506: "
+                     "pkey_mgr_update_pkey_entry() failed to update "
+                     "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
+                     block_index,
+                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+                     osm_physp_get_port_num( p_physp ) );
+     }
   }
 
   return ret_val;
@@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
   uint16_t peer_max_blocks;
   ib_api_status_t status = IB_SUCCESS;
   boolean_t ret_val = FALSE;
+  boolean_t port_info_set = FALSE;
   ib_pkey_table_t empty_block;
-
+ 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
   p_physp = osm_port_get_default_phys_ptr( p_port );
@@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
     enforce = FALSE;
   }
 
-  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-         "pkey_mgr_update_peer_port: ERR 0507: "
-         "pkey_mgr_enforce_partition() failed to update "
-         "node 0x%016" PRIx64 " port %u\n",
-         cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-         osm_physp_get_port_num( peer ) );
-  }
+  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
+      port_info_set = TRUE;
 
   if (enforce == FALSE)
-    return FALSE;
+     return port_info_set;
 
   p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
   for (block_index = 0; block_index < p_pkey_tbl->used_blocks; block_index++)
@@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
              osm_physp_get_port_num( peer ) );
   }
 
+  if (port_info_set) return TRUE;
   return ret_val;
 }
 
@@ -541,10 +593,10 @@ osm_pkey_mgr_process(
       signal = OSM_SIGNAL_DONE_PENDING;
     p_node = osm_port_get_parent_node( p_port );
     if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
-     pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
+            pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
                     &p_osm->subn, p_port,
                     !p_osm->subn.opt.no_partition_enforcement ) )
-      signal = OSM_SIGNAL_DONE_PENDING;       
+      signal = OSM_SIGNAL_DONE_PENDING;
   }
 
  _err:
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index e977253..8cfe09e 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -885,6 +885,9 @@ osm_ucast_mgr_set_fwd_table(
   ib_switch_info_t si;
   uint32_t block_id_ho = 0;
   uint8_t block[IB_SMP_DATA_SIZE];
+  boolean_t set_swinfo_require = FALSE;
+  uint16_t lin_top;
+  uint8_t life_state;
 
   CL_ASSERT( p_mgr );
 
@@ -904,43 +907,59 @@ osm_ucast_mgr_set_fwd_table(
     Set the top of the unicast forwarding table.
   */
   si = *osm_switch_get_si_ptr( p_sw );
-  si.lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  lin_top = cl_hton16( osm_switch_get_max_lid_ho( p_sw ) );
+  if (si.lin_top != lin_top)
+  {
+     set_swinfo_require = TRUE;
+     si.lin_top  = lin_top;
+  }
 
   /* check to see if the change state bit is on. If it is - then we
      need to clear it. */
-   if( ib_switch_info_get_state_change( &si ) )
-    si.life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
-                      | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
+  if ( ib_switch_info_get_state_change( &si ) )
+     life_state = ( (p_mgr->p_subn->opt.packet_life_time <<3 )
+                    | ( si.life_state & IB_SWITCH_PSC ) )  & 0xfc;
   else
-    si.life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
+     life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
 
-  if( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+  if (life_state != si.life_state)
   {
-    osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
-             "osm_ucast_mgr_set_fwd_table: "
-             "Setting switch FT top to LID 0x%X\n",
-             osm_switch_get_max_lid_ho( p_sw ) );
+     set_swinfo_require = TRUE;
+     si.life_state = life_state;
   }
-
-  context.si_context.light_sweep = FALSE;
-  context.si_context.node_guid = osm_node_get_node_guid( p_node );
-  context.si_context.set_method = TRUE;
-
-  status = osm_req_set( p_mgr->p_req,
-                        p_path,
-                        (uint8_t*)&si,
-                        sizeof(si),
-                        IB_MAD_ATTR_SWITCH_INFO,
-                        0,
-                        CL_DISP_MSGID_NONE,
-                        &context );
-
-  if( status != IB_SUCCESS )
+ 
+  if ( set_swinfo_require )
   {
-    osm_log( p_mgr->p_log, OSM_LOG_ERROR,
-             "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
-             "Sending SwitchInfo attribute failed (%s)\n",
-             ib_get_err_str( status ) );
+     if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_DEBUG ) )
+     {
+        osm_log( p_mgr->p_log, OSM_LOG_DEBUG,
+                 "osm_ucast_mgr_set_fwd_table: "
+                 "Setting switch FT top to LID 0x%X\n",
+                 osm_switch_get_max_lid_ho( p_sw ) );
+     }
+    
+     context.si_context.light_sweep = FALSE;
+     context.si_context.node_guid = osm_node_get_node_guid( p_node );
+     context.si_context.set_method = TRUE;
+    
+     status = osm_req_set( p_mgr->p_req,
+                           p_path,
+                           (uint8_t*)&si,
+                           sizeof(si),
+                           IB_MAD_ATTR_SWITCH_INFO,
+                           0,
+                           CL_DISP_MSGID_NONE,
+                           &context );
+    
+     if( status != IB_SUCCESS )
+     {
+        osm_log( p_mgr->p_log, OSM_LOG_ERROR,
+                 "osm_ucast_mgr_set_fwd_table: ERR 3A06: "
+                 "Sending SwitchInfo attribute failed (%s)\n",
+                 ib_get_err_str( status ) );
+     }
+     else
+        p_mgr->any_change = TRUE;
   }
 
   /*
@@ -1215,13 +1234,14 @@ osm_ucast_mgr_process(
 
   CL_PLOCK_EXCL_ACQUIRE( p_mgr->p_lock );
 
+  p_mgr->any_change = FALSE;
+
   /*
     If there are no switches in the subnet, we are done.
   */
   if (cl_qmap_count( p_sw_guid_tbl ) == 0)
     goto Exit;
 
-  p_mgr->any_change = FALSE;
   cl_qmap_apply_func(p_sw_guid_tbl, __osm_ucast_mgr_clean_switch, NULL);
 
   if (!p_routing_eng->build_lid_matrices ||
@@ -1248,14 +1268,20 @@ osm_ucast_mgr_process(
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
     __osm_ucast_mgr_dump_tables( p_mgr );
 
-  if (p_mgr->any_change)
+  if (p_mgr->any_change)
+  {
      signal = OSM_SIGNAL_DONE_PENDING;
+      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+                 "osm_ucast_mgr_process: "
+                 "LFT Tables configured on all switches\n");
+  }
   else
+  {
+      osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
+                 "osm_ucast_mgr_process: "
+                 "No need to set any LFT Tables on all switches\n");
      signal = OSM_SIGNAL_DONE;
-
-  osm_log(p_mgr->p_log, OSM_LOG_VERBOSE,
-          "osm_ucast_mgr_process: "
-          "LFT Tables configured on all switches\n");
+  }
 
  Exit:
   CL_PLOCK_RELEASE( p_mgr->p_lock );
-- 
1.4.4.1.GIT


From eitan at mellanox.co.il  Mon Dec 18 13:43:39 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 23:43:39 +0200
Subject: [openib-general] [PATCH] osm: state manager ignores some
	outstanding transaction
Message-ID: <45870B8B.9010002@mellanox.co.il>

Hi Hal,

This is a resend, as I did not see the list echo back my previous
posting, which I sent using git-send-email (probably due to misuse).

The following patch fixes bugs in the state manager:
In both the light sweep and pkey assignment states, the state manager could
ignore outstanding SMPs (reported back by the managers) and continue to the
next stage. When these SMPs do complete, they cause failures in later steps,
which receive the NO_PENDING_TRANSACTIONS signal when it is not expected.

Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
---

 osm/opensm/osm_state_mgr.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
index 9eac038..94cc095 100644
--- a/osm/opensm/osm_state_mgr.c
+++ b/osm/opensm/osm_state_mgr.c
@@ -1853,6 +1853,7 @@ osm_state_mgr_process(
 {
    ib_api_status_t status;
    osm_remote_sm_t *p_remote_sm;
+   osm_signal_t tmp_signal;
 
    CL_ASSERT( p_mgr );
 
@@ -2075,11 +2076,10 @@ osm_state_mgr_process(
          case OSM_SIGNAL_CHANGE_DETECTED:
             /*
              * Nothing to do here.  One subnet change typcially
-             * begets another....
+             * begets another.... But needs to wait for all transactions
              */
             signal = OSM_SIGNAL_NONE;
             break;
-
          case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
             /*
              * A change was detected on the subnet.
@@ -2219,7 +2219,10 @@ osm_state_mgr_process(
             signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
 
             /* the returned signal is always DONE */
-            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
+
+            if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
+               signal = OSM_SIGNAL_DONE_PENDING;
 
             /* try to restore SA DB (this should be before lid_mgr
                because we may want to disable clients reregistration
-- 
1.4.4.1.GIT


From halr at voltaire.com  Mon Dec 18 13:43:41 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 16:43:41 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm/autogen.sh: error
	message fix
In-Reply-To: <20061218200706.GA12834@sashak.voltaire.com>
References: <20061218200706.GA12834@sashak.voltaire.com>
Message-ID: <1166478195.32666.203147.camel@hal.voltaire.com>

On Mon, 2006-12-18 at 15:07, Sasha Khapyorsky wrote:
> Trivial error message fixes in osm/autogen.sh
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Mon Dec 18 13:47:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 16:47:47 -0500
Subject: [openib-general] [PATCH] osm: state manager return wrong signal
In-Reply-To: <1166472919660-git-send-email-eitan@mellanox.co.il>
References: <1166472919660-git-send-email-eitan@mellanox.co.il>
Message-ID: <1166478410.32666.203255.camel@hal.voltaire.com>

On Mon, 2006-12-18 at 15:15, eitan at mellanox.co.il wrote:
> From: Eitan Zahavi <eitan at sw053.yok.mtl.com>

See comments below.

> diff --git a/osm/opensm/osm_state_mgr.c b/osm/opensm/osm_state_mgr.c
> index 9eac038..94cc095 100644
> --- a/osm/opensm/osm_state_mgr.c
> +++ b/osm/opensm/osm_state_mgr.c
> @@ -1853,6 +1853,7 @@ osm_state_mgr_process(
>  {
>     ib_api_status_t status;
>     osm_remote_sm_t *p_remote_sm;
> +   osm_signal_t tmp_signal;
>  
>     CL_ASSERT( p_mgr );
>  
> @@ -2075,11 +2076,10 @@ osm_state_mgr_process(
>           case OSM_SIGNAL_CHANGE_DETECTED:
>              /*
>               * Nothing to do here.  One subnet change typcially
> -             * begets another....
> +             * begets another.... But needs to wait for all transactions

This was already done as part of your original osm_state_mgr.c patch.

>               */
>              signal = OSM_SIGNAL_NONE;

This was eliminated as part of your original osm_state_mgr.c patch.
Should it be there ? If so, this isn't indicated as a +.

-- Hal

>              break;
> -
>           case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
>              /*
>               * A change was detected on the subnet.
> @@ -2219,7 +2219,10 @@ osm_state_mgr_process(
>              signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
>  
>              /* the returned signal is always DONE */
> -            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> +            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> +
> +            if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
> +               signal = OSM_SIGNAL_DONE_PENDING;
>  
>              /* try to restore SA DB (this should be before lid_mgr
>                 because we may want to disable clients reregistration


From eitan at mellanox.co.il  Mon Dec 18 13:55:58 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 18 Dec 2006 23:55:58 +0200
Subject: [openib-general] [PATCH] osm: state manager return wrong signal
Message-ID: <6C2C79E72C305246B504CBA17B5500C980BFED@mtlexch01.mtl.com>

Hi Hal,

The discrepancies are due to my lack of git practice. I do not know why
these lines got back in.
The following line should not be there:

> >               */
> >              signal = OSM_SIGNAL_NONE;
> 
> This was eliminated as part of your original osm_state_mgr.c patch.
> Should it be there ? If so, this isn't indicated as a +.
> 
> -- Hal
> 
> >              break;
> > -
> >           case OSM_SIGNAL_NO_PENDING_TRANSACTIONS:
> >              /*
> >               * A change was detected on the subnet.
> > @@ -2219,7 +2219,10 @@ osm_state_mgr_process(
> >              signal = osm_pkey_mgr_process( p_mgr->p_subn->p_osm );
> >
> >              /* the returned signal is always DONE */
> > -            signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> > +            tmp_signal = osm_qos_setup(p_mgr->p_subn->p_osm);
> > +
> > +            if (tmp_signal == OSM_SIGNAL_DONE_PENDING)
> > +               signal = OSM_SIGNAL_DONE_PENDING;
> >
> >              /* try to restore SA DB (this should be before lid_mgr
> >                 because we may want to disable clients reregistration


From halr at voltaire.com  Mon Dec 18 14:49:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 18 Dec 2006 17:49:47 -0500
Subject: [openib-general] [PATCH] osm: state manager ignores some
 outstanding transaction
In-Reply-To: <45870B8B.9010002@mellanox.co.il>
References: <45870B8B.9010002@mellanox.co.il>
Message-ID: <1166482096.32666.205256.camel@hal.voltaire.com>

Hi Eitan,

On Mon, 2006-12-18 at 16:43, Eitan Zahavi wrote:
> Hi Hal,
> 
> This is a resend as I did not see a bounce of the list of the previous 
> posting I did using git-send-email (probably due to a miss use).
> 
> The following patch fixes bugs in the state manager:
> Both in light sweep and pkey assignment states the state manager could ignore
> outstanding SMPs (reported back by the managers) and continue to next stage.
> When these SMPs do complete it causes failures of further steps which receives 
> the NO_PENDING_TRANSACTIONS signal when it is not expected. 
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

Thanks. Applied.

Due to the confusion, please double check the result.

-- Hal


From kliteyn at mellanox.co.il  Mon Dec 18 15:33:05 2006
From: kliteyn at mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 01:33:05 +0200
Subject: [openib-general] OSM: Using lid matrices in ucast manager
Message-ID: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>

Hi Hal.
 
I have a question about a patch that I want to send regarding lid matrix
usage in the osm ucast manager:

The FatTree routing doesn't use the min hop tables, so we can skip
building the lid matrices in OSM.
However, the ucast manager also uses these lid matrices to get the max
lid that is accessible from each switch, which defines the LFT table size.
This max lid is obtained by calling the osm_switch_get_max_lid_ho()
function, which in turn calls osm_lid_matrix_get_max_lid_ho() for the
switch's lid matrix.
If the lid matrices weren't built, then osm_switch_get_max_lid_ho() will
return 0xFFFF, and eventually osm will crash.

Of course, I don't want to build all the lid matrices just to know the
max lid, so here's what I've done:

*	I added a field to the osm_switch_t object: max_lid_ho (with default
	value 0xFFFF; should it be 0x0 instead?).
*	Added three osm_switch_t methods for this new field: getter, setter,
	and is_set, which returns true if this field has been set.
*	The original osm_switch_get_max_lid_ho() has been updated to return
	this field's value if it's set.
*	Then in FatTree routing I set this field for each switch (I get the
	max lid 'for free' as a byproduct of the algorithm).
*	Now everything in the ucast manager works fine, except for the
	following two dump functions:
        __osm_ucast_mgr_dump_ucast_routes (it uses hops)
        ucast_mgr_dump_lid_matrix (obviously...)
	These two functions check at the beginning whether max_lid_ho was set
	(using the 'is_set' method), and return without printing anything if
	the answer is yes.

This way any other routing engine that uses the lid matrix is not affected
by this change, and any routing engine that doesn't use the lid matrix has
a way to set the max lid per switch explicitly.
 
This approach works great, but I have a feeling that this is kind of a
hack...
 
What do you think about this solution?
Any other suggestions?
 
Anyway, just wanted to hear your opinion before sending the patch.
   
Regards,
 
Yevgeny Kliteynik
 
Mellanox Technologies LTD
Tel: +972-4-909-7200 ext: 394
Fax: +972-4-959-3245
P.O. Box 586 Yokneam 20692 ISRAEL 
 

From sashak at voltaire.com  Mon Dec 18 17:30:56 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 19 Dec 2006 03:30:56 +0200
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
Message-ID: <1166491856.29306.15.camel@localhost>

Hi Yevgeny,

On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
>  
> 
> I have a question about some patch that I want to send regarding lid
> matrices usage in osm ucast
> 
> manager:
> 
>  
> 
> The FatTree routing doesn't use the min hop tables, so we can skip the
> lid matrices building in OSM.

The lid matrices are used in mcast_mgr to generate multicast routes.

> However, ucast manager uses these lid matrices also to get the max lid
> that is accessible from each
> 
> switch, which defines the LTF table size.
> 
> This max lid is obtained by calling osm_switch_get_max_lid_ho()
> function, which in turn, calls 
> 
> osm_lid_matrix_get_max_lid_ho() for the switch's lid matrix.
> 
> If the lid matrices weren't built, then the
>  osm_switch_get_max_lid_ho() function will return 0xFFFF,
> 
> and eventually osm will crash.
> 
>  
> 
> Of course, I don't want to build all the lid matrices just to know the
> max lid, so here's what I've done:
> 
>  
> 
>       * I added a field to the osm_switch_t object: max_lid_ho (with
>         default value 0xFFFF, should it 
>         be 0x0 instead?).

Good thing. 0 is fine as default value IMHO.

>       * Added and three osm_switch_t methods for this new field:
>         getter, setter, and is_set that returns
>         true if this field has been set.

Why those methods? All you need is to access the structure field, with
'if (sw->max_lid_ho)' for the "is_set" checks.

>       * The original osm_switch_get_max_lid_ho() has been updated to
>         return this field value if it's set.
>       * Then in FatTree routing I set this field for each switch (I
>         get the max lid 'for free' as a byproduct
>         of the algorithm).
>       * Now everything in the ucast manager works fine, except for the
>         following two dump functions:
>                 __osm_ucast_mgr_dump_ucast_routes (it uses hops)
>                 ucast_mgr_dump_lid_matrix (obviously...)
>         These two functions check at the beginning whether the
>         max_lid_ho was set (using the 'is_set'
>         method), and return w/o printing anything if the answer is
>         yes.
> 
>  
> 
> This way any other routing engine that uses lid matrix is not affected
> by this change, and any routing 
> 
> engine that doesn't use the lid matrix has a way to set the max lid
> per switch explicitly.

Hope you are adding this for existing code.

> This approach works great, but I have a feeling that this is kinda
> hack...

Moving max_lid(_ho) to the switch structure looks like a good idea to me,
regardless of the lid matrix build elimination.

The only problem I can see with the lid matrices is mcast_mgr, which uses
them.

Sasha

> 
>  
> 
> What do you think about this solution?
> 
> Any other suggestions?
> 
>  
> 
> Anyway, just wanted to hear your opinion before sending the patch.
> 
>    
> 
> Regards,
> 
>  
> 
> Yevgeny Kliteynik
> 
>  
> 
> Mellanox Technologies LTD
> 
> Tel: +972-4-909-7200 ext: 394
> 
> Fax: +972-4-959-3245
> 
> P.O. Box 586 Yokneam 20692 ISRAEL 
> 
>  
> 
> 


From Ashish.Batwara at lsi.com  Mon Dec 18 18:55:43 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Mon, 18 Dec 2006 19:55:43 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>

Hi,
I am trying to run opensm on a Linux server. It has two HCAs (4 ports) and
is connected to an IB switch. The ibnodes command displays the information
about the switch ports and HCA ports.
When I start opensm, I see "Starting srp_daemon" in /var/log/messages
for all 4 ports, immediately followed by "failed srp_daemon" for
all the ports, and then it displays "SM Port is down".

I tried several times and even rebooted the server a few times, but no
luck.

Does anybody know what this problem is?

Thanks
Ashish


From eitan at sw053.yok.mtl.com  Mon Dec 18 21:23:57 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Tue, 19 Dec 2006 07:23:57 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-19:normal completion
Message-ID: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3
ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
Total=308 Pass=307 Fail=1

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
41 OsmStress IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo


From vlad at dev.mellanox.co.il  Mon Dec 18 23:42:38 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Tue, 19 Dec 2006 09:42:38 +0200
Subject: [openib-general] ofed backports update
In-Reply-To: <1166091556.926.17.camel@muscida>
References: <20061211144813.GA15870@mellanox.co.il>
	<1166091556.926.17.camel@muscida>
Message-ID: <458797EE.9050000@dev.mellanox.co.il>

Yosef Etigin wrote:
> On Mon, 2006-12-11 at 16:48 +0200, Michael S. Tsirkin wrote:
>   
>> Here's a small update on OFED 1.2 backports. This describes a change
>> I did a couple of weeks ago but never got to documenting.
>> NOTE: This info is relevant only for people developing OFED kernel code,
>> everything is transparent for others.
>>
>> NOTE: This is by *no means* a comprehensive writeup of OFED build process -
>> just a small update for people familiar with development in OFED 1.1.
>>
>> Background:
>> OFED 1.1 did all backports by applying patches under
>> kernel_patches/backports/<kernel version>/ directory.
>> To back-port a package, you just stuck a patch there
>> and once OFED detected an appropriate kernel, it was applied before build.
>> In many cases - where the kernel we are back-porting to was simply
>> missing some macro - what patch actually did was just add a file
>> under the include directory, and OFED build scripts knew to pick
>> these up before standard linux includes.
>> Managing these became somewhat of a pain as it is often hard to
>> see the history of a patch: try git diff on a patch that sits in git tree
>> and see what I mean.
>>
>> Update:
>> So for OFED 1.2 I've created a new directory kernel_addons, and converted
>> all patches that created new files to plain files under the relevant
>> kernel directory.  OFED scripts now look there for files before standard
>> Linux headers.
>> For an example, look at how backport to 2.6.18 looks:
>> http://staging.openfabrics.org/git/?p=~vlad/ofed_1_2/.git;a=tree;f=kernel_addons/backport/2.6.18/include/linux;h=5eabed1f98596f92ce149dae65c4ab1ceb1d6a67;hb=HEAD
>> Unfortunately, not all patches are of this form - some really tweak source
>> inside the infiniband subtree - but we can strive to reduce the number of this
>> and in this way make maintaining backports more of a seamless process.
>>
>> Bottom line
>> There are now 2 mechanisms for back-porting in OFED:
>> - if you want to add a kernel-specific file, stick it under
>>   kernel_addons/backport/<kernel-version>/.
>> - if you must change an existing file depending on kernel version, stick
>>   a patch in kernel_patches/backports/<kernel version>/.
>>
>>     
>
> I was running the ‘configure’ script under ofed root.
>
> In ofed 1.1, it is possible to run configure without flags to patch the
> sources, and then run it again with --without-patches and the desired
> flags.
>
> In ofed 1.2 (Vlad’s tree) this scenario causes compilation errors when
> running ‘make’ afterwards (on 2.6.9-34ELsmp and on 2.6.16.21-0.8, but NOT
> on 2.6.19).
>
> However, when I just ran configure on a fresh source, with all the
> desired flags, it worked just fine.
>
> It seems to happen because configure only patches the Makefiles of the
> selected components with the kernel_addons include path.
>
> Maybe it should patch all Makefiles, or copy the files to ./include?
>
>
> _______________________________________________________________
> Yosef Etigin, ib-host-stack
> Voltaire – The Grid Backbone
> www.voltaire.com
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   
I fixed the configure script. Please try again.

Regards,
Vladimir


From eitan at mellanox.co.il  Mon Dec 18 23:37:55 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 09:37:55 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-19:normal
 completion
In-Reply-To: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>
References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>
Message-ID: <458796D3.50709@mellanox.co.il>

Clarifications:

1. The OpenSM code run includes the last patches I have sent.
2. The single failure is due to a race in ibmgtsim. ibdiagnet waits 
forever for a response for a "bind" message.
    I suspect a deadlock between the "server" and the "node" but I am 
not sure.
3. The regression still does not run the osmtest tests due to the fact 
they are all failing.

EZ

Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3
> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
> Total=308 Pass=307 Fail=1
>
> Pass:
> 42 Stability IS1-16.topo
> 42 Pkey IS1-16.topo
> 42 Multicast IS1-16.topo
> 42 LidMgr IS1-16.topo
> 41 OsmStress IS1-16.topo
> 14 Stability IS3-loop.topo
> 14 Stability IS3-128.topo
> 14 Pkey IS3-128.topo
> 14 OsmStress IS3-128.topo
> 14 Multicast IS3-loop.topo
> 14 Multicast IS3-128.topo
> 14 LidMgr IS3-128.topo
>
> Failures:
> 1 OsmStress IS1-16.topo


From eitan at mellanox.co.il  Mon Dec 18 23:42:28 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 09:42:28 +0200
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
Message-ID: <458797E4.8010600@mellanox.co.il>

This is not an OpenSM issue.
Forwarded to the SRP people.

EZ
Batwara, Ashish wrote:
> Hi,
> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> connected to IB Switch. ibnodes command displays the information about
> the Switch ports and HCA ports.
> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> for all the 4 ports and immediately after I see "failed srp_daemon" for
> all the ports and the displays "SM Port is down".
>
> I tried several times and even rebooted the server few times but no
> luck.
>
> Does anybody know what this problem is?
>
> Thanks
> Ashish


From mst at mellanox.co.il  Tue Dec 19 00:20:35 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 10:20:35 +0200
Subject: [openib-general] [PATCH obvious] IB/verbs: fix 32-bit big endian
	platforms
Message-ID: <20061219082035.GA24028@mellanox.co.il>

ib_dma_alloc_coherent, introduced by 
commit 9b513090a3c5e4964f9ac09016c1586988abb3d5
is storing dma_handle through a pointer to u64.

This is broken on big-endian 32 bit platforms since the handle will land in
high-order bits of the qword.  And the compiler actually warns about passing
argument 3 of dma_alloc_coherent from incompatible pointer type.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3c2e105..4214908 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1637,9 +1637,14 @@ static inline void *ib_dma_alloc_coherent(struct ib_device *dev,
 					   u64 *dma_handle,
 					   gfp_t flag)
 {
+	dma_addr_t a;
+	void *ptr;
 	if (dev->dma_ops)
 		return dev->dma_ops->alloc_coherent(dev, size, dma_handle, flag);
-	return dma_alloc_coherent(dev->dma_device, size, dma_handle, flag);
+	ptr = dma_alloc_coherent(dev->dma_device, size, &a, flag);
+	if (ptr)
+		*dma_handle = a;
+	return ptr;
 }
 
 /**

-- 
MST
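
For readers following the patch above, here is a minimal user-space sketch (not kernel code) of why storing a 32-bit handle through a pointer to a 64-bit slot goes wrong on a big-endian machine. It assumes a 32-bit dma_addr_t and models the byte order explicitly, so it behaves the same on any host. The helper names (put_be32, get_be64, buggy_store_be, fixed_store) are made up for this illustration.

```c
#include <stdint.h>

/* Write v as 4 bytes in big-endian order, as a 32-bit BE CPU would. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Read 8 bytes in big-endian order as one 64-bit value. */
static uint64_t get_be64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Buggy path: the callee writes a 32-bit handle at the start of the
 * caller's 8-byte slot, and the caller then reads all 8 bytes. */
uint64_t buggy_store_be(uint32_t handle)
{
	unsigned char slot[8] = { 0 };	/* the caller's u64, zeroed */

	put_be32(slot, handle);		/* lands in bytes 0..3 */
	return get_be64(slot);		/* high-order half holds the handle */
}

/* Fixed path, as in the patch: store into a correctly sized temporary,
 * then widen by value. */
uint64_t fixed_store(uint32_t handle)
{
	uint32_t a = handle;		/* stands in for "dma_addr_t a" */
	uint64_t dma_handle = a;	/* plain integer conversion */

	return dma_handle;
}
```

With a handle of 0x12345678, buggy_store_be() returns 0x1234567800000000 while fixed_store() returns 0x12345678.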


From mst at mellanox.co.il  Tue Dec 19 00:33:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 10:33:25 +0200
Subject: [openib-general] [PATCH] IB/verbs: include linux/kref.h explicitly
Message-ID: <20061219083325.GA24952@mellanox.co.il>

ib_verbs.h uses struct kref so it should include linux/kref.h directly.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3c2e105..69a0a11 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -45,6 +45,7 @@
 #include <linux/device.h>
 #include <linux/mm.h>
 #include <linux/dma-mapping.h>
+#include <linux/kref.h>
 
 #include <asm/atomic.h>
 #include <asm/scatterlist.h>

-- 
MST


From mst at mellanox.co.il  Tue Dec 19 00:35:58 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 10:35:58 +0200
Subject: [openib-general] [PATCH] IB/mthca: fix FMR breakage introduced by
 kmemdup cleanup
Message-ID: <20061219083558.GA25036@mellanox.co.il>

This reverts mthca breakage introduced by commit
bed8bdfddd851657cf9e5fd16bb44abb02ae7f42 :
kmemdup cannot be used here since the lengths passed to kmalloc/memcpy
are not the same.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 7ec7c4b..7b96751 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -1100,10 +1100,11 @@ static struct ib_fmr *mthca_alloc_fmr(struct ib_pd *pd, int mr_access_flags,
 	struct mthca_fmr *fmr;
 	int err;
 
-	fmr = kmemdup(fmr_attr, sizeof *fmr, GFP_KERNEL);
+	fmr = kmalloc(sizeof *fmr, GFP_KERNEL);
 	if (!fmr)
 		return ERR_PTR(-ENOMEM);
 
+	memcpy(&fmr->attr, fmr_attr, sizeof *fmr_attr);
 	err = mthca_fmr_alloc(to_mdev(pd->device), to_mpd(pd)->pd_num,
 			     convert_access(mr_access_flags), fmr);
 
-- 
MST
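
As an illustration of the point in the patch description, here is a rough user-space sketch with malloc standing in for kmalloc. The types demo_attr and demo_fmr and the function demo_alloc_fmr are hypothetical and do not reproduce the real mthca structures; the only property carried over from the patch is that the allocated object is larger than the attributes copied into it, and that the copy targets a member rather than offset 0.

```c
#include <stdlib.h>
#include <string.h>

struct demo_attr {		/* stands in for the caller's attributes */
	int max_pages;
	int max_maps;
};

struct demo_fmr {		/* larger object that embeds the attributes */
	long header;
	struct demo_attr attr;
	long tail;
};

/*
 * A kmemdup-style "allocate len bytes and copy len bytes" call with
 * len = sizeof(struct demo_fmr) would read past the end of the smaller
 * attr object and would place the bytes at offset 0 instead of in the
 * attr member.  So the allocation and the copy use different lengths.
 */
struct demo_fmr *demo_alloc_fmr(const struct demo_attr *attr)
{
	struct demo_fmr *fmr = malloc(sizeof *fmr);

	if (!fmr)
		return NULL;

	/* Copy only the attributes, and only into the attr member. */
	memcpy(&fmr->attr, attr, sizeof *attr);
	return fmr;
}
```

The caller owns the returned object and releases it with free(); the other members are left for later initialisation, as in the patched function.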


From mst at mellanox.co.il  Tue Dec 19 00:52:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 10:52:36 +0200
Subject: [openib-general] [PATCH] IB/mthca: fix FMR breakage introduced
 by kmemdup cleanup
In-Reply-To: <20061219083558.GA25036@mellanox.co.il>
References: <20061219083558.GA25036@mellanox.co.il>
Message-ID: <20061219085236.GD25243@mellanox.co.il>

> This reverts mthca breakage introduced by commit
> bed8bdfddd851657cf9e5fd16bb44abb02ae7f42 :
> kmemdup cannot be used here since the lengths passed to kmalloc/memcpy
> are not the same.
> 
> Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

This was reported by Dotan Barak <dotanb at mellanox.co.il>

-- 
MST


From ogerlitz at voltaire.com  Tue Dec 19 01:14:41 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 11:14:41 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
Message-ID: <4587AD81.2010703@voltaire.com>

Bernadat, Philippe wrote:
> So after a bit more testing, setting the route path mtu to 1024 before
> the qp creation (rdma_create_qp()) seems sufficient.

sure, rdma_create_qp is called on the create_conn flow which is executed 
after getting RDMA_CM_EVENT_ROUTE_RESOLVED as i suggested...

OK, so where are we now? What is the current bw matrix (voltaire/ofed 
fmr/no-fmr)?

Or.


From philippe_bernadat at hp.com  Tue Dec 19 01:58:38 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Tue, 19 Dec 2006 10:58:38 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <4587AD81.2010703@voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>

 
Hi Or,

I didn't have time to re-run the non-FMR cases.
For FMR, VIB and OFED are comparable.
But I will re-run the tests for all cases.

Right now I am fighting with ib_query_device() that crashes the kernel !
Trying to use this to test the HCA type.

Philippe

> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> Sent: Tuesday, December 19, 2006 10:15 AM
> To: Bernadat, Philippe
> Cc: Roland Dreier; openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> Bernadat, Philippe wrote:
> > So after a bit more testing, setting the route path mtu to 
> 1024 before
> > the qp creation (rdma_create_qp()) seems sufficient.
> 
> sure, rdma_create_qp is called on the create_conn flow which 
> is executed 
> after getting RDMA_CM_EVENT_ROUTE_RESOLVED as i suggested...
> 
> OK, so where we are now, what is the current bw matrix (voltaire/ofed 
> fmr/no-fmr)?
> 
> Or.
> 
> 


From tziporet at dev.mellanox.co.il  Tue Dec 19 02:04:53 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 12:04:53 +0200
Subject: [openib-general] SRP problem: srp_daemon failure (was: opensm)
In-Reply-To: <458797E4.8010600@mellanox.co.il>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
	<458797E4.8010600@mellanox.co.il>
Message-ID: <4587B945.6060700@dev.mellanox.co.il>

Eitan Zahavi wrote:
> This is not an OpenSM issue.
> Forwarded to the SRP people.
>
> EZ
> Batwara, Ashish wrote:
>   
>> Hi,
>> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
>> connected to IB Switch. ibnodes command displays the information about
>> the Switch ports and HCA ports.
>> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
>> for all the 4 ports and immediately after I see "failed srp_daemon" for
>> all the ports and the displays "SM Port is down".
>>
>> I tried several times and even rebooted the server few times but no
>> luck.
>>
>> Does anybody know what this problem is?
>>
>> Thanks
>> Ashish
>>     
>
Changed the subject for SRP people to be aware of the problem.

Tziporet


From tziporet at dev.mellanox.co.il  Tue Dec 19 02:19:44 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 12:19:44 +0200
Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa
 unload
In-Reply-To: <1166458881.9289.17.camel@muscida>
References: <1166458881.9289.17.camel@muscida>
Message-ID: <4587BCC0.1020104@dev.mellanox.co.il>

Yosef Etigin wrote:
> This is a fix to Sean's multicast patches for ofed 1.2.
>
> The issue is described in: 
> http://www.mail-archive.com/openib-general at openib.org/msg27097.html
>
> The Oops happened because the multicast work handler was called
> after the multicast device structure was released. It happened because
> the multicast cleanup function 'mcast_remove_one' didn't wait for
> work queue completion on all ports before releasing the device, but 
> only N-1 ports.
>
> The patch applies after Sean's multicast patch series.
>
>   
Hi Yosef,
Very good that you found this bug.
Since Sean is on vacation, can you create a patch of the multicast module 
against the new code base of OFED (kernel 2.6.20-rc1)?

Thanks,
Tziporet


From ogerlitz at voltaire.com  Tue Dec 19 02:44:13 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 12:44:13 +0200
Subject: [openib-general] [PATCH] ib_sa: Fix kernel Oops caused by ib_sa
 unload
In-Reply-To: <4587BCC0.1020104@dev.mellanox.co.il>
References: <1166458881.9289.17.camel@muscida>
	<4587BCC0.1020104@dev.mellanox.co.il>
Message-ID: <4587C27D.50902@voltaire.com>

Tziporet Koren wrote:
> Yosef Etigin wrote:
>> This is a fix to Sean's multicast patches for ofed 1.2.
>>
>> The issue is described in: 
>> http://www.mail-archive.com/openib-general at openib.org/msg27097.html
>>
>> The Oops happened because the multicast work handler was called
>> after the multicast device structure was released. It happened because
>> the multicast cleanup function 'mcast_remove_one' didn't wait for
>> work queue completion on all ports before releasing the device, but 
>> only N-1 ports.
>>
>> The patch applies after Sean's multicast patch series.
>>
>>   
> Hi Yosef,
> Very good that you found this bug.
> Since Sean is on vacation, can you create a patch of the multicast module 
> against the new code base of OFED (kernel 2.6.20-rc1)?

I don't think this is possible since, as Michael has said, Sean has to 
rebase the multicast patches on top of 2.6.20-rc1 (the v2 patch series 
was based on 2.6.19, and v3, the one that was merged, is without them).

Or.


From ogerlitz at voltaire.com  Tue Dec 19 03:03:36 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 13:03:36 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>
Message-ID: <4587C708.2010700@voltaire.com>

Bernadat, Philippe wrote:
> I didn't have time to re-run the non-FMR cases.
> For FMR, VIB and OFED are comparable.
> But I will re-run the tests for all cases.

my main concern is FMR/no-FMR for OFED, FMR should be at least as good as 
no-FMR and if this is not the case, let's look into that.

> Right now I am fighting with ib_query_device() that crashes the kernel !
> Trying to use this to test the HCA type.

so your approach is:

	if (the-active-side-is-mlx-tavor)
		then set-path-mtu-to-1024

then this is a bug, since only the **active** side sets the path mtu, 
while the passive side (SFS ...) might be mlx-tavor and the active side 
can be something else, and the tavor mtu bug will hit you.

I am thinking about what the correct way to approach the problem is at the 
cma level; there will probably be some discussion here.

having said all the above - ib_query_device must not crash the kernel!
if there is some issue, please make sure to report it here.

Or.
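
To make the objection above concrete, here is a small sketch that contrasts a check on the active side only with a check on both ends. It assumes, purely for illustration, that the HCA type of both sides is known to the caller, which may not be the case in practice for the remote side. The enum and both function names are hypothetical and are not part of the cma code.

```c
enum demo_hca { DEMO_HCA_OTHER, DEMO_HCA_TAVOR };

/* The approach described above: only the active side's HCA is inspected. */
int active_only_wants_1k(enum demo_hca active, enum demo_hca passive)
{
	(void)passive;	/* never looked at, which is the problem */
	return active == DEMO_HCA_TAVOR;
}

/* What the workaround would need: limit the path MTU to 1024 if either
 * end of the connection is a Tavor. */
int either_side_wants_1k(enum demo_hca active, enum demo_hca passive)
{
	return active == DEMO_HCA_TAVOR || passive == DEMO_HCA_TAVOR;
}
```

For a non-Tavor active side talking to a Tavor passive side, the first check leaves the MTU alone while the second one limits it.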


From eitan at mellanox.co.il  Tue Dec 19 03:17:42 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 13:17:42 +0200
Subject: [openib-general] opensm
In-Reply-To: <458797E4.8010600@mellanox.co.il>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
	<458797E4.8010600@mellanox.co.il>
Message-ID: <4587CA56.9080906@mellanox.co.il>

Hi Ashish,

SRP people say they have no such error message.
OpenSM does. So I take it back.

Ashish,
Please provide more info:

1. ibv_devinfo
2. Version of code you are using
3. Command line you use for starting opensm
4. /var/log/osm.log

Thanks and sorry for the confusion.

EZ

Eitan Zahavi wrote:
> This is not an OpenSM issue.
> Forwarded to the SRP people.
>
> EZ
> Batwara, Ashish wrote:
>   
>> Hi,
>> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
>> connected to IB Switch. ibnodes command displays the information about
>> the Switch ports and HCA ports.
>> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
>> for all the 4 ports and immediately after I see "failed srp_daemon" for
>> all the ports and the displays "SM Port is down".
>>
>> I tried several times and even rebooted the server few times but no
>> luck.
>>
>> Does anybody know what this problem is?
>>
>> Thanks
>> Ashish


From halr at voltaire.com  Tue Dec 19 03:59:48 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 06:59:48 -0500
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
Message-ID: <1166529491.32666.241847.camel@hal.voltaire.com>

Hi Yevgeny,

On Mon, 2006-12-18 at 18:33, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> I have a question about a patch that I want to send regarding lid
> matrices usage in the osm ucast manager:
> 
> The FatTree routing doesn’t use the min hop tables, so we can skip the
> lid matrices building in OSM.
> However, the ucast manager uses these lid matrices also to get the max
> lid that is accessible from each switch, which defines the LFT size.
> This max lid is obtained by calling the osm_switch_get_max_lid_ho()
> function, which in turn calls osm_lid_matrix_get_max_lid_ho() for the
> switch’s lid matrix.
> If the lid matrices weren’t built, then the osm_switch_get_max_lid_ho()
> function will return 0xFFFF, and eventually osm will crash.
> 
> Of course, I don’t want to build all the lid matrices just to know the
> max lid, so here’s what I’ve done:
> 
>       * I added a field to the osm_switch_t object: max_lid_ho (with
>         default value 0xFFFF, should it 
>         be 0x0 instead?).

0 seems better to me but I'm not sure what else this impacts.

Note also there are other 0xffff initializations similar to this which
IMO are also candidates for change :-(

>       * Added three osm_switch_t methods for this new field:
>         getter, setter, and is_set that returns
>         true if this field has been set.

Is is_set really needed ?

>       * The original osm_switch_get_max_lid_ho() has been updated to
>         return this field value if it’s set.
>       * Then in FatTree routing I set this field for each switch (I
>         get the max lid ‘for free’ as a byproduct
>         of the algorithm).
>       * Now everything in the ucast manager works fine, except for the
>         following two dump functions:
>                 __osm_ucast_mgr_dump_ucast_routes (it uses hops)
>                 ucast_mgr_dump_lid_matrix (obviously…)
>         These two functions check at the beginning whether the
>         max_lid_ho was set (using the ‘is_set’
>         method), and return w/o printing anything if the answer is
>         yes.

Perhaps a dump routine is a routine which each routing protocol should
supply ?

-- Hal

> This way any other routing engine that uses lid matrix is not affected
> by this change, and any routing 
> 
> engine that doesn’t use the lid matrix has a way to set the max lid
> per switch explicitly.
> 
>  
> 
> This approach works great, but I have a feeling that this is kinda
> hack…
> 
>  
> 
> What do you think about this solution?
> 
> Any other suggestions?
> 
>  
> 
> Anyway, just wanted to hear your opinion before sending the patch.
> 
>    
> 
> Regards,
> 
>  
> 
> Yevgeny Kliteynik
> 
>  
> 
> Mellanox Technologies LTD
> 
> Tel: +972-4-909-7200 ext: 394
> 
> Fax: +972-4-959-3245
> 
> P.O. Box 586 Yokneam 20692 ISRAEL 
> 
>  
> 
> 
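
A rough sketch of the proposal discussed in this message, using 0 as the "not set" default as suggested in the reply, so that a plain field test can take the place of a separate is_set method. The struct and function names are hypothetical and are not the OpenSM definitions; matrix_max_lid_ho merely stands in for the value that the lid matrix lookup would return.

```c
#include <stdint.h>

struct demo_switch {
	uint16_t max_lid_ho;		/* 0 means "not set by the routing engine" */
	uint16_t matrix_max_lid_ho;	/* stands in for the lid matrix lookup */
};

/* Called by a routing engine that knows the max lid without lid matrices. */
void demo_switch_set_max_lid_ho(struct demo_switch *sw, uint16_t lid)
{
	sw->max_lid_ho = lid;
}

/* Prefer the explicitly set value; otherwise fall back to the lid matrix. */
uint16_t demo_switch_get_max_lid_ho(const struct demo_switch *sw)
{
	if (sw->max_lid_ho)
		return sw->max_lid_ho;
	return sw->matrix_max_lid_ho;
}
```

Routing engines that build lid matrices never call the setter and keep getting the matrix value; an engine that skips the matrices sets the field once per switch.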


From halr at voltaire.com  Tue Dec 19 04:05:56 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 07:05:56 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
Message-ID: <1166529940.32666.242119.camel@hal.voltaire.com>

Hi Ashish,

On Mon, 2006-12-18 at 21:55, Batwara, Ashish wrote:
> Hi,
> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> connected to IB Switch. ibnodes command displays the information about
> the Switch ports and HCA ports.
> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> for all the 4 ports and immediately after I see "failed srp_daemon" for
> all the ports and the displays "SM Port is down".

"SM Port down" means there is no physical link between the SM port and
its peer. Can you investigate and fix this ?

-- Hal

> I tried several times and even rebooted the server few times but no
> luck.
> 
> Does anybody know what this problem is?
> 
> Thanks
> Ashish


From mst at mellanox.co.il  Tue Dec 19 04:10:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 14:10:42 +0200
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <4587C708.2010700@voltaire.com>
References: <3F3894AC7A13B04E83CEBC95CFD3047E05571759@idaexc03.emea.cpqcorp.net>
	<4587C708.2010700@voltaire.com>
Message-ID: <20061219121042.GB30743@mellanox.co.il>

> I am thinking what is the correct way to approach the problem, at the 
> cma level, there will be probably some discussion here.

I guess the right thing to do for now would be to fix the cma tavor quirk patch.
But the real solution is in the SA - tricks in cma are just a partial work-around.

-- 
MST


From halr at voltaire.com  Tue Dec 19 04:09:17 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 07:09:17 -0500
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <1166491856.29306.15.camel@localhost>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
	<1166491856.29306.15.camel@localhost>
Message-ID: <1166530077.32666.242175.camel@hal.voltaire.com>

Hi Yevgeny & Sasha,

On Mon, 2006-12-18 at 20:30, Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote:
> > Hi Hal.
> > 
> >  
> > 
> > I have a question about some patch that I want to send regarding lid
> > matrices usage in osm ucast
> > 
> > manager:
> > 
> >  
> > 
> > The FatTree routing doesn’t use the min hop tables, so we can skip the
> > lid matrices building in OSM.
> 
> The lid matrices are used in mcast_mgr for multicast routes generation.

Good point but fat tree seems to work for multicast (at least in my
subnet). How could that be ?

-- Hal

> > However, the ucast manager uses these lid matrices also to get the max lid
> > that is accessible from each switch, which defines the LFT size.
> > 
> > This max lid is obtained by calling osm_switch_get_max_lid_ho()
> > function, which in turn, calls 
> > 
> > osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix.
> > 
> > If the lid matrices weren’t built, then the
> >  osm_switch_get_max_lid_ho() function will return 0xFFFF,
> > 
> > and eventually osm will crash.
> > 
> >  
> > 
> > Of course, I don’t want to build all the lid matrices just to know the
> > max lid, so here’s what I’ve done:
> > 
> >  
> > 
> >       * I added a field to the osm_switch_t object: max_lid_ho (with
> >         default value 0xFFFF, should it 
> >         be 0x0 instead?).
> 
> Good thing. 0 is fine as default value IMHO.
> 
> >       * Added three osm_switch_t methods for this new field:
> >         getter, setter, and is_set that returns
> >         true if this field has been set.
> 
> Why those methods? Everything you need is to access structure field and
> 'if (sw->max_lid_ho)' for "is_set" checks.
> 
> >       * The original osm_switch_get_max_lid_ho() has been updated to
> >         return this field value if it’s set.
> >       * Then in FatTree routing I set this field for each switch (I
> >         get the max lid ‘for free’ as a byproduct
> >         of the algorithm).
> >       * Now everything in the ucast manager works fine, except for the
> >         following two dump functions:
> >                 __osm_ucast_mgr_dump_ucast_routes (it uses hops)
> >                 ucast_mgr_dump_lid_matrix (obviously…)
> >         These two functions check at the beginning whether the
> >         max_lid_ho was set (using the ‘is_set’
> >         method), and return w/o printing anything if the answer is
> >         yes.
> > 
> >  
> > 
> > This way any other routing engine that uses lid matrix is not affected
> > by this change, and any routing 
> > 
> > engine that doesn’t use the lid matrix has a way to set the max lid
> > per switch explicitly.
> 
> Hope you are adding this for existing code.
> 
> > This approach works great, but I have a feeling that this is kinda
> > hack…
> 
> Moving max_lid(_ho) to the switch structure looks like a good idea to me
> regardless of lid matrix build elimination.
> 
> The only problem I can see with lid matrices is mcast_mgr which uses
> this.
> 
> Sasha
> 
> > 
> >  
> > 
> > What do you think about this solution?
> > 
> > Any other suggestions?
> > 
> >  
> > 
> > Anyway, just wanted to hear your opinion before sending the patch.
> > 
> >    
> > 
> > Regards,
> > 
> >  
> > 
> > Yevgeny Kliteynik
> > 
> >  
> > 
> > Mellanox Technologies LTD
> > 
> > Tel: +972-4-909-7200 ext: 394
> > 
> > Fax: +972-4-959-3245
> > 
> > P.O. Box 586 Yokneam 20692 ISRAEL 
> > 
> >  
> > 
> > 


From halr at voltaire.com  Tue Dec 19 04:12:24 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 07:12:24 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-19:normal
 completion
In-Reply-To: <458796D3.50709@mellanox.co.il>
References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>
	<458796D3.50709@mellanox.co.il>
Message-ID: <1166530303.32666.242284.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 02:37, Eitan Zahavi wrote:
> Clarifications:
> 
> 1. The OpenSM code run includes the last patches I have sent.
> 2. The single failure is due to a race in ibmgtsim. ibdiagnet waits 
> forever for a response for a "bind" message.
>     I suspect a deadlock between the "server" and the "node" but I am 
> not sure.
> 3. The regression still does not run the osmtest tests due to the fact 
> they are all failing.

Is this due to the one issue with InformInfo ?

-- Hal

> EZ
> 
> Eitan Zahavi wrote:
> > OSM Simulation Regression Summary
> > OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3
> > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
> > Total=308 Pass=307 Fail=1
> >
> > Pass:
> > 42 Stability IS1-16.topo
> > 42 Pkey IS1-16.topo
> > 42 Multicast IS1-16.topo
> > 42 LidMgr IS1-16.topo
> > 41 OsmStress IS1-16.topo
> > 14 Stability IS3-loop.topo
> > 14 Stability IS3-128.topo
> > 14 Pkey IS3-128.topo
> > 14 OsmStress IS3-128.topo
> > 14 Multicast IS3-loop.topo
> > 14 Multicast IS3-128.topo
> > 14 LidMgr IS3-128.topo
> >
> > Failures:
> > 1 OsmStress IS1-16.topo


From halr at voltaire.com  Tue Dec 19 04:21:21 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 07:21:21 -0500
Subject: [openib-general] SRP problem: srp_daemon failure (was: opensm)
In-Reply-To: <4587B945.6060700@dev.mellanox.co.il>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159AA5@NAMAIL2.ad.lsil.com>
	<458797E4.8010600@mellanox.co.il> <4587B945.6060700@dev.mellanox.co.il>
Message-ID: <1166530801.32666.242659.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 05:04, Tziporet Koren wrote:
> Eitan Zahavi wrote:
> > This is not an OpenSM issue.
> > Forwarded to the SRP people.
> >
> > EZ
> > Batwara, Ashish wrote:
> >   
> >> Hi,
> >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> >> connected to IB Switch. ibnodes command displays the information about
> >> the Switch ports and HCA ports.
> >> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> >> for all the 4 ports and immediately after I see "failed srp_daemon" for
> >> all the ports and the displays "SM Port is down".
> >>
> >> I tried several times and even rebooted the server few times but no
> >> luck.
> >>
> >> Does anybody know what this problem is?
> >>
> >> Thanks
> >> Ashish
> >>     
> >
> Changed the subject for SRP people to be aware of the problem.

Not the first level issue but shouldn't the srp_daemon be able to come
up without the SM or without the SM port up ?

-- Hal

> Tziporet


From mst at mellanox.co.il  Tue Dec 19 04:24:53 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 14:24:53 +0200
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net>
	<3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
Message-ID: <20061219122453.GC30743@mellanox.co.il>

> So after a bit more testing, setting the route path mtu to 1024 before
> the qp creation (rdma_create_qp()) seems sufficient.

OK, so the following fixes the tavor_quirk flag in cma to actually do something.
Could you please replace the patch cma_tavor_quirk.patch with this one,
set tavor_quirk option for cma module, and see if this works as expected?

Unpack OFED 1.1, copy the following to
OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
removing the patch by the same name that is in OFED
(also remove xxx_cma_tavor_quirk.txt or other patches if you put them there)
and then pack OFED 1.1 and rebuild.


Thanks,

-----------------

Tavor systems get better performance with 1K MTU. Since there does
not seem to be any way to find out whether the remote system uses Tavor,
add an option to limit the MTU globally.

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 50150c8..261bf45 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
+static int tavor_quirk = 0;
+module_param_named(tavor_quirk, tavor_quirk, int, 0644);
+MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: limit MTU to 1K if > 0");
+
 #define CMA_CM_RESPONSE_TIMEOUT 20
 #define CMA_MAX_CM_RETRIES 3
 
@@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms,
 {
 	struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
 	struct ib_sa_path_rec path_rec;
+	ib_sa_comp_mask mask;
 
 	memset(&path_rec, 0, sizeof path_rec);
 	ib_addr_get_sgid(addr, &path_rec.sgid);
@@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms,
 	path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
 	path_rec.numb_path = 1;
 
+	if (tavor_quirk) {
+		path_rec.mtu_selector = IB_SA_LT;
+		path_rec.mtu = IB_MTU_2048;
+		mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
+	} else
+		mask = 0;
+
 	id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
-				id_priv->id.port_num, &path_rec,
+				id_priv->id.port_num, &path_rec, mask |
 				IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
 				IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH,
 				timeout_ms, GFP_KERNEL,

-- 
MST


From ogerlitz at voltaire.com  Tue Dec 19 04:37:31 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 14:37:31 +0200
Subject: [openib-general] tavor quirks etc
Message-ID: <4587DD0B.1030403@voltaire.com>

Basically, I think we should go with the simple approach of having
**one** quirk in the rdma cm kernel code saying:

	if (tavor_quirk)
		route->path_rec->mtu = IB_MTU_1024;

so users would have to set the quirk to true in the presence of a Tavor
HCA on either the active or the passive side.

This patch should also go upstream.

The problems i see with the current approach are:

1) there are three patches

2) of them, the cma-tavor-quirk is broken (see *** below) in its design
since it assumes the opensm-tavor-quirk and it would not work with 
opensm that does not have it nor with 3rd party/commercial SMs which do 
not have similar quirk

3) the ipoib-selector patch (below) in a way assumes the opensm quirk
and hence it was not pushed upstream, and vice versa: upstream ipoib
code is broken with an opensm running with the quirk!

(***) per 15.2.5.16 PATHRECORD, you should get from the SM "less
than MTU specified" in case it has such path.

Now, what does it mean that "it has such path"? Looking in the opensm
code @ opensm/osm_sa_path_record.c :: __osm_pr_rcv_get_path_parms

you can see that when the tavor quirk patch is ***not*** applied the SM
scans the path and for each port compares the port mtu to the requested
mtu, such that at the end of the scan the path mtu is the minimal mtu
reported along the path. It then applies this code:

> if ( ( comp_mask & IB_PR_COMPMASK_MTUSELEC ) &&
>        ( comp_mask & IB_PR_COMPMASK_MTU ) )
>   {
>     required_mtu = ib_path_rec_mtu( p_pr );
>     switch( ib_path_rec_mtu_sel( p_pr ) )
>     {
>     case 0:    /* must be greater than */
>       if( mtu <= required_mtu )
>         status = IB_NOT_FOUND;
>       break;
> 
>     case 1:    /* must be less than */
>       if( mtu >= required_mtu )
>         status = IB_NOT_FOUND;
>       break;

XXX - the cma_tavor_quirk is broken without the opensm-tavor-quirk

> 
>     case 2:    /* exact match */
>       if( mtu != required_mtu )
>         status = IB_NOT_FOUND;
>       break;
> 
>     case 3:    /* largest available */
>       /* can't be disqualified by this one */
>       break;

this is the ipoib-selector patch

> Index: ofed_1_1/drivers/infiniband/ulp/ipoib/ipoib_main.c
> ===================================================================
> --- ofed_1_1.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ ofed_1_1/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -182,6 +182,8 @@ static int ipoib_change_mtu(struct net_d
> 
>         dev->mtu = min(priv->mcast_mtu, priv->admin_mtu);
> 
> +       queue_work(ipoib_workqueue, &priv->flush_task);
> +
>         return 0;
>  }
> 
> @@ -452,15 +454,39 @@ static int path_rec_start(struct net_dev
>                           struct ipoib_path *path)
>  {
>         struct ipoib_dev_priv *priv = netdev_priv(dev);
> +       ib_sa_comp_mask comp_mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
> +
> +       path->pathrec.mtu_selector = IB_SA_GT;
> 
> -       ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT "\n",
> -                 IPOIB_GID_ARG(path->pathrec.dgid));
> +       switch (roundup_pow_of_two(dev->mtu + IPOIB_ENCAP_LEN)) {
> +       case 512:
> +               path->pathrec.mtu = IB_MTU_256;
> +               break;
> +       case 1024:
> +               path->pathrec.mtu = IB_MTU_512;
> +               break;
> +       case 2048:
> +               path->pathrec.mtu = IB_MTU_1024;
> +               break;
> +       case 4096:
> +               path->pathrec.mtu = IB_MTU_2048;
> +               break;
> +       default:
> +               /* Wildcard everything */
> +               comp_mask = 0;
> +               path->pathrec.mtu = 0;
> +               path->pathrec.mtu_selector = 0;
> +       }
> +       ipoib_dbg(priv, "Start path record lookup for " IPOIB_GID_FMT " MTU > %d\n",
> +                 IPOIB_GID_ARG(path->pathrec.dgid),
> +                 comp_mask ? ib_mtu_enum_to_int(path->pathrec.mtu) : 0);
> 
>         init_completion(&path->done);
> 
>         path->query_id =
>                 ib_sa_path_rec_get(priv->ca, priv->port,
> -                                  &path->pathrec,
> +                                  &path->pathrec, comp_mask    |
>                                    IB_SA_PATH_REC_DGID          |
>                                    IB_SA_PATH_REC_SGID          |
>                                    IB_SA_PATH_REC_NUMB_PATH     |


From kliteyn at dev.mellanox.co.il  Tue Dec 19 04:43:54 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 14:43:54 +0200
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <1166529491.32666.241847.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
	<1166529491.32666.241847.camel@hal.voltaire.com>
Message-ID: <4587DE8A.1090207@dev.mellanox.co.il>

Hi Hal & Sasha.

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Mon, 2006-12-18 at 18:33, Yevgeny Kliteynik wrote:
>> Hi Hal.
>>
>>  
>>
>> I have a question about some patch that I want to send regarding lid
>> matrices usage in osm ucast
>>
>> manager:
>>
>>  
>>
>> The FatTree routing doesn’t use the min hop tables, so we can skip the
>> lid matrices building in OSM.
>>
>> However, ucast manager uses these lid matrices also to get the max lid
>> that is accessible from each
>>
>> switch, which defines the LFT size.
>>
>> This max lid is obtained by calling osm_switch_get_max_lid_ho()
>> function, which in turn, calls 
>>
>> osm_lid_matrix_get_max_lid_ho() for the switch’s lid matrix.
>>
>> If the lid matrices weren’t built, then the
>> osm_switch_get_max_lid_ho() function will return 0xFFFF,
>>
>> and eventually osm will crash.
>>
>>  
>>
>> Of course, I don’t want to build all the lid matrices just to know the
>> max lid, so here’s what I’ve done:
>>
>>  
>>
>>       * I added a field to the osm_switch_t object: max_lid_ho (with
>>         default value 0xFFFF, should it 
>>         be 0x0 instead?).
> 
> 0 seems better to me but I'm not sure what else this impacts.

Agree.
 
> Note also there are other 0xffff initializations similar to this which
> IMO are also candidates for change :-(
> 
> >>       * Added three osm_switch_t methods for this new field:
>>         getter, setter, and is_set that returns
>>         true if this field has been set.
> 
> Is is_set really needed ?

No, it's not - I added it just to 'encapsulate' the default value of the
new field, so that this initialization value will remain osm_switch_t internal. 
But we can access the field directly instead.
We can also replace it by something like osm_switch_lmx_exists() to make it look
more general.
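
For illustration only, the getter fallback described in this thread could
look roughly like the standalone sketch below. This is not the actual patch:
sketch_switch_t and its fields are made-up stand-ins for osm_switch_t, the
lid matrix is reduced to a single precomputed value, and 0 is assumed to be
the "not set" default, as suggested above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for osm_switch_t; only what the sketch needs. */
typedef struct sketch_switch {
	uint16_t max_lid_ho;     /* set by the routing engine; 0 = not set */
	uint16_t lmx_max_lid_ho; /* what a built lid matrix would report */
	bool lmx_built;          /* were the lid matrices built at all? */
} sketch_switch_t;

/* Setter: a routing engine that skips the lid matrices calls this. */
static void sketch_switch_set_max_lid_ho(sketch_switch_t *p_sw, uint16_t lid)
{
	p_sw->max_lid_ho = lid;
}

/* is_set: keeps the "0 means unset" convention in one place. */
static bool sketch_switch_max_lid_is_set(const sketch_switch_t *p_sw)
{
	return p_sw->max_lid_ho != 0;
}

/*
 * Getter: prefer the explicitly set value; otherwise fall back to the
 * lid matrix if it exists, and report 0 if neither source is available.
 */
static uint16_t sketch_switch_get_max_lid_ho(const sketch_switch_t *p_sw)
{
	if (sketch_switch_max_lid_is_set(p_sw))
		return p_sw->max_lid_ho;
	return p_sw->lmx_built ? p_sw->lmx_max_lid_ho : 0;
}
```

With this shape, a routing engine that builds lid matrices never touches
the new field, and a dump routine can use the is_set check to decide
whether hop-based output makes sense.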

>>       * The original osm_switch_get_max_lid_ho() has been updated to
>>         return this field value if it’s set.
>>       * Then in FatTree routing I set this field for each switch (I
>>         get the max lid ‘for free’ as a byproduct
>>         of the algorithm).
>>       * Now everything in the ucast manager works fine, except for the
>>         following two dump functions:
>>                 __osm_ucast_mgr_dump_ucast_routes (it uses hops)
>>                 ucast_mgr_dump_lid_matrix (obviously…)
>>         These two functions check at the beginning whether the
>>         max_lid_ho was set (using the ‘is_set’
>>         method), and return w/o printing anything if the answer is
>>         yes.
> 
> Perhaps a dump routine is a routine which each routing protocol should
> supply ?

Good idea. This way the dump function will dump whatever is relevant to 
a certain routing engine.

-- Yevgeny

> -- Hal
> 
>> This way any other routing engine that uses lid matrix is not affected
>> by this change, and any routing 
>>
>> engine that doesn’t use the lid matrix has a way to set the max lid
>> per switch explicitly.
>>
>>  
>>
>> This approach works great, but I have a feeling that this is kinda
>> hack…
>>
>>  
>>
>> What do you think about this solution?
>>
>> Any other suggestions?
>>
>>  
>>
>> Anyway, just wanted to hear your opinion before sending the patch.
>>
>>    
>>
>> Regards,
>>
>>  
>>
>> Yevgeny Kliteynik
>>
>>  
>>
>> Mellanox Technologies LTD
>>
>> Tel: +972-4-909-7200 ext: 394
>>
>> Fax: +972-4-959-3245
>>
>> P.O. Box 586 Yokneam 20692 ISRAEL 
>>
>>  
>>
>>


From eitan at mellanox.co.il  Tue Dec 19 05:00:35 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 15:00:35 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-19:normal
 completion
In-Reply-To: <1166530303.32666.242284.camel@hal.voltaire.com>
References: <200612190523.kBJ5NvBn018210@sw053.yok.mtl.com>
	<458796D3.50709@mellanox.co.il>
	<1166530303.32666.242284.camel@hal.voltaire.com>
Message-ID: <4587E273.4020408@mellanox.co.il>

Hal Rosenstock wrote:
> On Tue, 2006-12-19 at 02:37, Eitan Zahavi wrote:
>   
>> Clarifications:
>>
>> 1. The OpenSM code run includes the last patches I have sent.
>> 2. The single failure is due to a race in ibmgtsim. ibdiagnet waits 
>> forever for a response for a "bind" message.
>>     I suspect a deadlock between the "server" and the "node" but I am 
>> not sure.
>> 3. The regression still does not run the osmtest tests due to the fact 
>> they are all failing.
>>     
>
> Is this due to the one issue with InformInfo ?
>   
Yup.
> -- Hal
>
>   
>> EZ
>>
>> Eitan Zahavi wrote:
>>     
>>> OSM Simulation Regression Summary
>>> OpenSM rev = Mon_Dec_18_10:07:41_2006 32bfc2 MOD_FILES=3
>>> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
>>> Total=308 Pass=307 Fail=1
>>>
>>> Pass:
>>> 42 Stability IS1-16.topo
>>> 42 Pkey IS1-16.topo
>>> 42 Multicast IS1-16.topo
>>> 42 LidMgr IS1-16.topo
>>> 41 OsmStress IS1-16.topo
>>> 14 Stability IS3-loop.topo
>>> 14 Stability IS3-128.topo
>>> 14 Pkey IS3-128.topo
>>> 14 OsmStress IS3-128.topo
>>> 14 Multicast IS3-loop.topo
>>> 14 Multicast IS3-128.topo
>>> 14 LidMgr IS3-128.topo
>>>
>>> Failures:
>>> 1 OsmStress IS1-16.topo


From jsquyres at cisco.com  Tue Dec 19 05:12:35 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 19 Dec 2006 08:12:35 -0500
Subject: [openib-general] Status of old and new servers
Message-ID: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>

Some important decisions regarding the old server were made on the  
EWG call yesterday.  If you're still committing to SVN, do not ignore  
this e-mail.

1. The only guy with the password to the openfabrics.org domain is  
out of reach for the next several months.  So "openfabrics.org" and  
"www.openfabrics.org" will continue to point to the old server for  
the foreseeable future.  We have one name [that was intended to be  
temporary] that points to the new server (staging.openfabrics.org).   
Other than shutting down SVN, I'm not sure how we want to proceed  
with the rest of the server migration.

2. Committing to SVN on the old server will be disabled as of COB  
this *THURSDAY* (21 Dec 2006).  Anonymous, read-only access will  
still be supported for a short time longer.

3. The SVN database will be resynchronized with the new server on  
Friday, 22 Dec.  **If you have changes in SVN on the new server, THEY  
WILL BE LOST.**

3a. Reflecting that most activity is occurring in git, commits will  
be disabled in all SVN trees by default.  If you want your tree left  
enabled in SVN for commits, please reply to this e-mail indicating  
exactly which tree you want enabled for commits and a specific list  
of usernames that are allowed to commit to the tree.

3b. The rest of SVN will be available for read-only access /  
hysterical raisins for a few more months.  Proposed OFA SVN death  
date: March 31, 2007.  Per exceptions in 3a, much of the data at the  
SVN HEAD will be "svn rm"'ed to reflect that the most current stuff  
is now in git -- you'll have to use SVN history commands to get at  
the older stuff.  Appropriate README files will be left describing  
how to get to the history and to the various git repositories.

4. Everyone who had a commit account on the old server should already  
be setup with an account on the new server.

5. Work is progressing to figure out what content management system  
will be used to maintain the OFA web site on the new server.  In the  
meantime, the old pages will simply be copied over.  The OFA  
marketing group can figure out the rest.  --> Don't know what to do  
about the DNS issues yet.

6. For the time being, it is likely that we'll use the same wiki on  
the new server (tiki) and simply copy the content over.  --> Don't  
know what to do about the DNS issues yet.

7. You are among the elite group who managed to read this entire  
e-mail.  Congratulations.  Call your local representative to claim your  
fabulous prize.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From mst at mellanox.co.il  Tue Dec 19 05:16:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 15:16:25 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587DD0B.1030403@voltaire.com>
References: <4587DD0B.1030403@voltaire.com>
Message-ID: <20061219131625.GE30743@mellanox.co.il>


> The problems i see with the current approach are:
> 
> 1) there are three patches

Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
It is not a 100% fix, but it is the only workaround for proprietary SMs.
Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
get this addressed.  But of course this takes time.

> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
> since it assumes the opensm-tavor-quirk and it would not work with 
> opensm that does not have it nor with 3rd party/commercial SMs which do 
> not have similar quirk

cma-tavor-quirk in OFED 1.1 is broken but not by design -
the patch I posted recently fixes the bug and should work with any compliant SM.
I did not look at the opensm code specifically, but the
"15.2.5.16 PATHRECORD" is quite explicit in its requirements:

MtuSelector 2 432 In a query request:
                     3-largest MTU available
                  If MTU is specified (i.e., the ComponentMask bit for
                  MTU is 1):
                     0-greater than MTU specified
                     1-less than MTU specified
                     2-exactly the MTU specified

So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
we should simply fix it. If there are other broken SMs, we can look at how they
are broken and how to solve this.
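
As a rough illustration of the selector semantics quoted above, here is a
standalone sketch. It is not opensm code: the sketch_* names are made up,
MTUs are given in bytes rather than in the IB enum encoding, and
sketch_pick_mtu() encodes one assumed reading of "a path exists", namely
that a path able to carry a given MTU can also be offered at any smaller
standard MTU.

```c
#include <stdbool.h>
#include <stddef.h>

/* Encodings follow the MtuSelector values in the spec excerpt above. */
enum sketch_mtu_sel {
	SKETCH_SEL_GT = 0,      /* greater than MTU specified */
	SKETCH_SEL_LT = 1,      /* less than MTU specified */
	SKETCH_SEL_EQ = 2,      /* exactly the MTU specified */
	SKETCH_SEL_LARGEST = 3  /* largest MTU available */
};

/* Does an offered MTU satisfy the selector/MTU pair of the request? */
static bool sketch_mtu_acceptable(unsigned offered, enum sketch_mtu_sel sel,
				  unsigned requested)
{
	switch (sel) {
	case SKETCH_SEL_GT:
		return offered > requested;
	case SKETCH_SEL_LT:
		return offered < requested;
	case SKETCH_SEL_EQ:
		return offered == requested;
	case SKETCH_SEL_LARGEST:
		return true;
	}
	return false;
}

/*
 * Assumed reading: try every standard MTU the path can carry, largest
 * first, and return the first one the request accepts (0 = no path).
 */
static unsigned sketch_pick_mtu(unsigned path_max_mtu, enum sketch_mtu_sel sel,
				unsigned requested)
{
	static const unsigned candidates[] = { 4096, 2048, 1024, 512, 256 };
	size_t i;

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
		if (candidates[i] <= path_max_mtu &&
		    sketch_mtu_acceptable(candidates[i], sel, requested))
			return candidates[i];
	return 0;
}
```

Under that reading, a "less than 2048" request over a path whose ports all
support 2048 would be answered with 1024, whereas comparing only the
minimal port MTU against the request (sketch_mtu_acceptable(2048, ...))
rejects the path.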

> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
> and hence it was not pushed upstream, and vice versa an upstream ipoib
> code is broken with the open-sm running with the quirk!

All this is incorrect.  ipoib-selector is completely irrelevant to the MTU
issue - it's a strict compliance fix for IPoIB. IPoIB also works fine without
this patch (with or without tavor quirk activated). It does not depend on any
specific SM. It is not upstream because of style issues only and due to my lack
of time to fix it. 

-- 
MST


From ogerlitz at voltaire.com  Tue Dec 19 05:29:24 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 15:29:24 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com>
	<20061219131625.GE30743@mellanox.co.il>
Message-ID: <4587E934.6030601@voltaire.com>

I am still digesting your response where you have addressed my 
claims/concerns.

Anyway what is your response to my suggestion of applying just one 
trivial patch at the rdma cm?

Or.


From mst at mellanox.co.il  Tue Dec 19 05:37:08 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 15:37:08 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587E934.6030601@voltaire.com>
References: <4587E934.6030601@voltaire.com>
Message-ID: <20061219133708.GG30743@mellanox.co.il>

> I am still digesting your response where you have addressed my 
> claims/concerns.

Thanks for raising this issue, I'll continue to think about this. In
particular, the opensm issue that you raise needs to be addressed by the opensm
guys.

> Anyway what is your response to my suggestion of applying just one 
> trivial patch at the rdma cm?

I think this would work too but I somewhat dislike using an MTU that SM
did not give us - this looks like a spec violation to me. No?
For example, it seems this assumes that any path supports 1/2 MTU but is
that required by spec? Further, might SM make an intelligent decision in selecting
a path if we tell it what MTU we actually want to use?

-- 
MST


From halr at voltaire.com  Tue Dec 19 05:40:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 08:40:30 -0500
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager
 fail to report back correct signal
In-Reply-To: <45870A49.1070205@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
Message-ID: <1166535535.32666.246023.camel@hal.voltaire.com>

Hi Eitan,

On Mon, 2006-12-18 at 16:38, Eitan Zahavi wrote:
> Hi Hal,
> 
> This is a resend, as I did not see the previous posting, which I sent
> using git-send-email, come back from the list (probably due to misuse).
> The following patch fixes bugs in the ucast manager and pkey manager
> where they did not report the correct signal back.
> In both cases some outstanding SubnSets were ignored.
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>
> 
> --------------------------------------------------------------------------------------------
> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
> index 48837bc..a33aec7 100644
> --- a/osm/opensm/osm_pkey_mgr.c
> +++ b/osm/opensm/osm_pkey_mgr.c

A number of lines in osm_pkey_mgr.c are line wrapped. Please resubmit
this.

I am currently working on the osm_ucast_mgr.c changes though.

-- Hal


From eitan at mellanox.co.il  Tue Dec 19 05:50:07 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 15:50:07 +0200
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager
 fail to report back correct signal
In-Reply-To: <1166535535.32666.246023.camel@hal.voltaire.com>
References: <45870A49.1070205@mellanox.co.il>
	<1166535535.32666.246023.camel@hal.voltaire.com>
Message-ID: <4587EE0F.8090603@mellanox.co.il>

Hi Hal

Hope this will work

EZ


 From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001
From: Eitan Zahavi <eitan at sw053.yok.mtl.com>
Date: Mon, 18 Dec 2006 21:48:45 +0200
Subject: [PATCH] Fix cases where the pkey manager returned 
OSM_SIGNAL_DONE and not
OSM_SIGNAL_DONE_PENDING by missing some sent packets
---
 osm/opensm/osm_pkey_mgr.c |  112 
+++++++++++++++++++++++++++++++++------------
 1 files changed, 82 insertions(+), 30 deletions(-)

diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
index 48837bc..a33aec7 100644
--- a/osm/opensm/osm_pkey_mgr.c
+++ b/osm/opensm/osm_pkey_mgr.c
@@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
 
 /**********************************************************************
  **********************************************************************/
-static ib_api_status_t
+static boolean_t
 pkey_mgr_enforce_partition(
+  IN osm_log_t *p_log,
   IN const osm_req_t *p_req,
   IN const osm_physp_t *p_physp,
   IN const boolean_t enforce)
@@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
   osm_madw_context_t context;
   uint8_t payload[IB_SMP_DATA_SIZE];
   ib_port_info_t *p_pi;
+  ib_api_status_t status;
 
   if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
-    return IB_ERROR;
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0507: "
+              "No port info for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
 
-  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
-    return IB_SUCCESS;
+  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "No need to update PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+    return FALSE;
+  }
 
   memset( payload, 0, IB_SMP_DATA_SIZE );
   memcpy( payload, p_pi, sizeof(ib_port_info_t) );
@@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
   context.pi_context.light_sweep = FALSE;
   context.pi_context.active_transition = FALSE;
 
-  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
-                      payload, sizeof(payload),
-                      IB_MAD_ATTR_PORT_INFO,
-                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
-                      CL_DISP_MSGID_NONE, &context );
+  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
+                                payload, sizeof(payload),
+                                IB_MAD_ATTR_PORT_INFO,
+                                cl_hton32( osm_physp_get_port_num( 
p_physp ) ),
+                                CL_DISP_MSGID_NONE, &context );
+  if (status != IB_SUCCESS)
+  {
+     osm_log( p_log, OSM_LOG_ERROR,
+              "pkey_mgr_enforce_partition: ERR 0520: "
+              "Failed to set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+     return FALSE;
+  }
+  else
+  {
+     osm_log( p_log, OSM_LOG_DEBUG,
+              "pkey_mgr_enforce_partition: "
+              "Set PortInfo for "
+              "node 0x%016" PRIx64 " port %u\n",
+              cl_ntoh64(
+                 osm_node_get_node_guid(
+                    osm_physp_get_node_ptr( p_physp ))),
+              osm_physp_get_port_num( p_physp ) );
+      return TRUE;
+  }
 }
 
 /**********************************************************************
@@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
 
     status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, 
block_index );
     if (status == IB_SUCCESS)
-      ret_val = TRUE;
+     {
+         osm_log( p_log, OSM_LOG_DEBUG,
+                     "pkey_mgr_update_port: "
+                     "Updated "
+                     "pkey table block %d for node 0x%016" PRIx64 " 
port %u\n",
+                     block_index,
+                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+                     osm_physp_get_port_num( p_physp ) );
+         ret_val = TRUE;
+     }
     else
-      osm_log( p_log, OSM_LOG_ERROR,
-           "pkey_mgr_update_port: ERR 0506: "
-           "pkey_mgr_update_pkey_entry() failed to update "
-           "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
-           block_index,
-           cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-           osm_physp_get_port_num( p_physp ) );
+     {
+         osm_log( p_log, OSM_LOG_ERROR,
+                     "pkey_mgr_update_port: ERR 0506: "
+                     "pkey_mgr_update_pkey_entry() failed to update "
+                     "pkey table block %d for node 0x%016" PRIx64 " 
port %u\n",
+                     block_index,
+                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
+                     osm_physp_get_port_num( p_physp ) );
+     }
   }
 
   return ret_val;
@@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
   uint16_t peer_max_blocks;
   ib_api_status_t status = IB_SUCCESS;
   boolean_t ret_val = FALSE;
+  boolean_t port_info_set = FALSE;
   ib_pkey_table_t empty_block;
-
+ 
   memset(&empty_block, 0, sizeof(ib_pkey_table_t));
 
   p_physp = osm_port_get_default_phys_ptr( p_port );
@@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
     enforce = FALSE;
   }
 
-  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
-  {
-    osm_log( p_log, OSM_LOG_ERROR,
-         "pkey_mgr_update_peer_port: ERR 0507: "
-         "pkey_mgr_enforce_partition() failed to update "
-         "node 0x%016" PRIx64 " port %u\n",
-         cl_ntoh64( osm_node_get_node_guid( p_node ) ),
-         osm_physp_get_port_num( peer ) );
-  }
+  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
+      port_info_set = TRUE;
 
   if (enforce == FALSE)
-    return FALSE;
+     return port_info_set;
 
   p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
   for (block_index = 0; block_index < p_pkey_tbl->used_blocks; 
block_index++)
@@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
              osm_physp_get_port_num( peer ) );
   }
 
+  if (port_info_set) return TRUE;
   return ret_val;
 }
 
@@ -541,10 +593,10 @@ osm_pkey_mgr_process(
       signal = OSM_SIGNAL_DONE_PENDING;
     p_node = osm_port_get_parent_node( p_port );
     if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
-     pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
+            pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
                     &p_osm->subn, p_port,
                     !p_osm->subn.opt.no_partition_enforcement ) )
-      signal = OSM_SIGNAL_DONE_PENDING;       
+      signal = OSM_SIGNAL_DONE_PENDING;
   }
 
  _err:
-- 
1.4.4.1.GIT
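
The core idea of the patch can be summarized in a small standalone sketch.
It is not the opensm code: the sketch_* types and names are hypothetical,
and each port update is reduced to one of three outcomes. The point it
shows is that a helper should report TRUE only when a Set was actually
sent, so that any outstanding request turns the final signal into
"done pending".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for OSM_SIGNAL_DONE / OSM_SIGNAL_DONE_PENDING. */
typedef enum {
	SKETCH_SIGNAL_DONE,
	SKETCH_SIGNAL_DONE_PENDING
} sketch_signal_t;

/* Hypothetical outcome of trying to update one port. */
typedef enum {
	SKETCH_NO_CHANGE_NEEDED, /* port already configured, nothing sent */
	SKETCH_SEND_FAILED,      /* the Set could not be sent */
	SKETCH_SET_SENT          /* a Set is now outstanding */
} sketch_outcome_t;

/* TRUE only when a request is really on the wire. */
static bool sketch_set_was_sent(sketch_outcome_t outcome)
{
	return outcome == SKETCH_SET_SENT;
}

/* Any outstanding Set means the caller must wait for responses. */
static sketch_signal_t sketch_process(const sketch_outcome_t *outcomes,
				      size_t n)
{
	sketch_signal_t signal = SKETCH_SIGNAL_DONE;
	size_t i;

	for (i = 0; i < n; i++)
		if (sketch_set_was_sent(outcomes[i]))
			signal = SKETCH_SIGNAL_DONE_PENDING;
	return signal;
}
```

Returning a status code from the helper instead (success whether or not
anything was sent) is what lets a sent packet go unnoticed by the caller.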


From mst at mellanox.co.il  Tue Dec 19 05:52:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 15:52:48 +0200
Subject: [openib-general] out of office Dec 20-23
Message-ID: <20061219135248.GB2075@mellanox.co.il>

I'll be out of office Dec 20-23.
Ciao,

-- 
MST


From eitan at mellanox.co.il  Tue Dec 19 05:59:45 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 19 Dec 2006 15:59:45 +0200
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager
 fail to report back correct signal
In-Reply-To: <4587EE0F.8090603@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
	<1166535535.32666.246023.camel@hal.voltaire.com>
	<4587EE0F.8090603@mellanox.co.il>
Message-ID: <4587F051.2070000@mellanox.co.il>

Seems like it is line wrapped this time too.
I need a new mailer.
So I will attach the file and send it.

Sorry about that.

Eitan

Eitan Zahavi wrote:
> Hi Hal
>
> Hope this will work
>
> EZ
>
>
>  From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001
> From: Eitan Zahavi <eitan at sw053.yok.mtl.com>
> Date: Mon, 18 Dec 2006 21:48:45 +0200
> Subject: [PATCH] Fix cases where the pkey manager returned 
> OSM_SIGNAL_DONE and not
> OSM_SIGNAL_DONE_PENDING by missing some sent packets
> ---
>  osm/opensm/osm_pkey_mgr.c |  112 
> +++++++++++++++++++++++++++++++++------------
>  1 files changed, 82 insertions(+), 30 deletions(-)
>
> diff --git a/osm/opensm/osm_pkey_mgr.c b/osm/opensm/osm_pkey_mgr.c
> index 48837bc..a33aec7 100644
> --- a/osm/opensm/osm_pkey_mgr.c
> +++ b/osm/opensm/osm_pkey_mgr.c
> @@ -212,8 +212,9 @@ pkey_mgr_update_pkey_entry(
>  
>  /**********************************************************************
>   **********************************************************************/
> -static ib_api_status_t
> +static boolean_t
>  pkey_mgr_enforce_partition(
> +  IN osm_log_t *p_log,
>    IN const osm_req_t *p_req,
>    IN const osm_physp_t *p_physp,
>    IN const boolean_t enforce)
> @@ -221,12 +222,33 @@ pkey_mgr_enforce_partition(
>    osm_madw_context_t context;
>    uint8_t payload[IB_SMP_DATA_SIZE];
>    ib_port_info_t *p_pi;
> +  ib_api_status_t status;
>  
>    if (!(p_pi = osm_physp_get_port_info_ptr( p_physp )))
> -    return IB_ERROR;
> +  {
> +     osm_log( p_log, OSM_LOG_ERROR,
> +              "pkey_mgr_enforce_partition: ERR 0507: "
> +              "No port info for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +     return FALSE;
> +  }
>  
> -  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
> -    return IB_SUCCESS;
> +  if ((p_pi->vl_enforce & 0xc) == (0xc)*(enforce == TRUE))
> +  {
> +     osm_log( p_log, OSM_LOG_DEBUG,
> +              "pkey_mgr_enforce_partition: "
> +              "No need to update PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +    return FALSE;
> +  }
>  
>    memset( payload, 0, IB_SMP_DATA_SIZE );
>    memcpy( payload, p_pi, sizeof(ib_port_info_t) );
> @@ -248,11 +270,35 @@ pkey_mgr_enforce_partition(
>    context.pi_context.light_sweep = FALSE;
>    context.pi_context.active_transition = FALSE;
>  
> -  return osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
> -                      payload, sizeof(payload),
> -                      IB_MAD_ATTR_PORT_INFO,
> -                      cl_hton32( osm_physp_get_port_num( p_physp ) ),
> -                      CL_DISP_MSGID_NONE, &context );
> +  status = osm_req_set( p_req, osm_physp_get_dr_path_ptr( p_physp ),
> +                                payload, sizeof(payload),
> +                                IB_MAD_ATTR_PORT_INFO,
> +                                cl_hton32( osm_physp_get_port_num( 
> p_physp ) ),
> +                                CL_DISP_MSGID_NONE, &context );
> +  if (status != IB_SUCCESS)
> +  {
> +     osm_log( p_log, OSM_LOG_ERROR,
> +              "pkey_mgr_enforce_partition: ERR 0520: "
> +              "Failed to set PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +     return FALSE;
> +  }
> +  else
> +  {
> +     osm_log( p_log, OSM_LOG_DEBUG,
> +              "pkey_mgr_enforce_partition: "
> +              "Set PortInfo for "
> +              "node 0x%016" PRIx64 " port %u\n",
> +              cl_ntoh64(
> +                 osm_node_get_node_guid(
> +                    osm_physp_get_node_ptr( p_physp ))),
> +              osm_physp_get_port_num( p_physp ) );
> +      return TRUE;
> +  }
>  }
>  
>  /**********************************************************************
> @@ -369,15 +415,26 @@ static boolean_t pkey_mgr_update_port(
>  
>      status = pkey_mgr_update_pkey_entry( p_req, p_physp, new_block, 
> block_index );
>      if (status == IB_SUCCESS)
> -      ret_val = TRUE;
> +     {
> +         osm_log( p_log, OSM_LOG_DEBUG,
> +                     "pkey_mgr_update_port: "
> +                     "Updated "
> +                     "pkey table block %d for node 0x%016" PRIx64 " 
> port %u\n",
> +                     block_index,
> +                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> +                     osm_physp_get_port_num( p_physp ) );
> +         ret_val = TRUE;
> +     }
>      else
> -      osm_log( p_log, OSM_LOG_ERROR,
> -           "pkey_mgr_update_port: ERR 0506: "
> -           "pkey_mgr_update_pkey_entry() failed to update "
> -           "pkey table block %d for node 0x%016" PRIx64 " port %u\n",
> -           block_index,
> -           cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> -           osm_physp_get_port_num( p_physp ) );
> +     {
> +         osm_log( p_log, OSM_LOG_ERROR,
> +                     "pkey_mgr_update_port: ERR 0506: "
> +                     "pkey_mgr_update_pkey_entry() failed to update "
> +                     "pkey table block %d for node 0x%016" PRIx64 " 
> port %u\n",
> +                     block_index,
> +                     cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> +                     osm_physp_get_port_num( p_physp ) );
> +     }
>    }
>  
>    return ret_val;
> @@ -405,8 +462,9 @@ pkey_mgr_update_peer_port(
>    uint16_t peer_max_blocks;
>    ib_api_status_t status = IB_SUCCESS;
>    boolean_t ret_val = FALSE;
> +  boolean_t port_info_set = FALSE;
>    ib_pkey_table_t empty_block;
> -
> + 
>    memset(&empty_block, 0, sizeof(ib_pkey_table_t));
>  
>    p_physp = osm_port_get_default_phys_ptr( p_port );
> @@ -439,18 +497,11 @@ pkey_mgr_update_peer_port(
>      enforce = FALSE;
>    }
>  
> -  if (pkey_mgr_enforce_partition( p_req, peer, enforce ) != IB_SUCCESS)
> -  {
> -    osm_log( p_log, OSM_LOG_ERROR,
> -         "pkey_mgr_update_peer_port: ERR 0507: "
> -         "pkey_mgr_enforce_partition() failed to update "
> -         "node 0x%016" PRIx64 " port %u\n",
> -         cl_ntoh64( osm_node_get_node_guid( p_node ) ),
> -         osm_physp_get_port_num( peer ) );
> -  }
> +  if (pkey_mgr_enforce_partition( p_log, p_req, peer, enforce))
> +      port_info_set = TRUE;
>  
>    if (enforce == FALSE)
> -    return FALSE;
> +     return port_info_set;
>  
>    p_peer_pkey_tbl->used_blocks = p_pkey_tbl->used_blocks;
>    for (block_index = 0; block_index < p_pkey_tbl->used_blocks; 
> block_index++)
> @@ -487,6 +538,7 @@ pkey_mgr_update_peer_port(
>               osm_physp_get_port_num( peer ) );
>    }
>  
> +  if (port_info_set) return TRUE;
>    return ret_val;
>  }
>  
> @@ -541,10 +593,10 @@ osm_pkey_mgr_process(
>        signal = OSM_SIGNAL_DONE_PENDING;
>      p_node = osm_port_get_parent_node( p_port );
>      if ( ( osm_node_get_type( p_node ) != IB_NODE_TYPE_SWITCH ) &&
> -     pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
> +            pkey_mgr_update_peer_port( &p_osm->log, &p_osm->sm.req,
>                      &p_osm->subn, p_port,
>                      !p_osm->subn.opt.no_partition_enforcement ) )
> -      signal = OSM_SIGNAL_DONE_PENDING;       
> +      signal = OSM_SIGNAL_DONE_PENDING;
>    }
>  
>   _err:
>   


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 0003-Fix-cases-where-the-pkey-manager-returned-OSM_SIGNAL_DONE-and-not.txt
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061219/57a2163b/attachment.txt>
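
For readers following the thread, the pattern the patch moves towards can
be sketched roughly as below. This is only an illustration under stated
assumptions, not the OpenSM code: signal_t, port_state_t and the helper
names are hypothetical stand-ins. Each helper returns true only when it
really sent a request, and the caller reports "done pending" if any
helper did.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two signal values named in the patch. */
typedef enum { SIGNAL_DONE, SIGNAL_DONE_PENDING } signal_t;

/* Hypothetical per-port state: which settings still need a Set request. */
typedef struct {
    bool pkey_table_dirty;
    bool port_info_dirty;
} port_state_t;

/* Returns true only if a pkey table update was actually sent. */
static bool update_pkey_table(port_state_t *p)
{
    if (!p->pkey_table_dirty)
        return false;
    p->pkey_table_dirty = false; /* models "request sent" */
    return true;
}

/* Returns true only if a PortInfo update was actually sent. */
static bool enforce_partition(port_state_t *p)
{
    if (!p->port_info_dirty)
        return false;
    p->port_info_dirty = false; /* models "request sent" */
    return true;
}

/* Reports DONE_PENDING if any helper sent a request for any port. */
signal_t process_ports(port_state_t *ports, size_t n)
{
    signal_t signal = SIGNAL_DONE;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Call both helpers unconditionally so neither send is skipped. */
        if (update_pkey_table(&ports[i]))
            signal = SIGNAL_DONE_PENDING;
        if (enforce_partition(&ports[i]))
            signal = SIGNAL_DONE_PENDING;
    }
    return signal;
}
```

The two helpers are deliberately not combined with "||", since
short-circuit evaluation would skip the second send whenever the first
one reported true.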

From ogerlitz at voltaire.com  Tue Dec 19 06:01:04 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:01:04 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com>
	<20061219131625.GE30743@mellanox.co.il>
Message-ID: <4587F0A0.1080401@voltaire.com>

Michael S. Tsirkin wrote:
>> 1) there are three patches
> 
> Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> It is not 100% but the only work around for proprietary SMs.
> Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
> get this addressed.  But of course this takes time.

cma_tavor_quirk.patch matches the patch you have applied to opensm, 
so there are at least two patches, one in the stack and one in 
opensm. I don't think you can assume that all SA vendors will adopt the 
opensm approach, so the fixed cma tavor quirk that you suggested today 
is useless with their SMs (certainly before they even consider applying 
it, so someone running OFED X.Y with them would not get 1K MTU with 
Tavor).

> cma-tavor-quirk in OFED 1.1 is broken but not by design -
> the patch I posted recently fixes the bug and should work with any compliant SM.
> I did not look at the opensm code specifically, but the
> "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> 
> MtuSelector 2 432 In a query request:
>                      3-largest MTU available
>                   If MTU is specified (i.e., the ComponentMask bit for
>                   MTU is 1):
>                      0-greater than MTU specified
>                      1-less than MTU specified
>                      2-exactly the MTU specified
> 
> So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> we should simply fix it. If there are other broken SMs, we can look at how they
> are broken and how to solve this.

The SM team here doesn't think our SM is broken just because it does not 
return a 1K path MTU when the minimal MTU reported in the port info along 
the path is 2K, and as I told you, opensm without the quirk does the same.
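
To make the disagreement concrete, here is a minimal sketch of one
possible reading of the MtuSelector table quoted above, the reading in
which a path able to carry 2K is also considered able to carry 1K. It is
an assumption-laden illustration, not the code of opensm or of any other
SM. The function and enum names are made up; the numeric values follow
the IBA encoding (MTU codes 1..5 for 256..4096 bytes, selectors 0..3 as
in the table).

```c
#include <stdbool.h>

/* IBA MTU codes: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
enum mtu_code { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* MtuSelector values from the PathRecord table quoted in this thread. */
enum mtu_selector { SEL_GT = 0, SEL_LT = 1, SEL_EQ = 2, SEL_LARGEST = 3 };

/*
 * path_mtu is the smallest MTU supported along the path.
 * Returns the MTU code to report, or 0 if no path satisfies the request.
 * Assumption: a path may be reported with a smaller MTU than it supports.
 */
int select_mtu(int path_mtu, int selector, int req_mtu)
{
    switch (selector) {
    case SEL_GT:
        return path_mtu > req_mtu ? path_mtu : 0;
    case SEL_LT:
        if (req_mtu <= MTU_256)
            return 0; /* nothing is smaller than the smallest MTU */
        /* largest MTU that is both supported and below the request */
        return path_mtu < req_mtu ? path_mtu : req_mtu - 1;
    case SEL_EQ:
        return path_mtu >= req_mtu ? req_mtu : 0;
    case SEL_LARGEST:
    default:
        return path_mtu;
    }
}
```

Under this reading a "less than 2048" query on a 2K path yields 1024.
An SM that only matches the raw minimal port MTU against the selector
would instead find no path for the same query.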

> 
>> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
>> and hence it was not pushed upstream, and vise-versa an upstream ipoib
>> code is broken with the open-sm running with the quirk!
> 
> All this is incorrect.  ipoib-selector is completely irrelevant to the MTU
> issue - its a strict compliance fix for IPoIB. IPoIB also works fine without
> this patch (with or without tavor quirk activated). It does not depend on any
> specific SM. It is not upstream because of style issues only and due to my lack
> of time to fix it. 

this reminds me that there is a need to do OFED 1.1 wrapup in the sense 
we have to see which patches from the kernel_patches/fixes directory 
were ***not*** pushed upstream to 2.6.19-rcX nor 2.6.20-rc1 and then 
conduct some sort of discussion on each to decide what to do with it for 
OFED 1.2

> IB/ipoib: user appropriate mtu selector for path queries
> 
> IPoIB must set mtu selector in path record query according to dev->mtu:
> if we wildcard it, SM can select a path with lower MTU.
> This breaks IPoIB on networks with SM Tavor quirk activates.

mmm, re-reading the opensm code, I think you are right that the 
ipoib-selector patch is independent of the opensm tavor quirk, but then 
I don't understand what you were trying to say in the above two lines of 
the change log: what can break the SM tavor quirk???

Or.


From mst at mellanox.co.il  Tue Dec 19 06:17:36 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 16:17:36 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587F0A0.1080401@voltaire.com>
References: <4587F0A0.1080401@voltaire.com>
Message-ID: <20061219141736.GD2075@mellanox.co.il>

I'm really going off in a hurry, but for now:

> Quoting r. Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: tavor quirks etc (opensm compliance etc)
> 
> Michael S. Tsirkin wrote:
> >> 1) there are three patches
> > 
> > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> > It is not 100% but the only work around for proprietary SMs.
> > Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
> > get this addressed.  But of course this takes time.
> 
> cma_tavor_quirk.patch matches the patch you have applied to the opensm, 
>   so there are two patches at least, one at the stack and one at the 
> opensm. I don't think you can assume that all SA vendors would apply the 
> opensm approach and hence running with them the fixed cma tavor quirk as 
> which you have suggested today is useless with them (Specifically before 
> they even consider to apply it... so if someone runs OFED X.Y they would 
> not get 1K mtu with Tavor)

See below. With (fixed) cma_tavor_quirk.patch we are asking the SA
to give us a path with 1/2K MTU. If such path exists, SA should give it to us,
if it does not exist we should not try using it.

> > cma-tavor-quirk in OFED 1.1 is broken but not by design -
> > the patch I posted recently fixes the bug and should work with any compliant SM.
> > I did not look at the opensm code specifically, but the
> > "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> > 
> > MtuSelector 2 432 In a query request:
> >                      3-largest MTU available
> >                   If MTU is specified (i.e., the ComponentMask bit for
> >                   MTU is 1):
> >                      0-greater than MTU specified
> >                      1-less than MTU specified
> >                      2-exactly the MTU specified
> > 
> > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> > we should simply fix it. If there are other broken SMs, we can look at how they
> > are broken and how to solve this.
> 
> The SM team here don't think our SM is broken b/c it does not return 1K 
> path mtu where the minimal mtu as reported in the port info along the 
> path is 2k, and as i told you so does opensm without the quirk

Doesn't make sense to me, and I don't understand how you interpret MtuSelector.
Does the port really report minimal MTU 2K?
So how does lower MTU work at all then?
Are you saying some HCA/switch has a broken SMA?
Eitan?

> > 
> >> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
> >> and hence it was not pushed upstream, and vise-versa an upstream ipoib
> >> code is broken with the open-sm running with the quirk!
> > 
> > All this is incorrect.  ipoib-selector is completely irrelevant to the MTU
> > issue - its a strict compliance fix for IPoIB. IPoIB also works fine without
> > this patch (with or without tavor quirk activated). It does not depend on any
> > specific SM. It is not upstream because of style issues only and due to my lack
> > of time to fix it. 
> 
> this reminds me that there is a need to do OFED 1.1 wrapup in the sense 
> we have to see which patches from the kernel_patches/fixes directory 
> were ***not*** pushed upstream to 2.6.19-rcX nor 2.6.20-rc1 and then 
> conduct some sort of discussion on each to decide what to do with it for 
> OFED 1.2

Most were pushed, a couple are outstanding. It's on my TODO,
but if you want to start working on it go ahead.

> > IB/ipoib: user appropriate mtu selector for path queries
> > 
> > IPoIB must set mtu selector in path record query according to dev->mtu:
> > if we wildcard it, SM can select a path with lower MTU.
> > This breaks IPoIB on networks with SM Tavor quirk activates.
> 
> mmm, re-reading the open sm code, i think you are right that the 
> ipoib-selector patch is independent of the open SM tavor quirk, but than 
> i don't understand what you were trying to say in the above two lines of 
> the change log, what can break the SM tavor quirk???
> 
> Or.

The change log is wrong there.


-- 
MST


From halr at voltaire.com  Tue Dec 19 06:17:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 09:17:08 -0500
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager
 fail to report back correct signal
In-Reply-To: <45870A49.1070205@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
Message-ID: <1166537769.32666.247557.camel@hal.voltaire.com>

On Mon, 2006-12-18 at 16:38, Eitan Zahavi wrote:
> Hi Hal,
> 
> This is a resend, as I did not see my previous posting come back from the 
> list; I sent it using git-send-email (probably misusing it).
> The following patch fixes bugs in the ucast manager and pkey manager 
> that caused them not to report the correct signal back.
> In both cases some outstanding SubnSet requests were ignored.
> 
> Signed-off-by:  Eitan Zahavi <eitan at mellanox.co.il>

> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index e977253..8cfe09e 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c

Thanks! I applied the osm_ucast_mgr.c part of this (and not the
osm_pkey_mgr.c part).

-- Hal


From ogerlitz at voltaire.com  Tue Dec 19 06:27:44 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:27:44 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219141736.GD2075@mellanox.co.il>
References: <4587F0A0.1080401@voltaire.com>
	<20061219141736.GD2075@mellanox.co.il>
Message-ID: <4587F6E0.10000@voltaire.com>

Michael S. Tsirkin wrote:
> I'm really going off in a hurry, but for now:

Enjoy your vacation, and don't worry: let's discuss this next week when 
you are back. If you want, you or Eitan or anyone else who wants to jump 
in can send an RFC with the two patches (cma and opensm tavor quirks), 
and we can discuss why they are better than my simplified patch, what 
the associated dependencies are, etc.

Or.


From kliteyn at dev.mellanox.co.il  Tue Dec 19 06:27:03 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 16:27:03 +0200
Subject: [openib-general] OSM: Using lid matrices in ucast manager
In-Reply-To: <1166530077.32666.242175.camel@hal.voltaire.com>
References: <6C2C79E72C305246B504CBA17B5500C980BFFD@mtlexch01.mtl.com>
	<1166491856.29306.15.camel@localhost>
	<1166530077.32666.242175.camel@hal.voltaire.com>
Message-ID: <4587F6B7.7070805@dev.mellanox.co.il>

Hal,

Hal Rosenstock wrote:
> Hi Yevgeny & Sasha,
> 
> On Mon, 2006-12-18 at 20:30, Sasha Khapyorsky wrote:
>> Hi Yevgeny,
>>
>> On Tue, 2006-12-19 at 01:33 +0200, Yevgeny Kliteynik wrote:
>>> Hi Hal.
>>>
>>> I have a question about a patch that I want to send regarding lid
>>> matrices usage in the osm ucast manager:
>>>
>>> The FatTree routing doesn't use the min hop tables, so we can skip the
>>> lid matrices building in OSM.
>> The lid matrices are used in mcast_mgr for multicast routes generation.
> 
> Good point but fat tree seems to work for multicast (at least in my
> subnet). How could that be ?

The patch that's checked in doesn't disable lid matrices creation,
so we have those matrices created, and then fat-tree routing configures
LFTs, ignoring the lid matrices.

-- Yevgeny
 
> -- Hal
> 
>>> However, ucast manager uses these lid matrices also to get the max lid
>>> that is accessible from each switch, which defines the LFT table size.
>>>
>>> This max lid is obtained by calling the osm_switch_get_max_lid_ho()
>>> function, which in turn calls osm_lid_matrix_get_max_lid_ho() for the
>>> switch's lid matrix.
>>>
>>> If the lid matrices weren't built, then the osm_switch_get_max_lid_ho()
>>> function will return 0xFFFF, and eventually osm will crash.
>>>
>>> Of course, I don't want to build all the lid matrices just to know the
>>> max lid, so here's what I've done:
>>>
>>>       * I added a field to the osm_switch_t object: max_lid_ho (with
>>>         default value 0xFFFF, should it 
>>>         be 0x0 instead?).
>> Good thing. 0 is fine as default value IMHO.
>>
>>>       * Also added three osm_switch_t methods for this new field:
>>>         getter, setter, and is_set that returns
>>>         true if this field has been set.
>> Why those methods? Everything you need is to access structure field and
>> 'if (sw->max_lid_ho)' for "is_set" checks.
>>
>>>       * The original osm_switch_get_max_lid_ho() has been updated to
>>>         return this field value if it's set.
>>>       * Then in FatTree routing I set this field for each switch (I
>>>         get the max lid 'for free' as a byproduct
>>>         of the algorithm).
>>>       * Now everything in the ucast manager works fine, except for the
>>>         following two dump functions:
>>>                 __osm_ucast_mgr_dump_ucast_routes (it uses hops)
>>>                 ucast_mgr_dump_lid_matrix (obviously...)
>>>         These two functions check at the beginning whether the
>>>         max_lid_ho was set (using the 'is_set'
>>>         method), and return w/o printing anything if the answer is
>>>         yes.
>>>
>>> This way any other routing engine that uses lid matrix is not affected
>>> by this change, and any routing engine that doesn't use the lid matrix
>>> has a way to set the max lid per switch explicitly.
>> Hope you are adding this for existing code.
>>
>>> This approach works great, but I have a feeling that this is kinda
>>> hack...
>> Moving max_lid(_ho) to switch structure looks like a good idea for me
>> regardless to lid matrix build elimination.
>>
>> The only problem I can see with lid matrices is mcast_mgr which uses
>> this.
>>
>> Sasha
>>
>>>
>>> What do you think about this solution?
>>> Any other suggestions?
>>>
>>> Anyway, just wanted to hear your opinion before sending the patch.
>>>
>>> Regards,
>>>
>>> Yevgeny Kliteynik
>>> Mellanox Technologies LTD
>>> Tel: +972-4-909-7200 ext: 394
>>> Fax: +972-4-959-3245
>>> P.O. Box 586 Yokneam 20692 ISRAEL
>>>
> 
> 
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> 
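
The max_lid_ho proposal quoted above can be pictured roughly as follows.
This is a hedged sketch only: sw_t, its fields and sw_get_max_lid_ho()
are hypothetical stand-ins for osm_switch_t and its accessor, and it
assumes the default of 0 ("not set") that Sasha suggested rather than
0xFFFF.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of osm_switch_t. */
typedef struct {
    uint16_t max_lid_ho;         /* 0 means "not set by the routing engine" */
    uint16_t lid_matrix_max_lid; /* models the value the lid matrix yields */
} sw_t;

/*
 * Prefer the explicitly set value; fall back to the lid matrix otherwise.
 * With 0 as the default, no separate is_set method is needed.
 */
uint16_t sw_get_max_lid_ho(const sw_t *sw)
{
    if (sw->max_lid_ho)
        return sw->max_lid_ho;
    return sw->lid_matrix_max_lid;
}
```

A routing engine that does not build lid matrices would store the max lid
into the field directly, while engines that do build them are unaffected
because the fallback path still consults the matrix.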


From tziporet at dev.mellanox.co.il  Tue Dec 19 06:43:22 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 16:43:22 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <20061219122453.GC30743@mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net>
	<3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
	<20061219122453.GC30743@mellanox.co.il>
Message-ID: <4587FA8A.5070204@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>> So after a bit more testing, setting the route path mtu to 1024 before
>> the qp creation (rdma_create_qp()) seems sufficient.
>>     
>
> OK, so the following fixes the tavor_quirk flag in cma to actually do something.
> Could you please replace the patch cma_tavor_quirk.patch with this one,
> set tavor_quirk option for cma module, and see if this works as expected?
>
> Unpack OFED 1.1, copy the following to
> OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
> removing the patch by the same name that is in OFED
> (also remove xxx_cma_tavor_quirk.txt or other patches if you put them there)
> and then pack OFED 1.1 and rebuild.
>
>
> Thanks,
>
>
>   
Hi Or,
Can you update OFED support page on Wiki with this issue?

Thanks,
Tziporet


From ogerlitz at voltaire.com  Tue Dec 19 06:46:44 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Tue, 19 Dec 2006 16:46:44 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <4587FA8A.5070204@dev.mellanox.co.il>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055711DD@idaexc03.emea.cpqcorp.net>
	<3F3894AC7A13B04E83CEBC95CFD3047E05571207@idaexc03.emea.cpqcorp.net>
	<20061219122453.GC30743@mellanox.co.il>
	<4587FA8A.5070204@dev.mellanox.co.il>
Message-ID: <4587FB54.3050502@voltaire.com>

Tziporet Koren wrote:
> Hi Or,
> Can you update OFED support page on Wiki with this issue?

Basically yes, but actually not yet...

We (Michael and I) do not yet agree on some issues here. Also, the 
cma tavor quirk will not work with some 3rd party SM/SAs, so for the time 
being I will also put a note there on how to do it at the ULP level (e.g. 
the way Philippe fixed Lustre).

Or.
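
As a rough model of the ULP-level workaround reported earlier in the
thread (setting the route path MTU to 1024 before rdma_create_qp()), the
idea is simply to cap the MTU of the resolved path before the QP is
created. The sketch below is self-contained and uses made-up types:
path_rec_model and cap_path_mtu() are hypothetical, and a real ULP would
adjust the path record held by its rdma_cm_id instead.

```c
/* IBA MTU codes: 1 = 256, 2 = 512, 3 = 1024, 4 = 2048, 5 = 4096 bytes. */
enum mtu_code { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* Hypothetical stand-in for the resolved path record of a connection. */
struct path_rec_model {
    int mtu; /* one of enum mtu_code */
};

/* Lower the path MTU to cap if it is larger; never raise it. */
void cap_path_mtu(struct path_rec_model *path, int cap)
{
    if (path->mtu > cap)
        path->mtu = cap;
}
```

A ULP would apply the cap (MTU_1024 for the Tavor case) after route
resolution completes and before it creates the QP, so the QP is set up
with the reduced MTU.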


From tziporet at mellanox.co.il  Tue Dec 19 06:53:13 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 16:53:13 +0200
Subject: [openib-general] Status of old and new servers
In-Reply-To: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>
References: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>
Message-ID: <4587FCD9.3070000@mellanox.co.il>

Jeff Squyres wrote:
> 3a. Reflecting that most activity is occurring in git, commits will  
> be disabled in all SVN trees by default.  If you want your tree left  
> enabled in SVN for commits, please reply to this e-mail indicating  
> exactly which tree you want enabled for commits and a specific list  
> of usernames that are allowed to commit to the tree.
>
>   
Me, Vlad and Hal need permission for check-in for OFED 1.1 support.
So we need check in for: https://openib.org/svn/gen2/branches/1.1/
>
> 7. You are among the elite group who managed to read this entire e- 
> mail.  Congratulations.  Call your local representative to claim your  
> fabulous prize.
>
>   
I want my prize :-)
Tziporet


From philippe_bernadat at hp.com  Tue Dec 19 06:57:00 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Tue, 19 Dec 2006 15:57:00 +0100
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <4587FB54.3050502@voltaire.com>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net>

 
Koren & Or,

I am building and testing as we speak.
But my feeling is that this issue shouldn't require the user to set the
tavor_quirk param.

The stack should detect this HCA flavor at the appropriate end (active
according to Or) and should automatically adjust the MTU.

Philippe


> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> Sent: Tuesday, December 19, 2006 3:47 PM
> To: Tziporet Koren
> Cc: Michael S. Tsirkin; Bernadat, Philippe; Roland Dreier; 
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> Tziporet Koren wrote:
> > Hi Or,
> > Can you update OFED support page on Wiki with this issue?
> 
> Basically, yes but actually, not...
> 
> We (Michael and myself) do not agree yet on some issues here, 
> also the 
> cma tavor quirk will not work with some 3rd party SM/SA, so 
> for the time 
> being i will also put there a note on how to do it in the ULP 
> level (eg 
> as Philippe was fixing Lustre)
> 
> Or.
> 
> 


From jsquyres at cisco.com  Tue Dec 19 07:07:47 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Tue, 19 Dec 2006 10:07:47 -0500
Subject: [openib-general] Status of old and new servers
In-Reply-To: <4587FCD9.3070000@mellanox.co.il>
References: <8F719E6E-E041-4E7F-A955-C22FD1F1B019@cisco.com>
	<4587FCD9.3070000@mellanox.co.il>
Message-ID: <1EFD919D-A491-4238-9957-0D370F52CFBA@cisco.com>

On Dec 19, 2006, at 9:53 AM, Tziporet Koren wrote:

>> 3a. Reflecting that most activity is occurring in git, commits  
>> will  be disabled in all SVN trees by default.  If you want your  
>> tree left  enabled in SVN for commits, please reply to this e-mail  
>> indicating  exactly which tree you want enabled for commits and a  
>> specific list  of usernames that are allowed to commit to the tree.
>>
> Me, Vlad and Hal need permission for check-in for OFED 1.1 support.
> So we need check in for: https://openib.org/svn/gen2/branches/1.1/

It shall be so.

>> 7. You are among the elite group who managed to read this entire  
>> e- mail.  Congratulations.  Call your local representative to  
>> claim your  fabulous prize.
>>
> I want my prize :-)

I'm sorry ma'am, only your local representative can help you with  
that.  Please hold...

;-)

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From philippe_bernadat at hp.com  Tue Dec 19 07:20:07 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Tue, 19 Dec 2006 16:20:07 +0100
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <20061219122453.GC30743@mellanox.co.il>
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B203A@idaexc03.emea.cpqcorp.net>

Sorry to say that this still doesn't do it.
Are we sure we go through this path?

I double-checked; the code I compiled and tried was:

static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms,
                              struct cma_work *work)
{
        struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
        struct ib_sa_path_rec path_rec;
        ib_sa_comp_mask mask;

        memset(&path_rec, 0, sizeof path_rec);
        ib_addr_get_sgid(addr, &path_rec.sgid);
        ib_addr_get_dgid(addr, &path_rec.dgid);
        path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
        path_rec.numb_path = 1;

        if (tavor_quirk) {
                path_rec.mtu_selector = IB_SA_LT;
                path_rec.mtu = IB_MTU_2048;
                mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
        } else
                mask = 0;

        id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
                                id_priv->id.port_num, &path_rec, mask |
                                IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
                                IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH,
                                timeout_ms, GFP_KERNEL,
                                cma_query_handler, work, &id_priv->query);

        return (id_priv->query_id < 0) ? id_priv->query_id : 0;
}

Philippe 

> -----Original Message-----
> From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> Sent: Tuesday, December 19, 2006 1:25 PM
> To: Bernadat, Philippe
> Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org
> Subject: Re: Performance Degradation with OFED v. Voltaire(lustre)
> 
> > So after a bit more testing, setting the route path mtu to 
> 1024 before
> > the qp creation (rdma_create_qp()) seems sufficient.
> 
> OK, so the following fixes the tavor_quirk flag in cma to 
> actually do something.
> Could you please replace the patch cma_tavor_quirk.patch with 
> this one,
> set tavor_quirk option for cma module, and see if this works 
> as expected?
> 
> Unpack OFED 1.1, copy the following to
> OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
> removing the patch by the same name that is in OFED
> (also remove xxx_cma_tavor_quirk.txt or other patches if you 
> put them there)
> and then pack OFED 1.1 and rebuild.
> 
> 
> Thanks,
> 
> -----------------
> 
> Tavor systems get better performance with 1K MTU. Since there does
> not seem to be any way to find out whether the remote system 
> uses Tavor,
> add an option to limit the MTU globally.
> 
> Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
> 
> ---
> 
> diff --git a/drivers/infiniband/core/cma.c 
> b/drivers/infiniband/core/cma.c
> index 50150c8..261bf45 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
>  MODULE_DESCRIPTION("Generic RDMA CM Agent");
>  MODULE_LICENSE("Dual BSD/GPL");
>  
> +static int tavor_quirk = 0;
> +module_param_named(tavor_quirk, tavor_quirk, int, 0644);
> +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: 
> limit MTU to 1K if > 0");
> +
>  #define CMA_CM_RESPONSE_TIMEOUT 20
>  #define CMA_MAX_CM_RETRIES 3
>  
> @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct 
> rdma_id_private *id_priv, int timeout_ms,
>  {
>  	struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
>  	struct ib_sa_path_rec path_rec;
> +	ib_sa_comp_mask mask;
>  
>  	memset(&path_rec, 0, sizeof path_rec);
>  	ib_addr_get_sgid(addr, &path_rec.sgid);
> @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct 
> rdma_id_private *id_priv, int timeout_ms,
>  	path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
>  	path_rec.numb_path = 1;
>  
> +	if (tavor_quirk) {
> +		path_rec.mtu_selector = IB_SA_LT;
> +		path_rec.mtu = IB_MTU_2048;
> +		mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
> +	} else
> +		mask = 0;
> +
>  	id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
> -				id_priv->id.port_num, &path_rec,
> +				id_priv->id.port_num, &path_rec, mask |
>  				IB_SA_PATH_REC_DGID | 
> IB_SA_PATH_REC_SGID |
>  				IB_SA_PATH_REC_PKEY | 
> IB_SA_PATH_REC_NUMB_PATH,
>  				timeout_ms, GFP_KERNEL,
> 
> -- 
> MST
> 


From halr at voltaire.com  Tue Dec 19 07:21:50 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 10:21:50 -0500
Subject: [openib-general] [PATCH] osm: pkey manager and ucast manager
 fail to report back correct signal
In-Reply-To: <4587F051.2070000@mellanox.co.il>
References: <45870A49.1070205@mellanox.co.il>
	<1166535535.32666.246023.camel@hal.voltaire.com>
	<4587EE0F.8090603@mellanox.co.il> <4587F051.2070000@mellanox.co.il>
Message-ID: <1166541690.32666.249865.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 08:59, Eitan Zahavi wrote:
> Seems like it is line wrapped this time too.
> I need a new mailer.
> So I will attach the file and send it.

Or some different mail options/commands.

> Sorry about that.
> 
> Eitan
> 
> Eitan Zahavi wrote:
> > Hi Hal
> >
> > Hope this will work
> >
> > EZ
> >
> >
> >  From 557b0504ab317c470d376f15d7c6d5ed1c9d11f5 Mon Sep 17 00:00:00 2001
> > From: Eitan Zahavi <eitan at sw053.yok.mtl.com>
> > Date: Mon, 18 Dec 2006 21:48:45 +0200
> > Subject: [PATCH] Fix cases where the pkey manager returned 
> > OSM_SIGNAL_DONE and not
> > OSM_SIGNAL_DONE_PENDING by missing some sent packets
> > ---
> >  osm/opensm/osm_pkey_mgr.c |  112 

Thanks. Applied.

-- Hal


From philippe_bernadat at hp.com  Tue Dec 19 07:33:41 2006
From: philippe_bernadat at hp.com (Bernadat, Philippe)
Date: Tue, 19 Dec 2006 16:33:41 +0100
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
Message-ID: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net>

I checked. We apparently never go through this path (with lustre) 

> -----Original Message-----
> From: Bernadat, Philippe 
> Sent: Tuesday, December 19, 2006 4:20 PM
> To: Michael S. Tsirkin
> Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org
> Subject: RE: Performance Degradation with OFED v. Voltaire(lustre)
> 
> Sorry to say that this still doesn't do it.
> Are we sure we go this path ?
> 
> I double checked the code I compiled and tried was:
> 
> static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms,
>                               struct cma_work *work)
> {
>         struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
>         struct ib_sa_path_rec path_rec;
>         ib_sa_comp_mask mask;
> 
>         memset(&path_rec, 0, sizeof path_rec);
>         ib_addr_get_sgid(addr, &path_rec.sgid);
>         ib_addr_get_dgid(addr, &path_rec.dgid);
>         path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
>         path_rec.numb_path = 1;
> 
>         if (tavor_quirk) {
>                 path_rec.mtu_selector = IB_SA_LT;
>                 path_rec.mtu = IB_MTU_2048;
>                 mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
>         } else
>                 mask = 0;
> 
>         id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
>                                 id_priv->id.port_num, &path_rec, mask |
>                                 IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
>                                 IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH,
>                                 timeout_ms, GFP_KERNEL,
>                                 cma_query_handler, work, &id_priv->query);
> 
>         return (id_priv->query_id < 0) ? id_priv->query_id : 0;
> }
> 
> Philippe 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> > Sent: Tuesday, December 19, 2006 1:25 PM
> > To: Bernadat, Philippe
> > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org
> > Subject: Re: Performance Degradation with OFED v. Voltaire(lustre)
> > 
> > > So after a bit more testing, setting the route path mtu to 
> > 1024 before
> > > the qp creation (rdma_create_qp()) seems sufficient.
> > 
> > OK, so the following fixes the tavor_quirk flag in cma to 
> > actually do something.
> > Could you please replace the patch cma_tavor_quirk.patch with 
> > this one,
> > set tavor_quirk option for cma module, and see if this works 
> > as expected?
> > 
> > Unpack OFED 1.1, copy the following to
> > OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
> > removing the patch by the same name that is in OFED
> > (also remove xxx_cma_tavor_quirk.txt or other patches if you 
> > put them there)
> > and then pack OFED 1.1 and rebuild.
> > 
> > 
> > Thanks,
> > 
> > -----------------
> > 
> > Tavor systems get better performance with 1K MTU. Since there does
> > not seem to be any way to find out whether the remote system 
> > uses Tavor,
> > add an option to limit the MTU globally.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
> > 
> > ---
> > 
> > diff --git a/drivers/infiniband/core/cma.c 
> > b/drivers/infiniband/core/cma.c
> > index 50150c8..261bf45 100644
> > --- a/drivers/infiniband/core/cma.c
> > +++ b/drivers/infiniband/core/cma.c
> > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
> >  MODULE_DESCRIPTION("Generic RDMA CM Agent");
> >  MODULE_LICENSE("Dual BSD/GPL");
> >  
> > +static int tavor_quirk = 0;
> > +module_param_named(tavor_quirk, tavor_quirk, int, 0644);
> > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: 
> > limit MTU to 1K if > 0");
> > +
> >  #define CMA_CM_RESPONSE_TIMEOUT 20
> >  #define CMA_MAX_CM_RETRIES 3
> >  
> > @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct 
> > rdma_id_private *id_priv, int timeout_ms,
> >  {
> >  	struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
> >  	struct ib_sa_path_rec path_rec;
> > +	ib_sa_comp_mask mask;
> >  
> >  	memset(&path_rec, 0, sizeof path_rec);
> >  	ib_addr_get_sgid(addr, &path_rec.sgid);
> > @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct 
> > rdma_id_private *id_priv, int timeout_ms,
> >  	path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
> >  	path_rec.numb_path = 1;
> >  
> > +	if (tavor_quirk) {
> > +		path_rec.mtu_selector = IB_SA_LT;
> > +		path_rec.mtu = IB_MTU_2048;
> > +		mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
> > +	} else
> > +		mask = 0;
> > +
> >  	id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
> > -				id_priv->id.port_num, &path_rec,
> > +				id_priv->id.port_num, &path_rec, mask |
> >  				IB_SA_PATH_REC_DGID | 
> > IB_SA_PATH_REC_SGID |
> >  				IB_SA_PATH_REC_PKEY | 
> > IB_SA_PATH_REC_NUMB_PATH,
> >  				timeout_ms, GFP_KERNEL,
> > 
> > -- 
> > MST
> > 


From mst at mellanox.co.il  Tue Dec 19 07:48:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 17:48:00 +0200
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net>
Message-ID: <20061219154800.GB3428@mellanox.co.il>

Interesting. So, does lustre actually work on top of rdma_cm?

Quoting r. Bernadat, Philippe <philippe_bernadat at hp.com>:
Subject: RE: Performance Degradation with OFED v. Voltaire(lustre)

I checked. We apparently never go through this path (with lustre) 

> -----Original Message-----
> From: Bernadat, Philippe 
> Sent: Tuesday, December 19, 2006 4:20 PM
> To: Michael S. Tsirkin
> Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org
> Subject: RE: Performance Degradation with OFED v. Voltaire(lustre)
> 
> Sorry to say that this still doesn't do it.
> Are we sure we go this path ?
> 
> I double checked the code I compiled and tried was:
> 
> static int cma_query_ib_route(struct rdma_id_private *id_priv, int timeout_ms,
>                               struct cma_work *work)
> {
>         struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
>         struct ib_sa_path_rec path_rec;
>         ib_sa_comp_mask mask;
> 
>         memset(&path_rec, 0, sizeof path_rec);
>         ib_addr_get_sgid(addr, &path_rec.sgid);
>         ib_addr_get_dgid(addr, &path_rec.dgid);
>         path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
>         path_rec.numb_path = 1;
> 
>         if (tavor_quirk) {
>                 path_rec.mtu_selector = IB_SA_LT;
>                 path_rec.mtu = IB_MTU_2048;
>                 mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
>         } else
>                 mask = 0;
> 
>         id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
>                                 id_priv->id.port_num, &path_rec, mask |
>                                 IB_SA_PATH_REC_DGID | IB_SA_PATH_REC_SGID |
>                                 IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH,
>                                 timeout_ms, GFP_KERNEL,
>                                 cma_query_handler, work, &id_priv->query);
> 
>         return (id_priv->query_id < 0) ? id_priv->query_id : 0;
> }
> 
> Philippe 
> 
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:mst at mellanox.co.il] 
> > Sent: Tuesday, December 19, 2006 1:25 PM
> > To: Bernadat, Philippe
> > Cc: Or Gerlitz; Roland Dreier; openib-general at openib.org
> > Subject: Re: Performance Degradation with OFED v. Voltaire(lustre)
> > 
> > > So after a bit more testing, setting the route path mtu to 
> > 1024 before
> > > the qp creation (rdma_create_qp()) seems sufficient.
> > 
> > OK, so the following fixes the tavor_quirk flag in cma to 
> > actually do something.
> > Could you please replace the patch cma_tavor_quirk.patch with 
> > this one,
> > set tavor_quirk option for cma module, and see if this works 
> > as expected?
> > 
> > Unpack OFED 1.1, copy the following to
> > OFED-1.1/openib-1.1/kernel_patches/fixes/cma_tavor_quirk.patch
> > removing the patch by the same name that is in OFED
> > (also remove xxx_cma_tavor_quirk.txt or other patches if you 
> > put them there)
> > and then pack OFED 1.1 and rebuild.
> > 
> > 
> > Thanks,
> > 
> > -----------------
> > 
> > Tavor systems get better performance with 1K MTU. Since there does
> > not seem to be any way to find out whether the remote system 
> > uses Tavor,
> > add an option to limit the MTU globally.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>
> > 
> > ---
> > 
> > diff --git a/drivers/infiniband/core/cma.c 
> > b/drivers/infiniband/core/cma.c
> > index 50150c8..261bf45 100644
> > --- a/drivers/infiniband/core/cma.c
> > +++ b/drivers/infiniband/core/cma.c
> > @@ -48,6 +48,10 @@ MODULE_AUTHOR("Sean Hefty");
> >  MODULE_DESCRIPTION("Generic RDMA CM Agent");
> >  MODULE_LICENSE("Dual BSD/GPL");
> >  
> > +static int tavor_quirk = 0;
> > +module_param_named(tavor_quirk, tavor_quirk, int, 0644);
> > +MODULE_PARM_DESC(tavor_quirk, "Tavor performance quirk: 
> > limit MTU to 1K if > 0");
> > +
> >  #define CMA_CM_RESPONSE_TIMEOUT 20
> >  #define CMA_MAX_CM_RETRIES 3
> >  
> > @@ -1138,6 +1142,7 @@ static int cma_query_ib_route(struct 
> > rdma_id_private *id_priv, int timeout_ms,
> >  {
> >  	struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr;
> >  	struct ib_sa_path_rec path_rec;
> > +	ib_sa_comp_mask mask;
> >  
> >  	memset(&path_rec, 0, sizeof path_rec);
> >  	ib_addr_get_sgid(addr, &path_rec.sgid);
> > @@ -1145,8 +1150,15 @@ static int cma_query_ib_route(struct 
> > rdma_id_private *id_priv, int timeout_ms,
> >  	path_rec.pkey = cpu_to_be16(ib_addr_get_pkey(addr));
> >  	path_rec.numb_path = 1;
> >  
> > +	if (tavor_quirk) {
> > +		path_rec.mtu_selector = IB_SA_LT;
> > +		path_rec.mtu = IB_MTU_2048;
> > +		mask = IB_SA_PATH_REC_MTU_SELECTOR | IB_SA_PATH_REC_MTU;
> > +	} else
> > +		mask = 0;
> > +
> >  	id_priv->query_id = ib_sa_path_rec_get(id_priv->id.device,
> > -				id_priv->id.port_num, &path_rec,
> > +				id_priv->id.port_num, &path_rec, mask |
> >  				IB_SA_PATH_REC_DGID | 
> > IB_SA_PATH_REC_SGID |
> >  				IB_SA_PATH_REC_PKEY | 
> > IB_SA_PATH_REC_NUMB_PATH,
> >  				timeout_ms, GFP_KERNEL,
> > 
> > -- 
> > MST
> > 

-- 
MST


From mst at mellanox.co.il  Tue Dec 19 07:50:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 17:50:48 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <4587FB54.3050502@voltaire.com>
References: <4587FB54.3050502@voltaire.com>
Message-ID: <20061219155048.GC3428@mellanox.co.il>

> > Hi Or,
> > Can you update OFED support page on Wiki with this issue?
> 
> Basically, yes but actually, not...
> 
> We (Michael and myself) do not agree yet on some issues here, also the 
> cma tavor quirk will not work with some 3rd party SM/SA,

This last issue could be addressed by e.g. forcing MTU if SA does not
give us a path we asked for.
Any data on which SA is this?
What does it do when you set MTU selector to "less than"?
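
A minimal sketch of the fallback idea above, assuming a hypothetical
query callback rather than the real ib_sa_path_rec_get() interface:
first ask the SA for a path with MTU below the limit, and if that fails,
repeat the query without the MTU components and force the MTU locally.
The names struct path, sa_query_fn, resolve_path_capped and
sa_rejects_mtu_selector are invented for illustration only.

```c
/* Path MTU is held as an IBA MTU code: 1=256, 2=512, 3=1024, 4=2048, 5=4096. */
struct path {
	int mtu;
};

/*
 * Hypothetical SA query: returns 0 and fills *out on success, nonzero
 * if the SA returns no path.  want_lt != 0 means "MTU selector = less
 * than, MTU = limit_mtu" is part of the request.
 */
typedef int (*sa_query_fn)(int want_lt, int limit_mtu, struct path *out);

static int resolve_path_capped(sa_query_fn query, int limit_mtu,
			       struct path *out)
{
	if (limit_mtu <= 1)
		return -1;	/* nothing is smaller than 256 bytes */

	/* Preferred: let the SA pick a path with MTU below the limit. */
	if (query(1, limit_mtu, out) == 0)
		return 0;

	/* Fallback: query without MTU components, then force the MTU. */
	if (query(0, 0, out) != 0)
		return -1;
	if (out->mtu >= limit_mtu)
		out->mtu = limit_mtu - 1;
	return 0;
}

/*
 * Stand-in for an SA that fails any request carrying the MTU selector
 * and otherwise returns a 2048-byte path (assumed behaviour, for the
 * example only).
 */
static int sa_rejects_mtu_selector(int want_lt, int limit_mtu,
				   struct path *out)
{
	(void)limit_mtu;
	if (want_lt)
		return -1;
	out->mtu = 4;
	return 0;
}
```

With the stand-in SA above, resolve_path_capped(sa_rejects_mtu_selector, 4, &p)
succeeds and leaves p.mtu at 3 (1024 bytes). Whether forcing the MTU
behind the SA's back is acceptable is exactly the open question here.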

> so for the time 
> being i will also put there a note on how to do it in the ULP level (eg 
> as Philippe was fixing Lustre)

Makes sense.


-- 
MST


From mst at mellanox.co.il  Tue Dec 19 07:54:18 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 17:54:18 +0200
Subject: [openib-general] Performance Degradation with OFED v.
 Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055B1FCB@idaexc03.emea.cpqcorp.net>
Message-ID: <20061219155418.GD3428@mellanox.co.il>

Correct. But endnode does not know what's on the other side of the link,
and which path MTU is best.

This is SA's job (SA sees all the topology), and you will get exactly this behaviour
you ask for if you run opensm with "quirk mode" enabled (you must disable the *other*
SM you are running though for this to take effect).

This mode will be enabled by default in OFED 1.2.

Quoting r. Bernadat, Philippe <philippe_bernadat at hp.com>:
Subject: RE: [openib-general] Performance Degradation with OFED v. Voltaire(lustre)

 
Koren & Or,

I am building and testing as we speak.
But my feeling is that this issue shouldn't require user to set the
tavor_quirk param.

The stack should detect this HCA flavor at the appropriate end (active
according to Or) and should automatically adjust the MTU.

Philippe


> -----Original Message-----
> From: Or Gerlitz [mailto:ogerlitz at voltaire.com] 
> Sent: Tuesday, December 19, 2006 3:47 PM
> To: Tziporet Koren
> Cc: Michael S. Tsirkin; Bernadat, Philippe; Roland Dreier; 
> openib-general at openib.org
> Subject: Re: [openib-general] Performance Degradation with 
> OFED v. Voltaire(lustre)
> 
> Tziporet Koren wrote:
> > Hi Or,
> > Can you update OFED support page on Wiki with this issue?
> 
> Basically, yes but actually, not...
> 
> We (Michael and myself) do not agree yet on some issues here, 
> also the 
> cma tavor quirk will not work with some 3rd party SM/SA, so 
> for the time 
> being i will also put there a note on how to do it in the ULP 
> level (eg 
> as Philippe was fixing Lustre)
> 
> Or.
> 
> 

-- 
MST


From mst at mellanox.co.il  Tue Dec 19 08:02:21 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 18:02:21 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4587F6E0.10000@voltaire.com>
References: <4587F6E0.10000@voltaire.com>
Message-ID: <20061219160221.GE3428@mellanox.co.il>

> Subject: Re: tavor quirks etc (opensm compliance etc)
> 
> Michael S. Tsirkin wrote:
> > I'm really going off in a hurry, but for now:
> 
> enjoy your vacation, don't worry, lets discuss this next week when you 
> are back, if you want, you or Eitan or anyone else that wants to jump on 
> it can send an RFC with the two patches (cma and opensm tavor quirks), 
> and we can discuss why they are better from my simplified patch, what 
> are the associated dependencies etc etc

Or, thanks.
Note opensm support is already in, and CMA patch was also in OFED 1.1 and
were discussed before OFED 1.1 - it had a trivial typo but I just fixed the missing
comp mask selector and it will be pushed to ofed 1.2 tree at staging
in short order.

I am not yet sure what is best for upstream, so I don't really think we need
any RFCs.

We'll need data from SM guys on whether MTU selector actually works
in SMs, and if not what happens when you enable it.

-- 
MST


From tziporet at mellanox.co.il  Tue Dec 19 08:04:54 2006
From: tziporet at mellanox.co.il (Tziporet Koren)
Date: Tue, 19 Dec 2006 18:04:54 +0200
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID: <45880DA6.4040403@mellanox.co.il>

Agenda items:

   1. Daily build update
   2. OFED 1.2 features status
   3. SVN server change - decisions are covered on Jeff's mail

Meeting summary:
*1. Daily build update:*
Daily build is now based on kernel 2.6.20-rc1.
Testing status:

    * Voltaire started daily testing based on this build.
    * Qlogic - will start next week
    * IBM - will start testing this week
    * Mellanox - testing run daily based on daily build
    * Need to know what is the git branch for ucma and udapl - Sean and
      Arlin

*2. Features update:*
We reviewed the features list that is published on the Wiki and in 
general most items are on schedule for 31-Jan.
Some updates and AIs:

    * Prepare SA cache for OFED 1.2 - Sean/Woody
    * VNIC: Qlogic are working according to the How-to explanation.
      Should be ready soon.
    * ehca - Interrupt handling for IPoIB NAPI support may miss kernel
      2.6.10 but should be ready for OFED.
    * Memory windows may be dropped from libibverbs 1.1
    * QoS - coding was not started but should be OK.
    * Open MPI: alpha release will include pre-release and the final
      version will be replaced for the beta
    * MVAPICH 0.9.9 is on track for the code freeze too.
    * iWARP - no iWARP representative joined the meeting. Need an update
      from some iWARP developers regarding their progress in preparing
      iWARP for OFED 1.2.
    * Bonding module - Voltaire are working on backport patches for
      SLES10 and Redhat EL4. Wish that this module will be part of OFED
    * RDS - License issue is still pending Oracle legal department. May
      be as an add-on package

Note: There was a discussion about moving Roland's user space git from 
kernel.org server to OFA server. We put this discussion on hold since 
Roland was not at the meeting. AI - Roland to participate in the next 
meeting to close this subject.

Tziporet


From sweitzen at cisco.com  Tue Dec 19 08:58:17 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 19 Dec 2006 08:58:17 -0800
Subject: [openib-general] OFED release testing Task Force
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B9A900@xmb-sjc-216.amer.cisco.com>

I can represent Cisco.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

________________________________

	From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi
	Sent: Wednesday, November 22, 2006 10:30 AM
	To: openfabrics-ewg at openib.org
	Cc: openib-general at openib.org
	Subject: [openib-general] OFED release testing Task Force
	
	
	Hi,

	As a follow-up on the presentation prepared and presented by
Amit Krig and my-self in the OFA Meeting during SC06 I'm sending out
this e-mail as a call for participation.

	The targets of the Ad-hoc task force will be (as agreed upon in
the session we had): unify the test results formats, define release
quality criteria, define/assign ULP verification owners and enhance
interoperability finger-print in the release process.

	
	We would like to have a participant from each contributing
company and would appreciate any response sent to me with a name of a
person from the company to attend and take action on behalf of this task
force.

	BTW: I've also attached the presentation that was given in the
OFA meeting.

	<<OFED testing session.pps>> 

	Happy Holidays to every one,

	
	Nimrod  Gindi

	Mellanox Technologies Ltd.

	mail  :  nimrodg at mellanox.com

	Cell  :  +1-408-750-4801

	Office:  +1-347-342-0011

	Fax   :  +1-212-987-0275

	

From swise at opengridcomputing.com  Tue Dec 19 09:26:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Tue, 19 Dec 2006 11:26:55 -0600
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
In-Reply-To: <45880DA6.4040403@mellanox.co.il>
References: <45880DA6.4040403@mellanox.co.il>
Message-ID: <1166549215.31612.18.camel@stevo-desktop>

On Tue, 2006-12-19 at 18:04 +0200, Tziporet Koren wrote:
>     * iWARP - no iWARP representative joined the meeting. Need an
> update
>       from some iWARP developers regarding their progress in preparing
>       iWARP for OFED 1.2. 

iWARP support is in 2.6.19.  The question is really if any iWARP device
drivers/libraries will be in OFED 1.2.  

Timewise, I'm not in a position to push in the Ammasso device.  If
somebody wants it in OFED 1.2, then they should drive that.  The kernel
driver will be in OFED 1.2 because its in 2.6.19.  The library would
need an owner to do the work to get it into OFED 1.2.

I'm focusing now on getting the Chelsio drivers into 2.6.20.  

If that doesn't happen, will OFED 1.2 still entertain pulling in Chelsio
drivers? 

Either way, I cannot begin to work on OFED 1.2 with Chelsio until the
new year.


Steve.


From nimrodg at mellanox.com  Tue Dec 19 09:28:41 2006
From: nimrodg at mellanox.com (Nimrod Gindi)
Date: Tue, 19 Dec 2006 09:28:41 -0800
Subject: [openib-general] OFED release testing Task Force
Message-ID: <1E3DCD1C63492545881FACB6063A57C1AF865F@mtiexch01.mti.com>

Thanks - I will send a consolidating e-mail to the task force people and will try to have the kick off meeting 1st week of 2007

  Nimrod  Gindi
	Mellanox Technologies Ltd.
	mail:  nimrodg at mellanox.com
	Cellular:  +1-408-750-4801
	Office:    +1-347-342-0011
	Fax:        +1-212-987-0275

----- Original Message -----
From: Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>
To: Nimrod Gindi; openfabrics-ewg at openib.org <openfabrics-ewg at openib.org>
Cc: openib-general at openib.org <openib-general at openib.org>
Sent: Tue Dec 19 08:58:17 2006
Subject: RE: [openib-general] OFED release testing Task Force

I can represent Cisco.
 
Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
 

________________________________

	From: openib-general-bounces at openib.org [mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi
	Sent: Wednesday, November 22, 2006 10:30 AM
	To: openfabrics-ewg at openib.org
	Cc: openib-general at openib.org
	Subject: [openib-general] OFED release testing Task Force
	
	
	Hi,

	As a follow-up on the presentation prepared and presented by Amit Krig and my-self in the OFA Meeting during SC06 I'm sending out this e-mail as a call for participation.

	The targets of the Ad-hoc task force will be (as agreed upon in the session we had): unify the test results formats, define release quality criteria, define/assign ULP verification owners and enhance interoperability finger-print in the release process.

		We would like to have a participant from each contributing company and would appreciate any response sent to me with a name of a person from the company to attend and take action on behalf of this task force.

	BTW: I've also attached the presentation that was given in the OFA meeting.

	<<OFED testing session.pps>> 

	Happy Holidays to every one,

		Nimrod  Gindi

	Mellanox Technologies Ltd.

	mail  :  nimrodg at mellanox.com

	Cell  :  +1-408-750-4801

	Office:  +1-347-342-0011

	Fax   :  +1-212-987-0275

	

From kliteyn at dev.mellanox.co.il  Tue Dec 19 09:25:04 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 19:25:04 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219131625.GE30743@mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com>
	<20061219131625.GE30743@mellanox.co.il>
Message-ID: <45882070.8040101@dev.mellanox.co.il>

Michael,

Michael S. Tsirkin wrote:
>> The problems i see with the current approach are:
>>
>> 1) there are three patches
> 
> Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> It is not 100% but the only work around for proprietary SMs.
> Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
> get this addressed.  But of course this takes time.
> 
>> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
>> since it assumes the opensm-tavor-quirk and it would not work with 
>> opensm that does not have it nor with 3rd party/commercial SMs which do 
>> not have similar quirk
> 
> cma-tavor-quirk in OFED 1.1 is broken but not by design -
> the patch I posted recently fixes the bug and should work with any compliant SM.
> I did not look at the opensm code specifically, but the
> "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> 
> MtuSelector 2 432 In a query request:
>                      3-largest MTU available
>                   If MTU is specified (i.e., the ComponentMask bit for
>                   MTU is 1):
>                      0-greater than MTU specified
>                      1-less than MTU specified
>                      2-exactly the MTU specified
> 
> So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> we should simply fix it. If there are other broken SMs, we can look at how they
> are broken and how to solve this.
 
OSM implementation in this case matches the IB spec. 
On page 905, table 207, there's an example of such a 
request: 
	Required MTU = 4 (2048)
	Required MTUSelector = 1 ('less-than')
And then it is explained that the required path records
should have MTU of 1024 or lower.

OSM implementation converts these rules to code AS IS.

Now, what you're actually saying, is that the specification
in this case is bad. In our discussion, you said that if
you request MTU of X with MTU selector of 'less-than', you
want to also get any path records that supports MTU greater
than X, because they also support MTUs <= X.
The question is, if your understanding of spec is right, 
what's the point of having 'less-than' selector at all?
I mean, if selector says 'less-than', but you also accept 
MTU that are 'equal' and 'greater-than', then it looks like 
you actually don't care about the MTU, because any MTU would
be OK.
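
To make the strict reading concrete, here is a small sketch (not the
actual OpenSM code) of a selector check that compares the path MTU
directly against the requested one. The selector values are the ones
listed in 15.2.5.16; the MTU codes are the usual IBA encoding; the
helper names are made up for this example.

```c
#include <stdbool.h>

/* IBA MTU codes: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
enum { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* MtuSelector values as listed in 15.2.5.16 PATHRECORD. */
enum { SEL_GT = 0, SEL_LT = 1, SEL_EQ = 2, SEL_LARGEST = 3 };

/* Convert an MTU code to bytes, or -1 for an invalid code. */
static int mtu_code_to_bytes(int code)
{
	return (code >= MTU_256 && code <= MTU_4096) ? (128 << code) : -1;
}

/*
 * Strict reading: a path record matches only if its MTU stands in the
 * requested relation to req_mtu.  SEL_LARGEST ignores req_mtu.
 */
static bool mtu_matches_strict(int sel, int req_mtu, int path_mtu)
{
	switch (sel) {
	case SEL_GT:
		return path_mtu > req_mtu;
	case SEL_LT:
		return path_mtu < req_mtu;
	case SEL_EQ:
		return path_mtu == req_mtu;
	case SEL_LARGEST:
		return true;
	default:
		return false;
	}
}
```

For the request from table 207 (MTU = 4, selector = 1),
mtu_matches_strict(SEL_LT, MTU_2048, MTU_1024) is true while
mtu_matches_strict(SEL_LT, MTU_2048, MTU_2048) is false, i.e. only
records with MTU of 1024 or lower qualify.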

--Yevgeny

>> 3) the ipoib-selector patch (below) in a way assumes the open-sm quirk
>> and hence it was not pushed upstream, and vise-versa an upstream ipoib
>> code is broken with the open-sm running with the quirk!
> 
> All this is incorrect.  ipoib-selector is completely irrelevant to the MTU
> issue - its a strict compliance fix for IPoIB. IPoIB also works fine without
> this patch (with or without tavor quirk activated). It does not depend on any
> specific SM. It is not upstream because of style issues only and due to my lack
> of time to fix it. 
> 


From sweitzen at cisco.com  Tue Dec 19 09:36:59 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Tue, 19 Dec 2006 09:36:59 -0800
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B9A943@xmb-sjc-216.amer.cisco.com>


> Meeting summary:
> *1. Daily build update:*
> Daily build is now based on kernel 2.6.20-rc1.

Where is the daily build?

Scott


From mst at mellanox.co.il  Tue Dec 19 10:35:22 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 20:35:22 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45882070.8040101@dev.mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com>
	<20061219131625.GE30743@mellanox.co.il>
	<45882070.8040101@dev.mellanox.co.il>
Message-ID: <20061219183522.GC8163@mellanox.co.il>


Quoting r. Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>:
Subject: Re: [openib-general] tavor quirks etc (opensm compliance etc)

Michael,

> Michael S. Tsirkin wrote:
> >> The problems i see with the current approach are:
> >>
> >> 1) there are three patches
> > 
> > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> > It is not 100% but the only work around for proprietary SMs.
> > Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
> > get this addressed.  But of course this takes time.
> > 
> >> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
> >> since it assumes the opensm-tavor-quirk and it would not work with 
> >> opensm that does not have it nor with 3rd party/commercial SMs which do 
> >> not have similar quirk
> > 
> > cma-tavor-quirk in OFED 1.1 is broken but not by design -
> > the patch I posted recently fixes the bug and should work with any compliant SM.
> > I did not look at the opensm code specifically, but the
> > "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> > 
> > MtuSelector 2 432 In a query request:
> >                      3-largest MTU available
> >                   If MTU is specified (i.e., the ComponentMask bit for
> >                   MTU is 1):
> >                      0-greater than MTU specified
> >                      1-less than MTU specified
> >                      2-exactly the MTU specified
> > 
> > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> > we should simply fix it. If there are other broken SMs, we can look at how they
> > are broken and how to solve this.
>  
> OSM implementation in this case matches the IB spec. 
> On page 905, table 207, there's an example of such a 
> request: 
> 	Required MTU = 4 (2048)
> 	Required MTUSelector = 1 ('less-than')
> And then it is explained that the required path records
> should have MTU of 1024 or lower.
> 
> OSM implementation converts these rules to code AS IS.
> 
> Now, what you're actually saying, is that the specification
> in this case is bad. In our discussion, you said that if
> you request MTU of X with MTU selector of 'less-than', you
> want to also get any path records that supports MTU greater
> than X, because they also support MTUs <= X.
> The question is, if your understanding of spec is right, 
> what's the point of having 'less-than' selector at all?
> I mean, if selector says 'less-than', but you also accept 
> MTU that are 'equal' and 'greater-than', then it looks like 
> you actually don't care about the MTU, because any MTU would
> be OK.
> 

I believe you misrepresent what I am saying.

I understand the spec in the following way:
if I set MTU selector to less than 1K, and there is a path
that can support MTU of 1/2 K, I expect it is legal for SM
to select that path and return it to me, setting
the MTU to a value of 1/2K or less.

Whether that path *also* supports higher MTUs need not be relevant -
whether SM will prefer another path in this case is up to SM,
but it is clear that if there are paths that satisfy the request, it does not
make sense to fail the request because paths have more capabilities.
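
A sketch of this reading, under the assumption that the SA knows the
largest MTU each path can carry and may report a smaller one. The helper
mtu_to_report is hypothetical, and the "greater than" and "exactly"
cases are an extrapolation; only the "less than" case is discussed in
this thread.

```c
/* IBA MTU codes: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
enum { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* MtuSelector values as listed in 15.2.5.16 PATHRECORD. */
enum { SEL_GT = 0, SEL_LT = 1, SEL_EQ = 2, SEL_LARGEST = 3 };

/*
 * Given the largest MTU a path can carry (path_max_mtu), return the MTU
 * code an SA could report for that path, or 0 if no usable MTU
 * satisfies the request.
 */
static int mtu_to_report(int sel, int req_mtu, int path_max_mtu)
{
	int mtu;

	switch (sel) {
	case SEL_LT:
		/* any MTU below req_mtu will do; cap at what the path carries */
		mtu = (req_mtu - 1 < path_max_mtu) ? req_mtu - 1 : path_max_mtu;
		break;
	case SEL_EQ:
		mtu = (req_mtu <= path_max_mtu) ? req_mtu : 0;
		break;
	case SEL_GT:
		mtu = (path_max_mtu > req_mtu) ? path_max_mtu : 0;
		break;
	default:	/* SEL_LARGEST */
		mtu = path_max_mtu;
		break;
	}
	return (mtu >= MTU_256) ? mtu : 0;
}
```

Here mtu_to_report(SEL_LT, MTU_2048, MTU_4096) gives MTU_1024: the path
is still usable and is simply reported with a lower MTU, whereas a
direct comparison of the path's maximum MTU against the request would
drop it.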


-- 
MST


From ggrundstrom at NetEffect.com  Tue Dec 19 11:16:06 2006
From: ggrundstrom at NetEffect.com (Glenn Grundstrom)
Date: Tue, 19 Dec 2006 13:16:06 -0600
Subject: [openib-general] OFED release testing Task Force
Message-ID: <5E701717F2B2ED4EA60F87C8AA57B7CC0681FF42@venom2>

I will represent NetEffect.
 
Glenn Grundstrom.

________________________________

From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi
Sent: Tuesday, December 19, 2006 11:29 AM
To: sweitzen at cisco.com; openfabrics-ewg at openib.org
Cc: openib-general at openib.org
Subject: Re: [openib-general] OFED release testing Task Force


Thanks - I will send a consolidating e-mail to the task force people and
will try to have the kick off meeting 1st week of 2007

  Nimrod  Gindi
        Mellanox Technologies Ltd.
        mail:  nimrodg at mellanox.com
        Cellular:  +1-408-750-4801
        Office:    +1-347-342-0011
        Fax:        +1-212-987-0275

----- Original Message -----
From: Scott Weitzenkamp (sweitzen) <sweitzen at cisco.com>
To: Nimrod Gindi; openfabrics-ewg at openib.org
<openfabrics-ewg at openib.org>
Cc: openib-general at openib.org <openib-general at openib.org>
Sent: Tue Dec 19 08:58:17 2006
Subject: RE: [openib-general] OFED release testing Task Force

I can represent Cisco.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems


________________________________

        From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Nimrod Gindi
        Sent: Wednesday, November 22, 2006 10:30 AM
        To: openfabrics-ewg at openib.org
        Cc: openib-general at openib.org
        Subject: [openib-general] OFED release testing Task Force
       
       
        Hi,

        As a follow-up to the presentation prepared and given by
Amit Krig and myself at the OFA meeting during SC06, I'm sending out
this e-mail as a call for participation.

        The targets of the Ad-hoc task force will be (as agreed upon in
the session we had): unify the test results formats, define release
quality criteria, define/assign ULP verification owners and enhance
interoperability finger-print in the release process.

                We would like to have a participant from each
contributing company and would appreciate any response sent to me with a
name of a person from the company to attend and take action on behalf of
this task force.

        BTW: I've also attached the presentation that was given in the
OFA meeting.

        <<OFED testing session.pps>>

        Happy Holidays to everyone,

                Nimrod  Gindi

        Mellanox Technologies Ltd.

        mail  :  nimrodg at mellanox.com

        Cell  :  +1-408-750-4801

        Office:  +1-347-342-0011

        Fax   :  +1-212-987-0275

        

From kliteyn at dev.mellanox.co.il  Tue Dec 19 11:35:16 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 21:35:16 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
Message-ID: <45883EF4.1050705@dev.mellanox.co.il>

Hi Hal

Adding max_lid_ho field to osm_switch_t to allow routing
engines that don't use lid matrices to explicitly set the
max lid (in host order) that is reachable from the switch.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/opensm/osm_switch.h |   37 +++++++++++++++++++++++++++++++++++++
 osm/opensm/osm_switch.c         |    2 ++
 2 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
index 4570f61..d2089bd 100644
--- a/osm/include/opensm/osm_switch.h
+++ b/osm/include/opensm/osm_switch.h
@@ -107,6 +107,7 @@ typedef struct _osm_switch
 	ib_switch_info_t			switch_info;
 	osm_fwd_tbl_t				fwd_tbl;
 	osm_lid_matrix_t			lmx;
+	uint16_t				max_lid_ho;
 	osm_port_profile_t			*p_prof;
 	osm_mcast_tbl_t				mcast_tbl;
 	uint32_t				discovery_count;
@@ -129,6 +130,9 @@ typedef struct _osm_switch
 *		LID Matrix for this switch containing the hop count
 *		to every LID from every port.
 *
+*	max_lid_ho
+*		Max LID that is accessible from this switch
+* 
 *	p_pro
 *		Pointer to array of Port Profile objects for this switch.
 *
@@ -793,6 +797,8 @@ static inline uint16_t
 osm_switch_get_max_lid_ho(
 	IN const osm_switch_t* const p_sw )
 {
+	if (p_sw->max_lid_ho != 0)
+		return p_sw->max_lid_ho;
 	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
 }
 /*
@@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
 * SEE ALSO
 *********/
 
+/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
+* NAME
+*	osm_switch_set_max_lid_ho
+*
+* DESCRIPTION
+*	Set the maximum LID (host order) value accessed from this switch
+* SYNOPSIS
+*/
+static inline void
+osm_switch_set_max_lid_ho(
+	IN osm_switch_t* const p_sw,
+	IN uint16_t max_lid_ho )
+{
+	p_sw->max_lid_ho = max_lid_ho;
+}
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to a switch object.
+*
+*	max_lid_ho
+*		Max LID (host order) value accessed from this switch
+*
+* RETURN VALUES
+*	None.
+*
+* NOTES
+*
+* SEE ALSO
+*********/
+
 /****f* OpenSM: Switch/osm_switch_get_num_ports
 * NAME
 *	osm_switch_get_num_ports
diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
index 0dd3de5..4ca713a 100644
--- a/osm/opensm/osm_switch.c
+++ b/osm/opensm/osm_switch.c
@@ -122,6 +122,8 @@ osm_switch_init(
   for( port_num = 0; port_num < num_ports; port_num++ )
     osm_port_prof_construct( &p_sw->p_prof[port_num] );
 
+  p_sw->max_lid_ho = 0;
+
  Exit:
   return( status );
 }
-- 
1.4.4.1.GIT
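
The getter/setter pair in the patch above amounts to an "explicit value with fallback" pattern, where 0 means that no routing engine has set the field. A self-contained sketch of that behaviour follows. The struct and function names are simplified stand-ins, not the real osm_switch_t or osm_lid_matrix_t, and the lid matrix is reduced to a single stored maximum.

```c
#include <stdint.h>

/* Stand-in for osm_lid_matrix_t: only the max LID is modelled here. */
struct lid_matrix {
	uint16_t max_lid_ho;
};

/* Stand-in for osm_switch_t with the new field. */
struct sw {
	struct lid_matrix lmx;
	uint16_t max_lid_ho;	/* 0 means "not set by a routing engine" */
};

/* Prefer the explicitly set value; otherwise fall back to the matrix. */
uint16_t sw_get_max_lid_ho(const struct sw *p_sw)
{
	if (p_sw->max_lid_ho != 0)
		return p_sw->max_lid_ho;
	return p_sw->lmx.max_lid_ho;
}

/* Called by a routing engine that does not fill in the lid matrix. */
void sw_set_max_lid_ho(struct sw *p_sw, uint16_t max_lid_ho)
{
	p_sw->max_lid_ho = max_lid_ho;
}
```

One consequence of using 0 as the sentinel is that setting the field back to 0 restores the lid-matrix value.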


From kliteyn at dev.mellanox.co.il  Tue Dec 19 11:37:29 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 21:37:29 +0200
Subject: [openib-general] [PATCH] osm: added an option for providing dump
 function per routing engine
Message-ID: <45883F79.6090109@dev.mellanox.co.il>

Hi Hal

As you suggested, added an option for providing a dump
function per routing engine.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/opensm/osm_opensm.h |    4 ++++
 osm/opensm/osm_ucast_mgr.c      |   23 ++++++++++++++---------
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/osm/include/opensm/osm_opensm.h b/osm/include/opensm/osm_opensm.h
index 653c8ec..16fef37 100644
--- a/osm/include/opensm/osm_opensm.h
+++ b/osm/include/opensm/osm_opensm.h
@@ -104,6 +104,7 @@ struct osm_routing_engine {
 	void *context;
 	int (*build_lid_matrices)(void *context);
 	int (*ucast_build_fwd_tables)(void *context);
+	void (*ucast_dump_tables)(void *context);
 	void (*delete)(void *context);
 };
 /*
@@ -121,6 +122,9 @@ struct osm_routing_engine {
 *	ucast_build_fwd_tables
 *		The callback for unicast forwarding table generation.
 *
+*	ucast_dump_tables
+*		The callback for dumping unicast routing tables.
+*
 *	delete
 *		The delete method, may be used for routing engine
 *		internals cleanup.
diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index e051c66..fcf6f72 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -149,7 +149,7 @@ ucast_mgr_dump(osm_ucast_mgr_t *p_mgr, F
 	cl_qmap_apply_func(&p_mgr->p_subn->sw_guid_tbl, func, &dump_context);
 }
 
-static void
+void
 ucast_mgr_dump_to_file(osm_ucast_mgr_t *p_mgr, const char *file_name,
 		       void (*func)(cl_map_item_t *, void *))
 {
@@ -350,7 +350,7 @@ ucast_mgr_dump_lid_matrix(cl_map_item_t
 
 /**********************************************************************
  **********************************************************************/
-static void
+void
 ucast_mgr_dump_lfts(cl_map_item_t *p_map_item, void *cxt)
 {
 	osm_switch_t* p_sw = (osm_switch_t *)p_map_item;
@@ -1226,6 +1226,7 @@ osm_ucast_mgr_process(
   struct osm_routing_engine *p_routing_eng;
   osm_signal_t signal = OSM_SIGNAL_DONE;
   cl_qmap_t *p_sw_guid_tbl;
+  boolean_t default_routing = TRUE;
 
   OSM_LOG_ENTER( p_mgr->p_log, osm_ucast_mgr_process );
 
@@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
     build and download the switch forwarding tables.
   */
 
-  if (!p_routing_eng->ucast_build_fwd_tables ||
-      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
-  {
-    cl_qmap_apply_func( p_sw_guid_tbl,
-                        __osm_ucast_mgr_process_tbl, p_mgr );
-  }
+  if ( p_routing_eng->ucast_build_fwd_tables && 
+       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
+     default_routing = FALSE;
+  else
+     cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
 
   /* dump fdb into file: */
   if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
-    __osm_ucast_mgr_dump_tables( p_mgr );
+  {
+     if ( !default_routing && p_routing_eng->ucast_dump_tables )
+        p_routing_eng->ucast_dump_tables(p_routing_eng->context);
+     else
+        __osm_ucast_mgr_dump_tables( p_mgr );
+  }
 
   if (p_mgr->any_change)
   {
-- 
1.4.4.1.GIT
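
The selection logic in the patch above can be looked at in isolation: the engine's own dump callback is used only when the engine also built the forwarding tables successfully and actually provides a dump callback. The sketch below assumes a cut-down stand-in for struct osm_routing_engine. The names choose_dump, enum dump_used and the example_* callbacks are hypothetical, and the default build and dump steps are left to the caller.

```c
#include <stddef.h>

/* Cut-down stand-in for struct osm_routing_engine. */
struct routing_engine {
	void *context;
	int (*ucast_build_fwd_tables)(void *context);
	void (*ucast_dump_tables)(void *context);
};

/* Which dump routine ended up being used. */
enum dump_used { DUMP_DEFAULT, DUMP_ENGINE };

enum dump_used choose_dump(const struct routing_engine *eng)
{
	int default_routing = 1;

	/* the engine's tables count only if it built them successfully */
	if (eng->ucast_build_fwd_tables &&
	    eng->ucast_build_fwd_tables(eng->context) == 0)
		default_routing = 0;

	if (!default_routing && eng->ucast_dump_tables) {
		eng->ucast_dump_tables(eng->context);
		return DUMP_ENGINE;
	}
	return DUMP_DEFAULT;	/* caller runs the built-in dump */
}

/* Example callbacks for a hypothetical engine. */
int example_build_ok(void *context)   { (void)context; return 0; }
int example_build_fail(void *context) { (void)context; return -1; }
void example_dump(void *context)      { ++*(int *)context; }
```

An engine whose build callback fails, or which leaves ucast_dump_tables as NULL, falls through to the default dump, which matches the else branch in the patch.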

 
From kliteyn at dev.mellanox.co.il  Tue Dec 19 11:54:46 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 21:54:46 +0200
Subject: [openib-general] [PATCH] osm: Improving FatTree routing engi
Message-ID: <45884386.5060106@dev.mellanox.co.il>

Hi Hal.

FatTree routing engine improvements:
1. Improved building of LFTs
2. Setting max lid on osm switches
3. Using ucast manager LFT dump function
4. Stopped using global variable 'osm'
5. Improved logging
6. Some cosmetics
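
For items 1 and 2, the patch copies only as much of each switch's LFT buffer as is needed to cover LIDs 0 through the fabric's max LID, rounded up to a whole 64-entry block. A small sketch of that rounding follows. The function name and the LFT_BLOCK_LEN macro are made up for illustration, and a wider return type than the patch's uint16_t is used only to keep the sketch free of overflow concerns.

```c
#include <stdint.h>

#define LFT_BLOCK_LEN 64	/* entries per linear forwarding table block */

/*
 * Number of LFT entries needed to cover LIDs 0..max_lid_ho, rounded up
 * to a whole number of blocks.
 */
uint32_t lft_len_for_max_lid(uint16_t max_lid_ho)
{
	uint32_t entries = (uint32_t)max_lid_ho + 1;	/* LID 0 counts too */

	return LFT_BLOCK_LEN *
	       ((entries + LFT_BLOCK_LEN - 1) / LFT_BLOCK_LEN);
}
```

For example, a max LID of 63 needs one block (64 entries), while a max LID of 64 spills into a second block (128 entries).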

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |  439 +++++++++++++++++++++++++++---------------
 1 files changed, 281 insertions(+), 158 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 15e4cd0..0d7188a 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -57,9 +57,6 @@
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_switch.h>
 
-/* This var is predefined and initialized */
-extern osm_opensm_t osm;
-
 /*
  * FatTree rank is bounded between 2 and 8:
  *  - Tree of rank 1 has only trivial routing pathes,
@@ -211,14 +208,16 @@ typedef struct ftree_hca_t_ {
 
 typedef struct ftree_fabric_t_ 
 {
-   cl_qmap_t     hca_tbl;
-   cl_qmap_t     sw_tbl;
-   cl_qmap_t     sw_by_tuple_tbl;
-   uint32_t      tree_rank;
-   ftree_sw_t ** leaf_switches;
-   uint32_t      leaf_switches_num;
-   uint16_t      max_hcas_per_leaf;
-   cl_pool_t     sw_fwd_tbl_pool;
+   osm_opensm_t  * p_osm;
+   cl_qmap_t       hca_tbl;
+   cl_qmap_t       sw_tbl;
+   cl_qmap_t       sw_by_tuple_tbl;
+   uint32_t        tree_rank;
+   ftree_sw_t   ** leaf_switches;
+   uint32_t        leaf_switches_num;
+   uint16_t        max_hcas_per_leaf;
+   cl_pool_t       sw_fwd_tbl_pool;
+   uint16_t        lft_max_lid_ho;
 } ftree_fabric_t;
 
 /***************************************************
@@ -506,6 +505,7 @@ __osm_ftree_port_group_destroy(
 
 static void 
 __osm_ftree_port_group_dump(
+   IN  ftree_fabric_t *p_ftree,
    IN  ftree_port_group_t * p_group,
    IN  ftree_direction_t direction)
 {
@@ -517,7 +517,7 @@ __osm_ftree_port_group_dump(
    if (!p_group)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
       return;
 
    size = cl_ptr_vector_get_size(&p_group->ports);
@@ -533,7 +533,7 @@ __osm_ftree_port_group_dump(
       sprintf(buff + strlen(buff), "%u", p_port->port_num);
    }
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_port_group_dump:"
            "    Port Group of size %u, port(s): %s, direction: %s\n" 
            "                  Local <--> Remote GUID (LID):"
@@ -648,16 +648,17 @@ __osm_ftree_sw_destroy(
 
 static void 
 __osm_ftree_sw_dump(
+   IN  ftree_fabric_t * p_ftree,
    IN  ftree_sw_t * p_sw)
 {
    uint32_t i;
    if (!p_sw)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_sw_dump: "
            "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
           __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -665,10 +666,14 @@ __osm_ftree_sw_dump(
           p_sw->down_port_groups_num, 
           p_sw->up_port_groups_num);
 
-   for( i = 0; i < p_sw->down_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN);
-   for( i = 0; i < p_sw->up_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP);
+   for( i = 0; i < p_sw->down_port_groups_num; i++ )
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_sw->down_port_groups[i],
+                                  FTREE_DIRECTION_DOWN);
+   for( i = 0; i < p_sw->up_port_groups_num; i++ )
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_sw->up_port_groups[i],
+                                  FTREE_DIRECTION_UP);
 
 } /* __osm_ftree_sw_dump() */
 
@@ -823,23 +828,26 @@ __osm_ftree_hca_destroy(
 
 static void 
 __osm_ftree_hca_dump(
+   IN  ftree_fabric_t * p_ftree,
    IN  ftree_hca_t * p_hca)
 {
    uint32_t i;
    if (!p_hca)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_hca_dump: "
            "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
           cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), 
           p_hca->up_port_groups_num);
 
    for( i = 0; i < p_hca->up_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP);
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_hca->up_port_groups[i],
+                                  FTREE_DIRECTION_UP);
 }
 
 /***************************************************/
@@ -1050,6 +1058,10 @@ __osm_ftree_fabric_add_sw(ftree_fabric_t
    cl_qmap_insert(&p_ftree->sw_tbl,
                   p_osm_sw->p_node->node_info.node_guid,
                   &p_sw->map_item);
+
+   /* track the max lid (in host order) that exists in the fabric */
+   if (cl_ntoh16(p_sw->base_lid) > p_ftree->lft_max_lid_ho)
+      p_ftree->lft_max_lid_ho = cl_ntoh16(p_sw->base_lid);
 }
 
 /***************************************************/
@@ -1096,38 +1108,38 @@ __osm_ftree_fabric_dump(ftree_fabric_t *
    ftree_hca_t * p_hca;
    ftree_sw_t * p_sw;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
            "                       |-------------------------------|\n"
            "                       |-  Full fabric topology dump  -|\n"
            "                       |-------------------------------|\n\n");
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_fabric_dump: -- HCAs:\n");
 
    for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
          p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
          p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) )
    {
-      __osm_ftree_hca_dump(p_hca);
+      __osm_ftree_hca_dump(p_ftree, p_hca);
    }
 
    for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
    {
-      osm_log(&osm.log, OSM_LOG_DEBUG,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
               "__osm_ftree_fabric_dump: -- Rank %u switches\n", i);
       for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
             p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
             p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
       {
          if (p_sw->rank == i)
-            __osm_ftree_sw_dump(p_sw);
+            __osm_ftree_sw_dump(p_ftree, p_sw);
       }
    }
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
            "                       |---------------------------------------|\n"
            "                       |- Full fabric topology dump completed -|\n"
            "                       |---------------------------------------|\n\n");
@@ -1143,16 +1155,18 @@ __osm_ftree_fabric_dump_general_info(
    ftree_sw_t * p_sw;
    char * addition_str;
 
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n");
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "General fabric topology info\n");
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
            "============================\n");
 
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "  - FatTree rank (switches only): %u\n",
           p_ftree->tree_rank);
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "  - Fabric has %u HCAs, %u switches\n",
           cl_qmap_count(&p_ftree->hca_tbl),
           cl_qmap_count(&p_ftree->sw_tbl));
@@ -1174,13 +1188,15 @@ __osm_ftree_fabric_dump_general_info(
             addition_str = " (leaf) ";
          else
             addition_str = " ";
-         osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
-                 "  - Fabric has %u rank %u%sswitches\n",j,i,addition_str);
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+                 "__osm_ftree_fabric_dump_general_info: "
+                 "  - Fabric has %u rank %u%sswitches\n",
+                 j,i,addition_str);
    }
 
-   if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE))
+   if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE))
    {
-      osm_log(&osm.log, OSM_LOG_VERBOSE,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
               "__osm_ftree_fabric_dump_general_info: "
               "  - Root switches:\n");
       for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
@@ -1188,7 +1204,7 @@ __osm_ftree_fabric_dump_general_info(
             p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
       {
          if (p_sw->rank == 0)
-               osm_log(&osm.log, OSM_LOG_VERBOSE,
+               osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                        "__osm_ftree_fabric_dump_general_info: "
                        "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                        cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
@@ -1196,15 +1212,17 @@ __osm_ftree_fabric_dump_general_info(
                        __osm_ftree_tuple_to_str(p_sw->tuple));
       }
 
-      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_fabric_dump_general_info: "
               "  - Leaf switches (sorted by index):\n");
       for (i = 0; i < p_ftree->leaf_switches_num; i++)
       {
-            osm_log(&osm.log, OSM_LOG_VERBOSE,
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                     "__osm_ftree_fabric_dump_general_info: "
                     "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                     cl_ntoh64(osm_node_get_node_guid(
-                                 osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))),
+                                 osm_switch_get_node_ptr(
+                                    p_ftree->leaf_switches[i]->p_osm_sw))),
                     cl_ntoh16(p_ftree->leaf_switches[i]->base_lid),
                     __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple));
       }
@@ -1229,15 +1247,15 @@ __osm_ftree_fabric_dump_hca_ordering(
    char * filename = "osm-ftree-ca-order.dump";
 
    snprintf(path, sizeof(path), "%s/%s", 
-            osm.subn.opt.dump_files_dir, filename);
+            p_ftree->p_osm->subn.opt.dump_files_dir, filename);
    p_hca_ordering_file = fopen(path, "w");
    if (!p_hca_ordering_file) 
    {
-      osm_log(&osm.log, OSM_LOG_ERROR,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
               "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: "
               "cannot open file \'%s\': %s\n",
                filename, strerror(errno));
-      OSM_LOG_EXIT(&(osm.log));
+      OSM_LOG_EXIT(&p_ftree->p_osm->log);
       return;
    }
    
@@ -1383,9 +1401,9 @@ __osm_ftree_fabric_make_indexing(
    cl_list_t            bfs_list;
    ftree_sw_tbl_element_t * p_sw_tbl_element;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_make_indexing);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
            "Starting FatTree indexing\n");
 
    /* create array of leaf switches */
@@ -1411,8 +1429,8 @@ __osm_ftree_fabric_make_indexing(
       This fuction also adds the switch it into the switch_by_tuple table. */
    __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
-           "Indexing starting point:\n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_fabric_make_indexing: Indexing starting point:\n"
            "                                            - Switch rank  : %u\n"
            "                                            - Switch index : %s\n"
            "                                            - Node LID     : 0x%x\n"
@@ -1537,7 +1555,7 @@ __osm_ftree_fabric_make_indexing(
          sizeof(ftree_sw_t *),       /* size of each element */
          __osm_ftree_compare_switches_by_index); /* comparator */
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_make_indexing() */
 
 /***************************************************/
@@ -1555,15 +1573,17 @@ __osm_ftree_fabric_validate_topology(
    boolean_t            res = TRUE;
    uint8_t              i;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_validate_topology);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_fabric_validate_topology: "
            "Validating fabric topology\n");
 
    reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *));
    if ( reference_sw_arr == NULL )
    {
-      osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fat-tree routing: Memory allocation failed\n");
       return FALSE;
    }
    memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *));
@@ -1587,7 +1607,8 @@ __osm_ftree_fabric_validate_topology(
 
          if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num )
          {
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_validate_topology: "
                     "ERR AB09: Different number of upward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n",
@@ -1607,7 +1628,8 @@ __osm_ftree_fabric_validate_topology(
               reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num )
          {
             /* we're allowing some hca's to be missing */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_validate_topology: "
                     "ERR AB0A: Different number of downward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n",
@@ -1631,7 +1653,8 @@ __osm_ftree_fabric_validate_topology(
                 p_group = p_sw->up_port_groups[i];
                 if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
                 {
-                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                   osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                           "__osm_ftree_fabric_validate_topology: "
                            "ERR AB0B: Different number of ports in an upward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
@@ -1658,7 +1681,8 @@ __osm_ftree_fabric_validate_topology(
                 p_group = p_sw->down_port_groups[0];
                 if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
                 {
-                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                   osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                           "__osm_ftree_fabric_validate_topology: "
                            "ERR AB0C: Different number of ports in an downward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
@@ -1679,14 +1703,16 @@ __osm_ftree_fabric_validate_topology(
    } /* end of while */
 
    if (res == TRUE)
-      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: "
-                    "Fabric topology has been identified as FatTree\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_fabric_validate_topology: "
+              "Fabric topology has been identified as FatTree\n");
    else
-      osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
-                    "ERR AB0D: Fabric topology hasn't been identified as FatTree\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+              "__osm_ftree_fabric_validate_topology: "
+              "ERR AB0D: Fabric topology hasn't been identified as FatTree\n");
 
    free(reference_sw_arr);
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_validate_topology() */
 
@@ -1699,8 +1725,17 @@ __osm_ftree_set_sw_fwd_table(
    IN  void *context)
 {
    ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item;
-   memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN);
-   osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw);
+   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+
+   /* calculate lft length rounded up to a multiple of 64 (block length) */ 
+   uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64);
+
+   osm_switch_set_max_lid_ho(p_sw->p_osm_sw, p_ftree->lft_max_lid_ho);
+
+   memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, 
+          p_sw->lft_buf, 
+          lft_len);
+   osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw);
 }
 
 /***************************************************
@@ -1746,8 +1781,6 @@ __osm_ftree_fabric_route_upgoing_by_goin
    if (p_sw->down_port_groups_num == 0) 
        return;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down);
-
    /* foreach down-going port group (in indexing order) */
    for (i = 0; i < p_sw->down_port_groups_num; i++)
    {
@@ -1823,7 +1856,7 @@ __osm_ftree_fabric_route_upgoing_by_goin
          __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
                                             cl_ntoh16(target_lid),
                                             p_min_port->remote_port_num);
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_upgoing_by_going_down: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -1855,7 +1888,6 @@ __osm_ftree_fabric_route_upgoing_by_goin
    }
    /* done scanning all the down-going port groups */
 
-   OSM_LOG_EXIT(&(osm.log));
 } /* __osm_ftree_fabric_route_upgoing_by_going_down() */
 
 /***************************************************/
@@ -1892,8 +1924,6 @@ __osm_ftree_fabric_route_downgoing_by_go
    /* we shouldn't enter here if both real_lid and main_path are false */
    CL_ASSERT(is_real_lid || is_main_path);
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up);
-
    /* If this switch isn't a leaf switch:
       Assign upgoing ports by stepping down, starting on THIS switch. */
    if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
@@ -1909,10 +1939,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
    /* recursion stop condition - if it's a root switch, */
    if (p_sw->rank == 0)
-   {
-      OSM_LOG_EXIT(&(osm.log));
       return;
-   }
 
    /* Find the least loaded port of all the upgoing port groups
       (in indexing order of the remote switches). */
@@ -1982,7 +2009,7 @@ __osm_ftree_fabric_route_downgoing_by_go
    {
       if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n",
                  (is_real_lid)? "real" : "DUMMY",
@@ -2000,7 +2027,7 @@ __osm_ftree_fabric_route_downgoing_by_go
                                             cl_ntoh16(target_lid),
                                             p_min_port->remote_port_num);
          p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num;
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -2020,10 +2047,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
    /* we're done for the third case */
    if (!is_real_lid)
-   {
-      OSM_LOG_EXIT(&(osm.log));
       return;
-   }
 
    /* What's left to do at this point:
     *
@@ -2064,7 +2088,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
       if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  " - Routing SECONDARY path for LID 0x%x: %s --> %s\n",
                 cl_ntoh16(target_lid),
@@ -2087,7 +2111,6 @@ __osm_ftree_fabric_route_downgoing_by_go
             FALSE);      /* whether this is path to HCA that should by tracked by counters */
    }
 
-   OSM_LOG_EXIT(&(osm.log));
 } /* ftree_fabric_route_downgoing_by_going_up() */
 
 /***************************************************/
@@ -2114,7 +2137,7 @@ __osm_ftree_fabric_route_to_hcas(
    uint32_t             j;
    ib_net16_t           remote_lid;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas);
 
    /* for each leaf switch (in indexing order) */
    for(i = 0; i < p_ftree->leaf_switches_num; i++)
@@ -2133,7 +2156,7 @@ __osm_ftree_fabric_route_to_hcas(
          __osm_ftree_sw_set_fwd_table_block(p_sw,
                                             cl_ntoh16(remote_lid),
                                             p_port->port_num);
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_to_hcas: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -2154,7 +2177,7 @@ __osm_ftree_fabric_route_to_hcas(
 
       if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
                  "Routing %u dummy HCAs\n",
                  p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
          for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++)
@@ -2171,7 +2194,7 @@ __osm_ftree_fabric_route_to_hcas(
       }
    }
    /* done going through all the leaf switches */
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_route_to_hcas() */
 
 /***************************************************/
@@ -2195,7 +2218,7 @@ __osm_ftree_fabric_route_to_switches(
    ftree_sw_t         * p_sw;
    ftree_sw_t         * p_next_sw;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_switches);
 
    p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
    while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
@@ -2208,7 +2231,8 @@ __osm_ftree_fabric_route_to_switches(
                                          cl_ntoh16(p_sw->base_lid),
                                          0);
 
-      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_fabric_route_to_switches: "
               "Switch %s (LID 0x%x): routing switch-to-switch pathes\n",
               __osm_ftree_tuple_to_str(p_sw->tuple),
               cl_ntoh16(p_sw->base_lid));
@@ -2222,7 +2246,7 @@ __osm_ftree_fabric_route_to_switches(
             FALSE);         /* whether this path should by tracked by counters */
    }
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_route_to_switches() */
 
 /***************************************************
@@ -2234,18 +2258,17 @@ __osm_ftree_fabric_populate_switches(
 {
    osm_switch_t * p_osm_sw;
    osm_switch_t * p_next_osm_sw;
-   osm_opensm_t * p_osm = &osm;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches);
 
-   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl);
-   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) )
+   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl);
+   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) )
    {
       p_osm_sw = p_next_osm_sw;
       p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item );
       __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw);
    }
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return 0;
 } /* __osm_ftree_fabric_populate_switches() */
 
@@ -2258,12 +2281,11 @@ __osm_ftree_fabric_populate_hcas(
 {
    osm_node_t   * p_osm_node;
    osm_node_t   * p_next_osm_node;
-   osm_opensm_t * p_osm = &osm;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas);
 
-   p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl);
-   while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) )
+   p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl);
+   while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) )
    {
       p_osm_node = p_next_osm_node;
       p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item);
@@ -2278,16 +2300,17 @@ __osm_ftree_fabric_populate_hcas(
             /* all the switches added separately */
             break;
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_populate_hcas: ERR AB0E: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_osm_node)));
-            OSM_LOG_EXIT(&(osm.log));
+            OSM_LOG_EXIT(&p_ftree->p_osm->log);
             return -1;
       }
    }
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return 0;
 } /* __osm_ftree_fabric_populate_hcas() */
 
@@ -2372,7 +2395,7 @@ __osm_ftree_rank_switches_from_hca(
    static uint16_t i = 0;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca);
 
    for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++)
    {
@@ -2388,7 +2411,8 @@ __osm_ftree_rank_switches_from_hca(
       {
          case IB_NODE_TYPE_CA:
             /* HCA connected directly to another HCA - not FatTree */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_rank_switches_from_hca: ERR AB0F: "
                     "HCA conected directly to another HCA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
@@ -2405,7 +2429,8 @@ __osm_ftree_rank_switches_from_hca(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_rank_switches_from_hca: ERR AB10: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_remote_osm_node)));
@@ -2423,7 +2448,8 @@ __osm_ftree_rank_switches_from_hca(
       if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0)
          continue;
 
-      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_rank_switches_from_hca: "
               "Marking rank of switch that is directly connected to HCA:\n"
               "                                            - HCA guid   : 0x%016" PRIx64 "\n"
               "                                            - Switch guid: 0x%016" PRIx64 "\n"
@@ -2435,7 +2461,7 @@ __osm_ftree_rank_switches_from_hca(
    }
 
  Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_rank_switches_from_hca() */
 
@@ -2495,7 +2521,8 @@ __osm_ftree_fabric_construct_hca_ports(
 
          case IB_NODE_TYPE_CA:
             /* HCA connected directly to another HCA - not FatTree */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
                     "HCA conected directly to another HCA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_node)),
@@ -2508,7 +2535,8 @@ __osm_ftree_fabric_construct_hca_ports(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_hca_ports: ERR AB12: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(remote_node_guid),
                     ib_get_node_type_str(remote_node_type));
@@ -2625,7 +2653,8 @@ __osm_ftree_fabric_construct_sw_ports(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_sw_ports: ERR AB13: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(remote_node_guid),
                     ib_get_node_type_str(remote_node_type));
@@ -2646,6 +2675,10 @@ __osm_ftree_fabric_construct_sw_ports(
             remote_node_type,                           /* remote node type */           
             p_remote_hca_or_sw,                         /* remote ftree_hca/sw object */ 
             direction);                                 /* port direction (up or down) */
+
+      /* Track the max lid (in host order) that exists in the fabric */
+      if (cl_ntoh16(remote_base_lid) > p_ftree->lft_max_lid_ho)
+         p_ftree->lft_max_lid_ho = cl_ntoh16(remote_base_lid);
    }
 
  Exit:
@@ -2665,7 +2698,7 @@ __osm_ftree_fabric_perform_ranking(
    ftree_hca_t * p_next_hca;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking);
 
    /* Mark REVERSED rank of all the switches in the subnet. 
       Start from switches that are connected to hca's, and 
@@ -2678,7 +2711,8 @@ __osm_ftree_fabric_perform_ranking(
       if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0)
       {
          res = -1;
-         osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: "
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                 "__osm_ftree_fabric_perform_ranking: ERR AB14: "
                  "Subnet ranking failed - subnet is not FatTree");
          goto Exit;
       }
@@ -2686,7 +2720,8 @@ __osm_ftree_fabric_perform_ranking(
 
    /* calculate and set FatTree rank */
    __osm_ftree_fabric_calculate_rank(p_ftree);
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_perform_ranking: "
            "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree));
    
    /* fix ranking of the switches by reversing the ranking direction */
@@ -2695,7 +2730,8 @@ __osm_ftree_fabric_perform_ranking(
    if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
         __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
    {
-      osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, 
+              "__osm_ftree_fabric_perform_ranking: ERR AB15: "
               "Tree rank is %u (should be between %u and %u)\n",
               __osm_ftree_fabric_get_rank(p_ftree),
               FAT_TREE_MIN_RANK,
@@ -2705,7 +2741,7 @@ __osm_ftree_fabric_perform_ranking(
    }
 
   Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_perform_ranking() */
 
@@ -2722,7 +2758,7 @@ __osm_ftree_fabric_populate_ports(
    ftree_sw_t * p_next_sw;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_ports);
 
    p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
    while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
@@ -2748,7 +2784,7 @@ __osm_ftree_fabric_populate_ports(
       }
    }
  Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_populate_ports() */
 
@@ -2756,58 +2792,61 @@ __osm_ftree_fabric_populate_ports(
  ***************************************************/
 
 static int 
-__osm_ftree_do_routing(void *context)
+__osm_ftree_construct_fabric(
+   IN  void * context)
 {
    ftree_fabric_t * p_ftree = context;
    int status = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_construct_fabric);
 
-   if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 )
+   if ( cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl) < 2 )
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u switches - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
-              cl_qmap_count(&osm.subn.sw_guid_tbl));
+              cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl));
       status = -1;
       goto Exit;
    }
 
-   if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - 
-         cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2)
+   if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - 
+         cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
-              cl_qmap_count(&osm.subn.node_guid_tbl),
-              cl_qmap_count(&osm.subn.sw_guid_tbl));
+              cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl),
+              cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl));
       status = -1;
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
-           "                       |------------------------------|\n"
-           "                       |-  Starting FatTree Routing  -|\n"
-           "                       |------------------------------|\n\n");
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_construct_fabric: \n"
+           "                       |----------------------------------------|\n"
+           "                       |- Starting FatTree fabric construction -|\n"
+           "                       |----------------------------------------|\n\n");
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating FatTree switch table\n");
    /* ToDo: now that the pointer from node to switch exists,  
       no need to fill the switch table in a separate loop */
    if (__osm_ftree_fabric_populate_switches(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not fat-tree - "
               "falling back to default routing\n");
       status = -1;
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating FatTree HCA table\n");
    if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not fat-tree - "
               "falling back to default routing\n");
       status = -1;
@@ -2816,7 +2855,7 @@ __osm_ftree_do_routing(void *context)
 
    if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u HCAa - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
               cl_qmap_count(&p_ftree->hca_tbl));
@@ -2825,12 +2864,13 @@ __osm_ftree_do_routing(void *context)
    }
 
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Ranking FatTree\n");
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: Ranking FatTree\n");
+
    if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
    {
       if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
-         osm_log(&osm.log, OSM_LOG_SYS,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
                  "Fabric rank is %u (>%u) - "
                  "fat-tree routing falls back to default routing\n",
                  __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK);
@@ -2841,11 +2881,12 @@ __osm_ftree_do_routing(void *context)
    /* For each hca and switch, construct array of ports.
       This is done after the whole FatTree data structure is ready, because
       we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating HCA & switch ports\n");
    if (__osm_ftree_fabric_populate_ports(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not a fat-tree - "
               "routing falls back to default routing\n");
       status = -1;
@@ -2863,7 +2904,7 @@ __osm_ftree_do_routing(void *context)
    __osm_ftree_fabric_dump_general_info(p_ftree);
 
    /* dump full tree topology */
-   if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG))
+   if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
        __osm_ftree_fabric_dump(p_ftree);
 
    if (! __osm_ftree_fabric_validate_topology(p_ftree))
@@ -2872,46 +2913,118 @@ __osm_ftree_do_routing(void *context)
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
+           "Max LID in switch LFTs (in host order): 0x%x\n",
+           p_ftree->lft_max_lid_ho);
+
+ Exit:
+   if (status != 0)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_construct_fabric: "
+              "Clearing FatTree Fabric data structures\n");
+      __osm_ftree_fabric_clear(p_ftree);
+   }
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: \n"
+           "                       |--------------------------------------------------|\n"
+           "                       |- Done constructing FatTree fabric (status = %d) -|\n"
+           "                       |--------------------------------------------------|\n\n",
+           status);
+
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return status;
+}
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_do_routing(
+   IN  void * context)
+{
+   ftree_fabric_t * p_ftree = context;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_do_routing);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Starting FatTree routing\n");
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
            "Filling switch forwarding tables for routes to HCAs\n");
    __osm_ftree_fabric_route_to_hcas(p_ftree);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
            "Filling switch forwarding tables for switch-to-switch pathes\n");
    __osm_ftree_fabric_route_to_switches(p_ftree);
 
    /* for each switch, set its fwd table */
-   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL);
+   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, (void *)p_ftree);
 
    /* write out hca ordering file */
    __osm_ftree_fabric_dump_hca_ordering(p_ftree);
 
- Exit:
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Clearing FatTree Fabric data structures\n");
-   __osm_ftree_fabric_clear(p_ftree);
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "FatTree routing is done\n");
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
-           "                       |---------------------------------------|\n"
-           "                       |-  Done FatTree Routing (status = %d)  -|\n"
-           "                       |---------------------------------------|\n\n", status);
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return 0;
+}
 
-   OSM_LOG_EXIT(&(osm.log));
-   return status;
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_routing(
+   IN  void * context)
+{
+   int status = __osm_ftree_construct_fabric(context);
+   if (status != 0)
+      return status;
+
+   __osm_ftree_do_routing(context);
+   return 0;
 }
 
 /***************************************************
  ***************************************************/
 
+void
+ucast_mgr_dump_to_file(
+   IN  osm_ucast_mgr_t *p_mgr,
+   IN  const char *file_name,
+   IN  void (*func)(cl_map_item_t *, void *));
+
+void
+ucast_mgr_dump_lfts(
+   IN  cl_map_item_t *p_map_item,
+   void *cxt);
+
 static void 
-__osm_ftree_delete(void * context)
+__osm_ftree_dump_tables(
+   IN  void * context)
 {
-   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+   ftree_fabric_t * p_ftree = context;
    if (!p_ftree)
       return;
 
-   __osm_ftree_fabric_destroy(p_ftree);
+   ucast_mgr_dump_to_file(&p_ftree->p_osm->sm.ucast_mgr,
+                          "opensm-lfts.dump",
+                          ucast_mgr_dump_lfts);
+}
 
+/***************************************************
+ ***************************************************/
+
+static void 
+__osm_ftree_delete(
+   IN  void * context)
+{
+   if (!context)
+      return;
+   __osm_ftree_fabric_destroy((ftree_fabric_t *)context);
 }
 
 /***************************************************
@@ -2923,11 +3036,21 @@ int osm_ucast_ftree_setup(osm_opensm_t *
    if (!p_ftree)
       return -1;
 
+   p_ftree->p_osm = p_osm;
+
    p_osm->routing_engine.context = (void *)p_ftree;
-   p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing;
+   p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_routing;
+   /* ToDo: Resolve multicast routing. 
+    *       Until then lid matrices are built, despite the
+    *       fact that FatTree routing doesn't need them.
+    *       When the multicast routing will be resolved,
+    *       __osm_ftree_routing() function should be removed,
+    *       and here's how the FatTree routing will be set:
+    *  p_osm->routing_engine.build_lid_matrices = __osm_ftree_construct_fabric;
+    *  p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing;
+    */
+   p_osm->routing_engine.ucast_dump_tables = __osm_ftree_dump_tables;
    p_osm->routing_engine.delete = __osm_ftree_delete;
-   /* ToDo: fat-tree routing doesn't use min_hop tables, so we
-      shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */
    return 0;
 }
 
-- 
1.4.4.1.GIT
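
For readers following the split above, here is a minimal sketch of the two-phase pattern the patch introduces: one function constructs and validates the fabric, a second fills the forwarding tables, and a thin wrapper is registered as the single build hook. The struct and function names below are hypothetical stand-ins, not the OpenSM types; only the "fewer than two switches is not a fat-tree" check and the wrapper's control flow are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the routing-engine hook table; the real
 * OpenSM structure has more members. */
struct routing_engine_sketch {
    void *context;
    int (*ucast_build_fwd_tables)(void *context);
};

/* Hypothetical fabric state: only what this sketch needs. */
struct fabric_sketch {
    int switches;     /* number of switches discovered */
    int constructed;  /* set once the topology phase succeeded */
    int routed;       /* set once forwarding tables were filled */
};

/* Phase 1: build and validate the topology.
 * A non-zero return means "not a fat-tree, fall back to default routing". */
static int construct_fabric(void *context)
{
    struct fabric_sketch *f = context;
    assert(f != NULL);
    if (f->switches < 2)
        return -1;
    f->constructed = 1;
    return 0;
}

/* Phase 2: fill forwarding tables; assumes phase 1 already succeeded. */
static int do_routing(void *context)
{
    struct fabric_sketch *f = context;
    assert(f != NULL && f->constructed);
    f->routed = 1;
    return 0;
}

/* Wrapper registered as the build hook: route only if construction worked. */
static int routing(void *context)
{
    int status = construct_fabric(context);
    if (status != 0)
        return status;
    return do_routing(context);
}

/* Registers the wrapper, mirroring what the setup function does in spirit. */
static void setup(struct routing_engine_sketch *re, struct fabric_sketch *f)
{
    re->context = f;
    re->ucast_build_fwd_tables = routing;
}
```

Keeping the two phases as separate functions is what would later allow them to be registered on two different hooks, as the ToDo comment in the patch suggests.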


From sashak at voltaire.com  Tue Dec 19 12:30:44 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Tue, 19 Dec 2006 22:30:44 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <45883EF4.1050705@dev.mellanox.co.il>
References: <45883EF4.1050705@dev.mellanox.co.il>
Message-ID: <20061219203044.GE19795@sashak.voltaire.com>

On 21:35 Tue 19 Dec     , Yevgeny Kliteynik wrote:
> Hi Hal
> 
> Adding max_lid_ho field to osm_switch_t to allow routing
> engines that don't use lid matrices to explicitly set the
> max lid (in host order) that is reachable from the switch.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---

Looks fine. Small comments below.

> @@ -793,6 +797,8 @@ static inline uint16_t
>  osm_switch_get_max_lid_ho(
>  	IN const osm_switch_t* const p_sw )
>  {
> +	if (p_sw->max_lid_ho != 0)
> +		return p_sw->max_lid_ho;
>  	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
>  }

What do you think about removing osm_lid_matrix_get_max_lid_ho()
altogether?

> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>  * SEE ALSO
>  *********/
>  
> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
> +* NAME
> +*	osm_switch_set_max_lid_ho
> +*
> +* DESCRIPTION
> +*	Set the maximum LID (host order) value accessed from this switch
> +* SYNOPSIS
> +*/
> +static inline void
> +osm_switch_set_max_lid_ho(
> +	IN osm_switch_t* const p_sw,
> +	IN uint16_t max_lid_ho )
> +{
> +	p_sw->max_lid_ho = max_lid_ho;
> +}
> +/*
> +* PARAMETERS
> +*	p_sw
> +*		[in] Pointer to a switch object.
> +*
> +*	max_lid_ho
> +*		Max LID (host order) value accessed from this switch
> +*
> +* RETURN VALUES
> +*	None.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +

Do we need those +31 lines of code instead of just
p_sw->max_lid_ho = N; ?

Sasha


From mst at mellanox.co.il  Tue Dec 19 13:04:54 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 19 Dec 2006 23:04:54 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45882070.8040101@dev.mellanox.co.il>
References: <4587DD0B.1030403@voltaire.com>
	<20061219131625.GE30743@mellanox.co.il>
	<45882070.8040101@dev.mellanox.co.il>
Message-ID: <20061219210454.GB9321@mellanox.co.il>

> Michael S. Tsirkin wrote:
> >> The problems i see with the current approach are:
> >>
> >> 1) there are three patches
> > 
> > Not really, cma_tavor_quirk.patch is the *only* relevant tavor patch.
> > It is not 100% but the only work around for proprietary SMs.
> > Fixing the SA is a full solution.  We (Mellanox) will work with SA vendors to
> > get this addressed.  But of course this takes time.
> > 
> >> 2) of them, the cma-tavor-quirk is broken (see *** below) in its design
> >> since it assumes the opensm-tavor-quirk and it would not work with 
> >> opensm that does not have it nor with 3rd party/commercial SMs which do 
> >> not have similar quirk
> > 
> > cma-tavor-quirk in OFED 1.1 is broken but not by design -
> > the patch I posted recently fixes the bug and should work with any compliant SM.
> > I did not look at the opensm code specifically, but the
> > "15.2.5.16 PATHRECORD" is quite explicit in its requirements:
> > 
> > MtuSelector 2 432 In a query request:
> >                      3-largest MTU available
> >                   If MTU is specified (i.e., the ComponentMask bit for
> >                   MTU is 1):
> >                      0-greater than MTU specified
> >                      1-less than MTU specified
> >                      2-exactly the MTU specified
> > 
> > So if e.g. opensm does not comply (e.g. it is not returning a path where one exists)
> > we should simply fix it. If there are other broken SMs, we can look at how they
> > are broken and how to solve this.
>  
> OSM implementation in this case matches the IB spec. 
> On page 905, table 207, there's an example of such a 
> request: 
> 	Required MTU = 4 (2048)
> 	Required MTUSelector = 1 ('less-than')
> And then it is explained that the required path records
> should have MTU of 1024 or lower.
> 
> OSM implementation converts these rules to code AS IS.

In this example, since everyone must support a 2K MTU,
will opensm return a path, or fail the query?
If it fails the query it seems opensm violates the spec and
needs to be fixed.

And of course the MTU in path record query response must be 1K
or lower.
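
To make the point concrete, here is a rough sketch of the selector logic as I read it. This is not opensm code: the function name and enums are made up for illustration, and it assumes a path that can carry a given MTU can also be used at any smaller MTU. The MTU encodings (1 = 256 ... 5 = 4096 bytes) are the standard IB ones, and the selector values are the ones quoted from the spec above.

```c
#include <assert.h>
#include <stdint.h>

/* IB MTU encodings: 1=256, 2=512, 3=1024, 4=2048, 5=4096 bytes. */
enum { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* MtuSelector values as quoted from 15.2.5.16 PATHRECORD. */
enum { SEL_GREATER = 0, SEL_LESS = 1, SEL_EXACT = 2, SEL_LARGEST = 3 };

/*
 * Hypothetical helper: path_cap is the largest encoded MTU the path can
 * carry. Returns the encoded MTU to report in the response, or 0 if no
 * MTU on this path satisfies the request.
 */
static uint8_t select_mtu(uint8_t selector, uint8_t required, uint8_t path_cap)
{
    assert(path_cap >= MTU_256 && path_cap <= MTU_4096);

    switch (selector) {
    case SEL_GREATER:
        /* need something strictly larger than required */
        return path_cap > required ? path_cap : 0;
    case SEL_LESS:
        /* nothing is smaller than the smallest encoding */
        if (required <= MTU_256)
            return 0;
        /* clamp down to just below required if the path can do more */
        return path_cap < required ? path_cap : (uint8_t)(required - 1);
    case SEL_EXACT:
        return path_cap >= required ? required : 0;
    case SEL_LARGEST:
        return path_cap;
    default:
        return 0;
    }
}
```

With this reading, a "less than 2048" request on a path that supports 2048 yields a path at 1024 rather than a failed query.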

-- 
MST


From kliteyn at dev.mellanox.co.il  Tue Dec 19 13:03:28 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 23:03:28 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <20061219203044.GE19795@sashak.voltaire.com>
References: <45883EF4.1050705@dev.mellanox.co.il>
	<20061219203044.GE19795@sashak.voltaire.com>
Message-ID: <458853A0.9060909@dev.mellanox.co.il>


Sasha Khapyorsky wrote:
> On 21:35 Tue 19 Dec     , Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> Adding max_lid_ho field to osm_switch_t to allow routing
>> engines that don't use lid matrices to explicitly set the
>> max lid (in host order) that is reachable from the switch.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
> 
> Looks fine. Small comments below.
> 
>> @@ -793,6 +797,8 @@ static inline uint16_t
>>  osm_switch_get_max_lid_ho(
>>  	IN const osm_switch_t* const p_sw )
>>  {
>> +	if (p_sw->max_lid_ho != 0)
>> +		return p_sw->max_lid_ho;
>>  	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
>>  }
> 
> What do you think about removing osm_lid_matrix_get_max_lid_ho()
> altogether?

Basically, I have no objection to this. We just have to update 
the switch.max_lid_ho in the default, updn and file routings.
 
>> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>>  * SEE ALSO
>>  *********/
>>  
>> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
>> +* NAME
>> +*	osm_switch_set_max_lid_ho
>> +*
>> +* DESCRIPTION
>> +*	Set the maximum LID (host order) value accessed from this switch
>> +* SYNOPSIS
>> +*/
>> +static inline void
>> +osm_switch_set_max_lid_ho(
>> +	IN osm_switch_t* const p_sw,
>> +	IN uint16_t max_lid_ho )
>> +{
>> +	p_sw->max_lid_ho = max_lid_ho;
>> +}
>> +/*
>> +* PARAMETERS
>> +*	p_sw
>> +*		[in] Pointer to a switch object.
>> +*
>> +*	max_lid_ho
>> +*		Max LID (host order) value accessed from this switch
>> +*
>> +* RETURN VALUES
>> +*	None.
>> +*
>> +* NOTES
>> +*
>> +* SEE ALSO
>> +*********/
>> +
> 
> Do we need those +31 lines of code instead of just
> p_sw->max_lid_ho = N; ?

Since there are access functions for the rest of the fields,
I didn't want to make an exception in this case either.
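
For reference, the accessor pair boils down to the following. This is a trimmed-down sketch with hypothetical struct names standing in for osm_switch_t and the lid matrix; it assumes, as the patch does, that 0 means "no explicit value was set", so the getter falls back to the lid matrix in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the lid matrix and switch objects. */
struct lid_matrix_sketch {
    uint16_t max_lid_ho;      /* max LID (host order) known to the matrix */
};

struct switch_sketch {
    struct lid_matrix_sketch lmx;
    uint16_t max_lid_ho;      /* 0 means "not set explicitly" */
};

/* Setter: lets a routing engine that skips lid matrices record the max LID. */
static inline void switch_set_max_lid_ho(struct switch_sketch *p_sw,
                                         uint16_t max_lid_ho)
{
    assert(p_sw != NULL);
    p_sw->max_lid_ho = max_lid_ho;
}

/* Getter: prefer the explicit value, otherwise fall back to the lid matrix. */
static inline uint16_t switch_get_max_lid_ho(const struct switch_sketch *p_sw)
{
    assert(p_sw != NULL);
    if (p_sw->max_lid_ho != 0)
        return p_sw->max_lid_ho;
    return p_sw->lmx.max_lid_ho;
}
```

If the lid-matrix fallback were removed as suggested, every routing engine would have to call the setter itself, which is the update to the default, updn and file routings mentioned above.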

-- Yevgeny.
 
> Sasha
> 


From kliteyn at dev.mellanox.co.il  Tue Dec 19 13:43:48 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Tue, 19 Dec 2006 23:43:48 +0200
Subject: [openib-general] [PATCH] osm: Added FatTree routing to the osm
	manual
Message-ID: <45885D14.4090200@dev.mellanox.co.il>

Added FatTree routing to the osm manual

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/man/opensm.8 |   8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/osm/man/opensm.8 b/osm/man/opensm.8
index 316232d..225918d 100644
--- a/osm/man/opensm.8
+++ b/osm/man/opensm.8
@@ -391,7 +391,7 @@ Examples:
 
 .SH ROUTING
 .PP
-OpenSM offers two routing engines:
+OpenSM offers three routing engines:
 
 1.  Min Hop Algorithm - based on the minimum hops to each node where the
 path length is optimized.
@@ -401,6 +401,12 @@ node, but it is constrained to ranking r
 if the subnet is not a pure Fat Tree, and deadlock may occur due to a
 loop in the subnet.
 
+3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing
+for a congestion-free "shift" communication pattern.
+It should be chosen if a subnet is a symmetrical Fat Tree of any of various types,
+not just a K-ary-N-Tree: non-constant K, not fully staffed, any CBB ratio.
+Similar to UPDN, Fat Tree routing is constrained to ranking rules.
+
 OpenSM also supports a file method which can load routes from a table. See
 \'Modular Routing Engine\' for more information on this.
 
-- 
1.4.4.1.GIT


From robert.j.woodruff at intel.com  Tue Dec 19 14:26:38 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Tue, 19 Dec 2006 14:26:38 -0800
Subject: [openib-general] OFED 1.2 git tree
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C015F0553@orsmsx418.amr.corp.intel.com>


Hi Tziporet,

I took a look at the OFED 1.2 git tree, the daily builds, and the how-to wiki.
As for the OFED 1.2 tree, I was able to clone it and get it running
with the OFA userspace git tree, although I did it manually, as it is not
obvious how to use the OFED build scripts.
I did look at the "How to build OFED 1.2" wiki page and I think it could use a
bit more work: I was not able to take the daily build tarballs and easily make
a dist tarball from them, so more exact step-by-step instructions on the wiki
would be helpful.

As we discussed yesterday in the OFED meeting, for the OFED 1.2 kernel
(based on 2.6.20-rc1) you have to check out Sean and Arlin's rdma_cm
branch of the userspace code, if you are not doing so already. I also
updated the instructions on the wiki for checking out OFA code from git:

https://openib.org/tiki/tiki-index.php?page=Downloading+Code+From+the+OFA+git+Repositories

As for the local_sa cache and multicast branches of Sean's trees, they
are still based on 2.6.19. I took a quick look at porting them to the
OFED_1.2 tree based on 2.6.20-rc1, and it looks like that needs a few
more changes than I want to deal with this week while he is out on
vacation. It is probably best to wait for his return to port the code up
to the 2.6.20-rc1 code base, but I see no problem with getting this ready
for the Jan 30 feature freeze date.

woody



From halr at voltaire.com  Tue Dec 19 14:27:32 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 17:27:32 -0500
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <45883EF4.1050705@dev.mellanox.co.il>
References: <45883EF4.1050705@dev.mellanox.co.il>
Message-ID: <1166567251.4519.442.camel@hal.voltaire.com>

Hi Yevgeny,

On Tue, 2006-12-19 at 14:35, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> Adding max_lid_ho field to osm_switch_t to allow routing
> engines that don't use lid matrices to explicitly set the
> max lid (in host order) that is reachable from the switch.

One minor comment below.

> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/include/opensm/osm_switch.h |   37 +++++++++++++++++++++++++++++++++++++
>  osm/opensm/osm_switch.c         |    2 ++
>  2 files changed, 39 insertions(+), 0 deletions(-)
> 
> diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
> index 4570f61..d2089bd 100644
> --- a/osm/include/opensm/osm_switch.h
> +++ b/osm/include/opensm/osm_switch.h
> @@ -107,6 +107,7 @@ typedef struct _osm_switch
>  	ib_switch_info_t			switch_info;
>  	osm_fwd_tbl_t				fwd_tbl;
>  	osm_lid_matrix_t			lmx;
> +	uint16_t				max_lid_ho;
>  	osm_port_profile_t			*p_prof;
>  	osm_mcast_tbl_t				mcast_tbl;
>  	uint32_t				discovery_count;
> @@ -129,6 +130,9 @@ typedef struct _osm_switch
>  *		LID Matrix for this switch containing the hop count
>  *		to every LID from every port.
>  *
> +*	max_lid_ho
> +*		Max LID that is accessible from this switch
> +* 
>  *	p_pro
>  *		Pointer to array of Port Profile objects for this switch.
>  *
> @@ -793,6 +797,8 @@ static inline uint16_t
>  osm_switch_get_max_lid_ho(
>  	IN const osm_switch_t* const p_sw )
>  {
> +	if (p_sw->max_lid_ho != 0)
> +		return p_sw->max_lid_ho;
>  	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
>  }
>  /*
> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>  * SEE ALSO
>  *********/
>  
> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
> +* NAME
> +*	osm_switch_set_max_lid_ho
> +*
> +* DESCRIPTION
> +*	Set the maximum LID (host order) value accessed from this switch
> +* SYNOPSIS
> +*/
> +static inline void
> +osm_switch_set_max_lid_ho(
> +	IN osm_switch_t* const p_sw,
> +	IN uint16_t max_lid_ho )
> +{
> +	p_sw->max_lid_ho = max_lid_ho;
> +}
> +/*
> +* PARAMETERS
> +*	p_sw
> +*		[in] Pointer to a switch object.
> +*
> +*	max_lid_ho
> +*		Max LID (host order) value accessed from this switch
> +*
> +* RETURN VALUES
> +*	None.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*********/
> +
>  /****f* OpenSM: Switch/osm_switch_get_num_ports
>  * NAME
>  *	osm_switch_get_num_ports
> diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
> index 0dd3de5..4ca713a 100644
> --- a/osm/opensm/osm_switch.c
> +++ b/osm/opensm/osm_switch.c
> @@ -122,6 +122,8 @@ osm_switch_init(
>    for( port_num = 0; port_num < num_ports; port_num++ )
>      osm_port_prof_construct( &p_sw->p_prof[port_num] );
>  
> +  p_sw->max_lid_ho = 0;

This isn't really needed, is it ?

Doesn't osm_switch_construct clear this ?
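
To make the question concrete, here is a minimal, self-contained sketch of
the pattern under discussion. The demo_ types and functions are
hypothetical stand-ins, not the OpenSM definitions, and the sketch assumes
that the construct step zeroes the whole object, which is exactly what the
question above is asking about.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for osm_lid_matrix_t: only tracks a max LID. */
typedef struct demo_lid_matrix {
	uint16_t max_lid_ho;
} demo_lid_matrix_t;

/* Hypothetical stand-in for osm_switch_t with the new override field. */
typedef struct demo_switch {
	demo_lid_matrix_t lmx;
	uint16_t max_lid_ho;	/* 0 means "not set by a routing engine" */
} demo_switch_t;

/* Assumed behaviour of a construct step: clear the whole object. */
static void demo_switch_construct(demo_switch_t *p_sw)
{
	assert(p_sw != NULL);
	memset(p_sw, 0, sizeof(*p_sw));
}

/* Prefer the explicit value; otherwise fall back to the LID matrix. */
static uint16_t demo_switch_get_max_lid_ho(const demo_switch_t *p_sw)
{
	if (p_sw->max_lid_ho != 0)
		return p_sw->max_lid_ho;
	return p_sw->lmx.max_lid_ho;
}
```

If construct does clear the object as assumed here, max_lid_ho is already 0
before init runs, and the getter keeps falling back to the LID matrix until
a routing engine stores a nonzero value.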

-- Hal

> +
>   Exit:
>    return( status );
>  }


From Ashish.Batwara at lsi.com  Tue Dec 19 14:43:07 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Tue, 19 Dec 2006 15:43:07 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com>

Hi,

Here is the info that you asked for. The Subnet Manager is now up and
the port is active, but the server is not able to discover the target.
I am seeing the error "Got failed path rec status -110" on the Linux
console. Below is the output of the various commands. I am using the
following to discover the target:

 
/etc/init.d/opensmd start
/etc/init.d/openibd start
modprobe ib_srp
echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
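
For reference, the string echoed above is a comma-separated list of
key=value pairs. A minimal C sketch of assembling and writing it could look
like the following; build_srp_target and write_srp_target are hypothetical
helpers (not part of srptools), and the field names are simply the ones
used in the command above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the key=value list that is written to the
 * ib_srp add_target file. Returns 0 on success, -1 if buf is too small.
 */
static int build_srp_target(char *buf, size_t len,
			    const char *id_ext, const char *ioc_guid,
			    const char *dgid, const char *pkey,
			    const char *service_id)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len,
		     "id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s",
		     id_ext, ioc_guid, dgid, pkey, service_id);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write the target string to the given path, e.g.
 * the add_target file used above. Returns 0 on success, -1 on failure.
 */
static int write_srp_target(const char *path, const char *target)
{
	size_t len = strlen(target);
	FILE *f = fopen(path, "w");
	int ok;

	if (f == NULL)
		return -1;
	ok = (fwrite(target, 1, len, f) == len);
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Building the string separately from writing it makes it easy to print and
check the exact text before it goes to the sysfs file.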

 
[root at p49 ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                         5.1.400
        node_guid:                      0002:c902:0022:cce0
        sys_image_guid:                 0002:c902:0022:cce3
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00

hca_id: mthca1
        fw_ver:                         5.1.400
        node_guid:                      0002:c902:0022:cd2c
        sys_image_guid:                 0002:c902:0022:cd2f
        vendor_id:                      0x02c9
        vendor_part_id:                 25218
        hw_ver:                         0xA0
        board_id:                       MT_0370130002
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             512 (2)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00

[root at p49 ~]# uname -a
Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

[root at p49 ~]# cat /etc/infiniband/info
#!/bin/bash

echo prefix=/usr/local/ofed
echo Kernel=2.6.9-42.0.3.ELsmp
echo
echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
--with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
--with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
--with-libsdp --with-openib-diags --with-srptools --with-mstflint
--with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
--with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
--with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
echo

OFED Version: OFED-1.1

Thanks
Ashish

-----Original Message-----
From: Eitan Zahavi [mailto:eitan at mellanox.co.il] 
Sent: Tuesday, December 19, 2006 5:18 AM
To: Batwara, Ashish
Cc: ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Hi Ashish,

SRP people say they have no such error message.
OpenSM does. So I take it back.

Ashish,
Please provide more info:

1. ibv_devinfo
2. Version of code you are using
3. Command line you use for starting opensm
4. /var/log/osm.log

Thanks and sorry for the confusion.

EZ

Eitan Zahavi wrote:
> This is not an OpenSM issue.
> Forwarded to the SRP people.
>
> EZ
> Batwara, Ashish wrote:
>
>> Hi,
>> I am trying to run opensm on Linux server. It has two HCAs (4-ports)
>> and connected to IB Switch. ibnodes command displays the information
>> about the Switch ports and HCA ports.
>> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
>> for all the 4 ports and immediately after I see "failed srp_daemon"
>> for all the ports and the displays "SM Port is down".
>>
>> I tried several times and even rebooted the server few times but no
>> luck.
>>
>> Does anybody know what this problem is?
>>
>> Thanks
>> Ashish

 

From sashak at voltaire.com  Tue Dec 19 15:05:53 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 20 Dec 2006 01:05:53 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <458853A0.9060909@dev.mellanox.co.il>
References: <45883EF4.1050705@dev.mellanox.co.il>
	<20061219203044.GE19795@sashak.voltaire.com>
	<458853A0.9060909@dev.mellanox.co.il>
Message-ID: <20061219230553.GG19795@sashak.voltaire.com>

On 23:03 Tue 19 Dec     , Yevgeny Kliteynik wrote:
>  
> >> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
> >>  * SEE ALSO
> >>  *********/
> >>  
> >> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
> >> +* NAME
> >> +*	osm_switch_set_max_lid_ho
> >> +*
> >> +* DESCRIPTION
> >> +*	Set the maximum LID (host order) value accessed from this switch
> >> +* SYNOPSIS
> >> +*/
> >> +static inline void
> >> +osm_switch_set_max_lid_ho(
> >> +	IN osm_switch_t* const p_sw,
> >> +	IN uint16_t max_lid_ho )
> >> +{
> >> +	p_sw->max_lid_ho = max_lid_ho;
> >> +}
> >> +/*
> >> +* PARAMETERS
> >> +*	p_sw
> >> +*		[in] Pointer to a switch object.
> >> +*
> >> +*	max_lid_ho
> >> +*		Max LID (host order) value accessed from this switch
> >> +*
> >> +* RETURN VALUES
> >> +*	None.
> >> +*
> >> +* NOTES
> >> +*
> >> +* SEE ALSO
> >> +*********/
> >> +
> > 
> > Do we need those +31 lines of code instead of just
> > p_sw->max_lid_ho = N; ?
> 
> Since there are access functions for the rest of the fields,
> I didn't want to make an exception in this case either.

I think you did anyway - there is no full set of access methods, and I'm
perfectly fine with that. I'm not asking you to clean up the rest, just
not to add new ones.

Sasha


From halr at voltaire.com  Tue Dec 19 15:06:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 18:06:26 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159CEA@NAMAIL2.ad.lsil.com>
Message-ID: <1166569585.4519.2439.camel@hal.voltaire.com>

Ashish,

On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> Hi,
> 
> Here is the info that you have asked. I am seeing the Subnet manager
> is up now having the port active. But server is not able to discover
> the target. I am seeing the error “Got failed path rec status -110” on
> Linux console. 

That means the request for an SA PathRecord from the initiator to the
target failed (-110 is ETIMEDOUT). Are you sure the target is up
(ACTIVE) on the subnet ? If it is, can you send the opensm log ?
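
As a quick way to check such status values, the small C sketch below maps
a kernel-style negative status to a symbolic name and the system's message
text. status_name and describe_status are hypothetical helpers covering
only a few values; the numeric value 110 for ETIMEDOUT is what common
Linux architectures such as x86 use and is not guaranteed on every
platform.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a kernel-style negative status (e.g. the
 * negated ETIMEDOUT) to a short symbolic name. Only a few values are
 * covered here.
 */
static const char *status_name(int status)
{
	int err = -status;	/* kernel code reports errors as -errno */

	if (err == ETIMEDOUT)
		return "ETIMEDOUT";
	if (err == ENOMEM)
		return "ENOMEM";
	if (err == EINVAL)
		return "EINVAL";
	return "unknown";
}

/* Format "NAME (system message)" into buf; 0 on success, -1 if too small. */
static int describe_status(int status, char *buf, size_t len)
{
	int n;

	assert(buf != NULL && len > 0);
	n = snprintf(buf, len, "%s (%s)",
		     status_name(status), strerror(-status));
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

On a system where ETIMEDOUT is 110, describe_status(-110, ...) yields the
name ETIMEDOUT followed by the C library's message for that error.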

-- Hal



From Ashish.Batwara at lsi.com  Tue Dec 19 15:22:03 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Tue, 19 Dec 2006 16:22:03 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>

Hi,
Please look towards the end of the attached file.

Thanks
Ashish

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Tuesday, December 19, 2006 5:06 PM
To: Batwara, Ashish
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Ashish,

On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> Hi,
> 
> Here is the info that you have asked. I am seeing the Subnet manager
> is up now having the port active. But server is not able to discover
> the target. I am seeing the error "Got failed path rec status -110" on
> Linux console. 

That means the request for an SA PathRecord from the initiator to the
target failed (-110 is ETIMEDOUT). Are you sure the target is up
(ACTIVE) on the subnet ? If it is, can you send the opensm log ?

-- Hal


-------------- next part --------------
A non-text attachment was scrubbed...
Name: osm.log
Type: application/octet-stream
Size: 1846569 bytes
Desc: osm.log
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061219/5d889583/attachment.obj>

From Ashish.Batwara at lsi.com  Tue Dec 19 17:12:18 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Tue, 19 Dec 2006 18:12:18 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159D2F@NAMAIL2.ad.lsil.com>

Logs from the end of the osm.log:


Dec 19 15:48:26 984523 [43204960] -> SUBNET UP
Dec 19 15:48:36 985477 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b1d) -- dropping
Dec 19 15:48:36 985538 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:48:36 985560 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:48:36 985643 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b1d
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:48:36 985728 [42803960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light sweep sampling list
Dec 19 15:48:36 985754 [42803960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:48:36 986161 [42803960] -> SUBNET UP
Dec 19 15:48:46 986814 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b22) -- dropping
Dec 19 15:48:46 986868 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:48:46 986895 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:48:46 986935 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b22
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:48:46 987025 [41401960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light sweep sampling list
Dec 19 15:48:46 987050 [41401960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:48:46 987459 [41401960] -> SUBNET UP
Dec 19 15:48:56 988475 [45007960] -> umad_receiver: ERR 5409: send completed with error (method=0x1 attr=0x11 trans_id=0x2500001b27) -- dropping
Dec 19 15:48:56 988536 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:48:56 988562 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:48:56 988601 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b27
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:48:56 988681 [41E02960] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light sweep sampling list
Dec 19 15:48:56 988706 [41E02960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:48:56 989146 [41E02960] -> SUBNET UP
Dec 19 15:49:06 990152 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b2c) --
dropping
Dec 19 15:49:06 990209 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:06 990231 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:06 990292 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b2c
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:06 990375 [43204960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:06 990399 [43204960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:06 990815 [43204960] -> SUBNET UP
Dec 19 15:49:16 991042 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b31) --
dropping
Dec 19 15:49:16 991095 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:16 991122 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:16 991174 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b31
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:16 991281 [41401960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:16 991306 [41401960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:16 991719 [41401960] -> SUBNET UP
Dec 19 15:49:26 992226 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b36) --
dropping
Dec 19 15:49:26 992280 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:26 992306 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:26 992347 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b36
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:26 992442 [42803960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:26 992468 [42803960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:26 993031 [42803960] -> SUBNET UP
Dec 19 15:49:36 995288 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b3b) --
dropping
Dec 19 15:49:36 995341 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:36 995360 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:36 995428 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b3b
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:36 995515 [43204960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:36 995538 [43204960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:36 996077 [43204960] -> SUBNET UP
Dec 19 15:49:46 995190 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b40) --
dropping
Dec 19 15:49:46 995243 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:46 995265 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:46 995308 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b40
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:46 995383 [42803960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:46 995407 [42803960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:46 995960 [42803960] -> SUBNET UP
Dec 19 15:49:56 997558 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b45) --
dropping
Dec 19 15:49:56 997609 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:49:56 997624 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:49:56 997663 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b45
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:49:56 997780 [43204960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:49:56 997805 [43204960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:49:56 998216 [43204960] -> SUBNET UP
Dec 19 15:50:06 999247 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b4a) --
dropping
Dec 19 15:50:06 999296 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:50:06 999311 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:50:06 999351 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b4a
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:50:06 999425 [42803960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:50:06 999487 [42803960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:50:06 999996 [42803960] -> SUBNET UP
Dec 19 15:50:17 003083 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b4f) --
dropping
Dec 19 15:50:17 003139 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:50:17 003159 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:50:17 003217 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b4f
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:50:17 003297 [41401960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:50:17 003360 [41401960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:50:17 003779 [41401960] -> SUBNET UP
Dec 19 15:50:27 002576 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b54) --
dropping
Dec 19 15:50:27 002663 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:50:27 002683 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:50:27 002744 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b54
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:50:27 002837 [41E02960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:50:27 002891 [41E02960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:50:27 003312 [41E02960] -> SUBNET UP
Dec 19 15:50:37 004082 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b59) --
dropping
Dec 19 15:50:37 004139 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:50:37 004162 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:50:37 004205 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b59
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:50:37 004290 [42803960] -> osm_drop_mgr_process: ERR 0108:
Unknown remote side for node 0x0002c9020022cce0 port 2. Adding to light
sweep sampling list
Dec 19 15:50:37 004315 [42803960] -> Directed Path Dump of 0 hop path:
				Path = [0]
Dec 19 15:50:37 004730 [42803960] -> SUBNET UP
Dec 19 15:50:46 205115 [42803960] -> SM port is down
Dec 19 15:50:56 206763 [42803960] -> SM port is down
Dec 19 15:50:56 206903 [42803960] -> __osm_sm_state_mgr_signal_error:
ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:06 209285 [42803960] -> SM port is down
Dec 19 15:51:06 209448 [42803960] -> __osm_sm_state_mgr_signal_error:
ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:16 209877 [41E02960] -> SM port is down
Dec 19 15:51:16 210032 [41E02960] -> __osm_sm_state_mgr_signal_error:
ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:26 210935 [41401960] -> SM port is down
Dec 19 15:51:26 211100 [41401960] -> __osm_sm_state_mgr_signal_error:
ERR 3207: Invalid signal OSM_SM_SIGNAL_DISCOVER in state
IB_SMINFO_STATE_DISCOVERING
Dec 19 15:51:36 214582 [41E02960] -> Entering MASTER state
Dec 19 15:51:36 228305 [42803960] -> SUBNET UP
Dec 19 15:51:36 992447 [41E02960] -> __osm_trap_rcv_process_request:
Received Generic Notice type:0x04 num:144 Producer:1 from LID:0x0009
TID:0x0000000000000003
Dec 19 15:51:36 992663 [41E02960] -> osm_report_notice: Reporting
Generic Notice type:4 num:144 from LID:0x0009
GID:0xfe80000000000000,0x0002c9020022cd26
Dec 19 15:51:36 994495 [41401960] -> SUBNET UP
Dec 19 15:51:47 014297 [45007960] -> umad_receiver: ERR 5409: send
completed with error (method=0x1 attr=0x11 trans_id=0x2500001b89) --
dropping
Dec 19 15:51:47 014371 [45007960] -> umad_receiver: ERR 5411: DR SMP
Dec 19 15:51:47 014386 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
3113: MAD completed in error (IB_TIMEOUT)
Dec 19 15:51:47 014426 [45007960] -> SMP dump:
				base_ver................0x1
				mgmt_class..............0x81
				class_ver...............0x1
				method..................0x1 (SubnGet)
				D bit...................0x0
				status..................0x0
				hop_ptr.................0x0
				hop_count...............0x1
				trans_id................0x1b89
				attr_id.................0x11 (NodeInfo)
				resv....................0x0
				attr_mod................0x0
				m_key...................0x0000000000000000
				dr_slid.................0xFFFF
				dr_dlid.................0xFFFF

				Initial path: [0][2]
				Return path:  [0][0]
				Reserved:     [0][0][0][0][0][0][0]

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

				00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00

Dec 19 15:51:47 014531 [41E02960] -> osm_report_notice: Reporting
Generic Notice type:3 num:65 from LID:0x0001
GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:47 014552 [41E02960] -> Removed port with
GUID:0x0002c9020022cd26 LID range [0x9,0x9] of node:Native Infiniband
Storage - LSI Logic, Engenio Storage Group
Dec 19 15:51:47 014570 [41E02960] -> osm_report_notice: Reporting
Generic Notice type:3 num:65 from LID:0x0001
GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:47 014586 [41E02960] -> Removed port with
GUID:0x0002c9020022cce2 LID range [0x1,0x1] of node:p49 HCA-1
Dec 19 15:51:47 014658 [41E02960] -> __osm_lid_mgr_process_our_sm_node:
ERR 0308: Can't acquire SM's port object, GUID = 0x0002c9020022cce2
Dec 19 15:51:47 015001 [41E02960] -> SUBNET UP
Dec 19 15:51:51 371737 [41401960] -> osm_pr_rcv_process: ERR 1F16:
Cannot find requester physical port
Dec 19 15:51:56 216932 [41401960] -> osm_report_notice: Reporting
Generic Notice type:3 num:64 from LID:0x0001
GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:56 217034 [41401960] -> Discovered new port with
GUID:0x0002c9020022cce2 LID range [0x1,0x1] of node:p49 HCA-1
Dec 19 15:51:56 217045 [41401960] -> osm_report_notice: Reporting
Generic Notice type:3 num:64 from LID:0x0001
GID:0xfe80000000000000,0x0002c9020022cce2
Dec 19 15:51:56 217122 [41401960] -> Discovered new port with
GUID:0x0002c9020022cd26 LID range [0x9,0x9] of node:Native Infiniband
Storage - LSI Logic, Engenio Storage Group
Dec 19 15:51:56 217432 [41401960] -> SUBNET UP
Dec 19 15:52:06 217884 [43204960] -> SUBNET UP
Dec 19 15:52:16 222523 [42803960] -> SUBNET UP
Dec 19 15:52:26 221109 [42803960] -> SUBNET UP
Dec 19 15:52:36 222369 [42803960] -> SUBNET UP
Dec 19 15:52:46 224523 [41401960] -> SUBNET UP
Dec 19 15:52:52 902536 [95AB6160] -> Exiting SM
Dec 19 15:54:17 354494 [95AB6160] -> OpenSM Rev:openib-2.0.5 OpenIB svn
Exported revision
Dec 19 17:09:20 792650 [95AB6160] -> OpenSM Rev:openib-2.0.5 OpenIB svn
Exported revision
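
A log excerpt like the one above is easier to read once the repeated errors are counted. The following is only a rough Python sketch, not part of OpenSM: the function names and regular expressions are assumptions derived from the line format visible in this excerpt (timestamp, microseconds, thread id in brackets, then "->"). It first re-joins lines that the mailer wrapped, then counts each "ERR xxxx" code and the number of "SM port is down" messages.

```python
import re
from collections import Counter

# Start of a log record as it appears in this excerpt, e.g.
#   Dec 19 15:48:56 988562 [45007960] -> __osm_sm_mad_ctrl_send_err_cb: ERR
HEADER = re.compile(r"^\w{3} +\d{1,2} \d\d:\d\d:\d\d \d{6} \[[0-9A-Fa-f]+\] -> ")
# Four hex digits after "ERR", e.g. "ERR 3113:" or "ERR 1F16:"
ERR_CODE = re.compile(r"ERR ([0-9A-Fa-f]{4}):")


def records(lines):
    """Yield one string per log record, re-joining mail-wrapped lines."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if HEADER.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            # Continuation (wrapped text or SMP dump row): append to the record.
            current += " " + line
    if current is not None:
        yield current


def summarize(lines):
    """Return (Counter of ERR codes, number of 'SM port is down' records)."""
    errors = Counter()
    port_down = 0
    for rec in records(lines):
        errors.update(ERR_CODE.findall(rec))
        if "SM port is down" in rec:
            port_down += 1
    return errors, port_down
```

Calling `summarize(open("/var/log/osm.log"))` would give the per-code totals for the whole file; the path is the one mentioned in this thread and may differ on other installs.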

-----Original Message-----
From: Batwara, Ashish 
Sent: Tuesday, December 19, 2006 5:22 PM
To: 'Hal Rosenstock'
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: RE: [openib-general] opensm

Hi,
Please look towards the end of the attached file.

Thanks
Ashish

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Tuesday, December 19, 2006 5:06 PM
To: Batwara, Ashish
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Ashish,

On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> Hi,
> 
> Here is the info that you asked for. I now see the subnet manager up,
> with the port active, but the server is not able to discover the
> target. I am seeing the error "Got failed path rec status -110" on the
> Linux console. 

That means the request for an SA PathRecord from the initiator to the
target failed (-110 is ETIMEDOUT). Are you sure the target is up
(ACTIVE) on the subnet ? If it is, can you send the opensm log ?
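
For reference, a negative status such as -110 is a negated errno value, and it can be looked up without digging through kernel headers. This is a minimal Python sketch; `describe_status` is a hypothetical helper name, and the number-to-name mapping comes from the platform it runs on (110 is ETIMEDOUT on Linux, other systems may number it differently).

```python
import errno
import os


def describe_status(status):
    """Translate a negative kernel-style status (e.g. -110) into 'NAME: text'."""
    code = -status
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    print(describe_status(-110))
```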

-- Hal

> Below is the output of the different commands. I am using the following
> to discover the target:
> 
>  
> 
> /etc/init.d/opensmd start
> 
> /etc/init.d/openibd start
> 
> modprobe ib_srp
> 
> echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
> 
>  
> 
>  
> 
> [root at p49 ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                         5.1.400
>         node_guid:                      0002:c902:0022:cce0
>         sys_image_guid:                 0002:c902:0022:cce3
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       MT_0370130002
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_ACTIVE (4)
>                         max_mtu:                2048 (4)
>                         active_mtu:             2048 (4)
>                         sm_lid:                 1
>                         port_lid:               1
>                         port_lmc:               0x00
> 
> hca_id: mthca1
>         fw_ver:                         5.1.400
>         node_guid:                      0002:c902:0022:cd2c
>         sys_image_guid:                 0002:c902:0022:cd2f
>         vendor_id:                      0x02c9
>         vendor_part_id:                 25218
>         hw_ver:                         0xA0
>         board_id:                       MT_0370130002
>         phys_port_cnt:                  2
>                 port:   1
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
>                 port:   2
>                         state:                  PORT_DOWN (1)
>                         max_mtu:                2048 (4)
>                         active_mtu:             512 (2)
>                         sm_lid:                 0
>                         port_lid:               0
>                         port_lmc:               0x00
> 
>  
> 
>  
> 
> [root at p49 ~]# uname -a
> 
> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> 
>  
> 
> [root at p49 ~]# cat /etc/infiniband/info
> 
> #!/bin/bash
> 
>  
> 
> echo prefix=/usr/local/ofed
> 
> echo Kernel=2.6.9-42.0.3.ELsmp
> 
> echo
> 
> echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> 
> echo
> 
>  
> 
> OFED Version: OFED-1.1


> 
> Thanks
> 
> Ashish
> 
> -----Original Message-----
> From: Eitan Zahavi [mailto:eitan at mellanox.co.il] 
> Sent: Tuesday, December 19, 2006 5:18 AM
> To: Batwara, Ashish
> Cc: ishai at mellanox.co.il; openib-general at openib.org
> Subject: Re: [openib-general] opensm
> 
>  
> 
> Hi Ashish,
> 
>  
> 
> SRP people say they have no such error message.
> 
> OpenSM does. So I take it back.
> 
>  
> 
> Ashish,
> 
> Please provide more info:
> 
>  
> 
> 1. ibv_devinfo
> 
> 2. Version of code you are using
> 
> 3. Command line you use for starting opensm
> 
> 4. /var/log/osm.log
> 
>  
> 
> Thanks and sorry for the confusion.
> 
>  
> 
> EZ
> 
>  
> 
> Eitan Zahavi wrote:
> 
> > This is not an OpenSM issue.
> 
> > Forwarded to the SRP people.
> 
> > 
> 
> > EZ
> 
> > Batwara, Ashish wrote:
> 
> >   
> 
> >> Hi,
> >> I am trying to run opensm on a Linux server. It has two HCAs (4 ports)
> >> and is connected to an IB switch. The ibnodes command displays the
> >> information about the switch ports and HCA ports.
> >> When I start opensm, I see "Starting srp_daemon" in /var/log/messages
> >> for all 4 ports, and immediately afterwards I see "failed srp_daemon"
> >> for all the ports, and then it displays "SM Port is down".
> >> 
> >> I tried several times and even rebooted the server a few times, but no
> >> luck.
> >> 
> >> Does anybody know what this problem is?
> >> 
> >> Thanks
> >> Ashish
> 
> >> 
> 
> >> _______________________________________________
> 
> >> openib-general mailing list
> 
> >> openib-general at openib.org
> 
> >> http://openib.org/mailman/listinfo/openib-general
> 
> >> 
> 
> >> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general
> 
> >>   
> 
> >>     
> 


From vishal at endace.com  Tue Dec 19 18:03:17 2006
From: vishal at endace.com (vishal)
Date: Wed, 20 Dec 2006 15:03:17 +1300
Subject: [openib-general] iSER target
Message-ID: <1166580197.6798.2.camel@julia.et.endace.com>

Hi,

    I would like to confirm whether the iSER target code in the gen2 branch
is functional. If so, is there a readme/installation guide available?

Thanks a lot!

Vishal


From halr at voltaire.com  Tue Dec 19 20:35:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 19 Dec 2006 23:35:00 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159CFD@NAMAIL2.ad.lsil.com>
Message-ID: <1166589299.4519.18010.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
> Hi,
> Please look towards the end of the attached file.

What options are you starting opensm with ? What is the command line ?

Also, it looks like (at least at one point) you have another SM on the
subnet. What is the make (vendor) for your switch ?

I see many "SM port is down" messages. What is going on with this port ?
Why is the physical link not LinkUp and stable ? That is the main issue
and is likely why the SubnGet of NodeInfo is not being responded to.
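
One way to watch whether the link is stable is to poll the port's logical and physical state from sysfs. The sketch below is in Python and rests on assumptions: it expects the usual Linux layout /sys/class/infiniband/<device>/ports/<n>/ with "state" and "phys_state" files holding values such as "4: ACTIVE" and "5: LinkUp", and the helper names are made up for illustration. The device and port (mthca0, port 2) are taken from the ibv_devinfo output in this thread.

```python
from pathlib import Path


def parse_state(text):
    """Split a sysfs value such as '4: ACTIVE' into (4, 'ACTIVE')."""
    number, _, name = text.strip().partition(":")
    return int(number), name.strip()


def read_port(device, port, root="/sys/class/infiniband"):
    """Read logical and physical state for one HCA port (assumed sysfs layout)."""
    base = Path(root) / device / "ports" / str(port)
    return {
        "state": parse_state((base / "state").read_text()),
        "phys_state": parse_state((base / "phys_state").read_text()),
    }


if __name__ == "__main__":
    # mthca0 port 2 is the port the SM is bound to in the output quoted here.
    print(read_port("mthca0", 2))
```

Running this in a loop every few seconds while opensm is up should show whether the port keeps dropping out of ACTIVE/LinkUp.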

-- Hal

> Thanks
> Ashish
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, December 19, 2006 5:06 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: Re: [openib-general] opensm
> 
> Ashish,
> 
> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> > Hi,
> > 
> > Here is the info that you asked for. I now see the subnet manager up,
> > with the port active, but the server is not able to discover the
> > target. I am seeing the error "Got failed path rec status -110" on the
> > Linux console. 
> 
> That means the request for an SA PathRecord from the initiator to the
> target failed (-110 is ETIMEDOUT). Are you sure the target is up
> (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
> 
> -- Hal
> 
> > Below is the output of the different commands. I am using the following
> > to discover the target:
> > 
> >  
> > 
> > /etc/init.d/opensmd start
> > 
> > /etc/init.d/openibd start
> > 
> > modprobe ib_srp
> > 
> > echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
> > 
> >  
> > 
> >  
> > 
> > [root at p49 ~]# ibv_devinfo
> > hca_id: mthca0
> >         fw_ver:                         5.1.400
> >         node_guid:                      0002:c902:0022:cce0
> >         sys_image_guid:                 0002:c902:0022:cce3
> >         vendor_id:                      0x02c9
> >         vendor_part_id:                 25218
> >         hw_ver:                         0xA0
> >         board_id:                       MT_0370130002
> >         phys_port_cnt:                  2
> >                 port:   1
> >                         state:                  PORT_DOWN (1)
> >                         max_mtu:                2048 (4)
> >                         active_mtu:             512 (2)
> >                         sm_lid:                 0
> >                         port_lid:               0
> >                         port_lmc:               0x00
> > 
> >                 port:   2
> >                         state:                  PORT_ACTIVE (4)
> >                         max_mtu:                2048 (4)
> >                         active_mtu:             2048 (4)
> >                         sm_lid:                 1
> >                         port_lid:               1
> >                         port_lmc:               0x00
> > 
> > hca_id: mthca1
> >         fw_ver:                         5.1.400
> >         node_guid:                      0002:c902:0022:cd2c
> >         sys_image_guid:                 0002:c902:0022:cd2f
> >         vendor_id:                      0x02c9
> >         vendor_part_id:                 25218
> >         hw_ver:                         0xA0
> >         board_id:                       MT_0370130002
> >         phys_port_cnt:                  2
> >                 port:   1
> >                         state:                  PORT_DOWN (1)
> >                         max_mtu:                2048 (4)
> >                         active_mtu:             512 (2)
> >                         sm_lid:                 0
> >                         port_lid:               0
> >                         port_lmc:               0x00
> > 
> >                 port:   2
> >                         state:                  PORT_DOWN (1)
> >                         max_mtu:                2048 (4)
> >                         active_mtu:             512 (2)
> >                         sm_lid:                 0
> >                         port_lid:               0
> >                         port_lmc:               0x00
> > 
> >  
> > 
> >  
> > 
> > [root at p49 ~]# uname -a
> > 
> > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> > EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> > 
> >  
> > 
> > [root at p49 ~]# cat /etc/infiniband/info
> > 
> > #!/bin/bash
> > 
> >  
> > 
> > echo prefix=/usr/local/ofed
> > 
> > echo Kernel=2.6.9-42.0.3.ELsmp
> > 
> > echo
> > 
> > echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> > --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> > --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> > 
> > echo
> > 
> >  
> > 
> > OFED Version: OFED-1.1
> 
> 
> 
> > 
> > Thanks
> > 
> > Ashish
> > 
> > -----Original Message-----
> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il] 
> > Sent: Tuesday, December 19, 2006 5:18 AM
> > To: Batwara, Ashish
> > Cc: ishai at mellanox.co.il; openib-general at openib.org
> > Subject: Re: [openib-general] opensm
> > 
> >  
> > 
> > Hi Ashish,
> > 
> >  
> > 
> > SRP people say they have no such error message.
> > 
> > OpenSM does. So I take it back.
> > 
> >  
> > 
> > Ashish,
> > 
> > Please provide more info:
> > 
> >  
> > 
> > 1. ibv_devinfo
> > 
> > 2. Version of code you are using
> > 
> > 3. Command line you use for starting opensm
> > 
> > 4. /var/log/osm.log
> > 
> >  
> > 
> > Thanks and sorry for the confusion.
> > 
> >  
> > 
> > EZ
> > 
> >  
> > 
> > Eitan Zahavi wrote:
> > 
> > > This is not an OpenSM issue.
> > 
> > > Forwarded to the SRP people.
> > 
> > > 
> > 
> > > EZ
> > 
> > > Batwara, Ashish wrote:
> > 
> > >   
> > 
> > >> Hi,
> > 
> > >> I am trying to run opensm on a Linux server. It has two HCAs
> > (4-ports) and
> > 
> > >> connected to IB Switch. ibnodes command displays the information
> > about
> > 
> > >> the Switch ports and HCA ports.
> > 
> > >> When I start opensm, I see in /var/log/messages "Starting
> > srp_daemon"
> > 
> > >> for all the 4 ports and immediately after I see "failed srp_daemon"
> > for
> > 
> > >> all the ports, and then it displays "SM Port is down".
> > 
> > >> 
> > 
> > >> I tried several times and even rebooted the server a few times but no
> > 
> > >> luck.
> > 
> > >> 
> > 
> > >> Does anybody know what this problem is?
> > 
> > >> 
> > 
> > >> Thanks
> > 
> > >> Ashish
> > 
> > >> 
> > 
> > >> _______________________________________________
> > 
> > >> openib-general mailing list
> > 
> > >> openib-general at openib.org
> > 
> > >> http://openib.org/mailman/listinfo/openib-general
> > 
> > >> 
> > 
> > >> To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general
> > 
> > >>   
> > 
> > >>     
> > 
> > > 
> > 
> > > 
> > 
> > 
> >  
> > 
> > 
> > 
> 


From ogerlitz at voltaire.com  Tue Dec 19 23:48:09 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 20 Dec 2006 09:48:09 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061219160221.GE3428@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il>
Message-ID: <4588EAB9.6080106@voltaire.com>

Michael S. Tsirkin wrote:
> I am not yet sure what is best for upstream, so I don't really think we need
> any RFCs.

> We'll need data from SM guys on whether MTU selector actually works
> in SMs, and if not what happens when you enable it.

Eitan,

Can you please post here the tavor-quirk patch that was integrated into 
opensm? I can see the opensm ***code***, but without the patch itself I 
might make wrong assumptions or misunderstand the change.

Or.


From kliteyn at dev.mellanox.co.il  Wed Dec 20 00:48:53 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:48:53 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <1166567251.4519.442.camel@hal.voltaire.com>
References: <45883EF4.1050705@dev.mellanox.co.il>
	<1166567251.4519.442.camel@hal.voltaire.com>
Message-ID: <4588F8F5.70007@dev.mellanox.co.il>


Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Tue, 2006-12-19 at 14:35, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> Adding max_lid_ho field to osm_switch_t to allow routing
>> engines that don't use lid matrices to explicitly set the
>> max lid (in host order) that is reachable from the switch.
> 
> One minor comment below.
> 
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/include/opensm/osm_switch.h |   37 +++++++++++++++++++++++++++++++++++++
>>  osm/opensm/osm_switch.c         |    2 ++
>>  2 files changed, 39 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
>> index 4570f61..d2089bd 100644
>> --- a/osm/include/opensm/osm_switch.h
>> +++ b/osm/include/opensm/osm_switch.h
>> @@ -107,6 +107,7 @@ typedef struct _osm_switch
>>  	ib_switch_info_t			switch_info;
>>  	osm_fwd_tbl_t				fwd_tbl;
>>  	osm_lid_matrix_t			lmx;
>> +	uint16_t				max_lid_ho;
>>  	osm_port_profile_t			*p_prof;
>>  	osm_mcast_tbl_t				mcast_tbl;
>>  	uint32_t				discovery_count;
>> @@ -129,6 +130,9 @@ typedef struct _osm_switch
>>  *		LID Matrix for this switch containing the hop count
>>  *		to every LID from every port.
>>  *
>> +*	max_lid_ho
>> +*		Max LID that is accessible from this switch
>> +* 
>>  *	p_pro
>>  *		Pointer to array of Port Profile objects for this switch.
>>  *
>> @@ -793,6 +797,8 @@ static inline uint16_t
>>  osm_switch_get_max_lid_ho(
>>  	IN const osm_switch_t* const p_sw )
>>  {
>> +	if (p_sw->max_lid_ho != 0)
>> +		return p_sw->max_lid_ho;
>>  	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
>>  }
>>  /*
>> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>>  * SEE ALSO
>>  *********/
>>  
>> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
>> +* NAME
>> +*	osm_switch_set_max_lid_ho
>> +*
>> +* DESCRIPTION
>> +*	Set the maximum LID (host order) value accessed from this switch
>> +* SYNOPSIS
>> +*/
>> +static inline void
>> +osm_switch_set_max_lid_ho(
>> +	IN osm_switch_t* const p_sw,
>> +	IN uint16_t max_lid_ho )
>> +{
>> +	p_sw->max_lid_ho = max_lid_ho;
>> +}
>> +/*
>> +* PARAMETERS
>> +*	p_sw
>> +*		[in] Pointer to a switch object.
>> +*
>> +*	max_lid_ho
>> +*		Max LID (host order) value accessed from this switch
>> +*
>> +* RETURN VALUES
>> +*	None.
>> +*
>> +* NOTES
>> +*
>> +* SEE ALSO
>> +*********/
>> +
>>  /****f* OpenSM: Switch/osm_switch_get_num_ports
>>  * NAME
>>  *	osm_switch_get_num_ports
>> diff --git a/osm/opensm/osm_switch.c b/osm/opensm/osm_switch.c
>> index 0dd3de5..4ca713a 100644
>> --- a/osm/opensm/osm_switch.c
>> +++ b/osm/opensm/osm_switch.c
>> @@ -122,6 +122,8 @@ osm_switch_init(
>>    for( port_num = 0; port_num < num_ports; port_num++ )
>>      osm_port_prof_construct( &p_sw->p_prof[port_num] );
>>  
>> +  p_sw->max_lid_ho = 0;
> 
> This isn't really needed, is it ?
> 
> Doesn't osm_switch_construct clear this ?

Right, it does.
I will issue a V2 series of patches that will address this and Sasha's 
comments.

 
> -- Hal
> 
>> +
>>   Exit:
>>    return( status );
>>  }
> 


From kliteyn at dev.mellanox.co.il  Wed Dec 20 00:49:12 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:49:12 +0200
Subject: [openib-general] [PATCH] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <20061219230553.GG19795@sashak.voltaire.com>
References: <45883EF4.1050705@dev.mellanox.co.il>
	<20061219203044.GE19795@sashak.voltaire.com>
	<458853A0.9060909@dev.mellanox.co.il>
	<20061219230553.GG19795@sashak.voltaire.com>
Message-ID: <4588F908.8050306@dev.mellanox.co.il>


Sasha Khapyorsky wrote:
> On 23:03 Tue 19 Dec     , Yevgeny Kliteynik wrote:
>>  
>>>> @@ -809,6 +815,37 @@ osm_switch_get_max_lid_ho(
>>>>  * SEE ALSO
>>>>  *********/
>>>>  
>>>> +/****f* OpenSM: Switch/osm_switch_set_max_lid_ho
>>>> +* NAME
>>>> +*	osm_switch_set_max_lid_ho
>>>> +*
>>>> +* DESCRIPTION
>>>> +*	Set the maximum LID (host order) value accessed from this switch
>>>> +* SYNOPSIS
>>>> +*/
>>>> +static inline void
>>>> +osm_switch_set_max_lid_ho(
>>>> +	IN osm_switch_t* const p_sw,
>>>> +	IN uint16_t max_lid_ho )
>>>> +{
>>>> +	p_sw->max_lid_ho = max_lid_ho;
>>>> +}
>>>> +/*
>>>> +* PARAMETERS
>>>> +*	p_sw
>>>> +*		[in] Pointer to a switch object.
>>>> +*
>>>> +*	max_lid_ho
>>>> +*		Max LID (host order) value accessed from this switch
>>>> +*
>>>> +* RETURN VALUES
>>>> +*	None.
>>>> +*
>>>> +* NOTES
>>>> +*
>>>> +* SEE ALSO
>>>> +*********/
>>>> +
>>> Do we need those +31 lines of code instead of just
>>> p_sw->max_lid_ho = N; ?
>> Since there are access functions for the rest of the fields,
>> I didn't want to make an exception in this case either.
> 
> I think you did anyway - there is no full set of access methods. I'm
> perfectly fine with that. I'm not asking you to clean up the rest, just
> not to add new ones.

You're right - the setter is not needed.
I will issue a V2 series of patches that will address this and Hal's 
comments.
 
-- Yevgeny

> Sasha
> 


From danb at voltaire.com  Wed Dec 20 00:54:16 2006
From: danb at voltaire.com (Dan Bar Dov)
Date: Wed, 20 Dec 2006 10:54:16 +0200
Subject: [openib-general] iSER target
Message-ID: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>

The iser target code in the gen2 branch is functional
over kdapl. It requires iscsi target code above it;
however, no such iscsi code is open.

It was opened as a precursor for an open-source iscsi/iser-target
project. That project is still in its early stages, and the plan is
to add iser-target support, loosely based on the open-iser-target 
code, to the stgt project.

Due to the above, there is no readme/installation guide.

Dan

> -----Original Message-----
> From: openib-general-bounces at openib.org 
> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
> Sent: Wednesday, December 20, 2006 4:03 AM
> To: openib-general at openib.org
> Subject: [openib-general] iSER target
> 
> Hi,
> 
>     I would like to confirm if the iSER target code in the gen2 branch
> is functional. If yes, is there a readme/installation guide 
> available...
> 
> Thanks a lot!
> 
> Vishal
> 
> 


From kliteyn at dev.mellanox.co.il  Wed Dec 20 00:51:56 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:51:56 +0200
Subject: [openib-general] [PATCHv2] osm: adding max_lid_ho field to
	osm_switch_t
Message-ID: <4588F9AC.5040401@dev.mellanox.co.il>

Hi Hal

[V2 of the patch - removed setter and unnecessary initialization]

Adding max_lid_ho field to osm_switch_t to allow routing
engines that don't use lid matrices to explicitly set the
max lid (in host order) that is reachable from the switch.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/opensm/osm_switch.h |   6 ++++++
 1 file changed, 6 insertions(+), 0 deletions(-)

diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
index 4570f61..d2089bd 100644
--- a/osm/include/opensm/osm_switch.h
+++ b/osm/include/opensm/osm_switch.h
@@ -107,6 +107,7 @@ typedef struct _osm_switch
 	ib_switch_info_t			switch_info;
 	osm_fwd_tbl_t				fwd_tbl;
 	osm_lid_matrix_t			lmx;
+	uint16_t				max_lid_ho;
 	osm_port_profile_t			*p_prof;
 	osm_mcast_tbl_t				mcast_tbl;
 	uint32_t				discovery_count;
@@ -129,6 +130,9 @@ typedef struct _osm_switch
 *		LID Matrix for this switch containing the hop count
 *		to every LID from every port.
 *
+*	max_lid_ho
+*		Max LID that is accessible from this switch
+* 
 *	p_pro
 *		Pointer to array of Port Profile objects for this switch.
 *
@@ -793,6 +797,8 @@ static inline uint16_t
 osm_switch_get_max_lid_ho(
 	IN const osm_switch_t* const p_sw )
 {
+	if (p_sw->max_lid_ho != 0)
+		return p_sw->max_lid_ho;
 	return( osm_lid_matrix_get_max_lid_ho( &p_sw->lmx ) );
 }
 /*
-- 
1.4.4.1.GIT
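
For readers who want to try the getter change in isolation, here is a
minimal sketch. It assumes simplified stand-in types: every sketch_* name
is hypothetical and is not part of OpenSM, and the one-field "matrix" only
mimics what osm_lid_matrix_get_max_lid_ho() would return. The point it
illustrates is the precedence rule from the patch: a non-zero max_lid_ho
set by a routing engine wins, and zero falls back to the LID matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for osm_lid_matrix_t (not the OpenSM type). */
typedef struct {
	uint16_t max_lid_ho;	/* highest LID recorded in the matrix */
} sketch_lid_matrix_t;

/* Hypothetical stand-in for osm_switch_t, reduced to two fields. */
typedef struct {
	sketch_lid_matrix_t lmx;
	uint16_t max_lid_ho;	/* 0 means "not set by a routing engine" */
} sketch_switch_t;

static inline uint16_t
sketch_lid_matrix_get_max_lid_ho(const sketch_lid_matrix_t *p_lmx)
{
	return p_lmx->max_lid_ho;
}

/* Same decision as the patched osm_switch_get_max_lid_ho(). */
static inline uint16_t
sketch_switch_get_max_lid_ho(const sketch_switch_t *p_sw)
{
	assert(p_sw != NULL);
	/* an explicitly set value takes precedence over the matrix */
	if (p_sw->max_lid_ho != 0)
		return p_sw->max_lid_ho;
	return sketch_lid_matrix_get_max_lid_ho(&p_sw->lmx);
}
```

Because zero doubles as the "unset" marker, this relies on the switch
object being zero-initialized before first use, which is the point raised
in review about osm_switch_construct.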


From kliteyn at dev.mellanox.co.il  Wed Dec 20 00:54:50 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 10:54:50 +0200
Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine
Message-ID: <4588FA5A.1070802@dev.mellanox.co.il>

Hi Hal

[V2 of the patch - not using max_lid_ho setter]

FatTree routing engine improvements:
1. Improved building of LFTs
2. Setting max lid on osm switches
3. Using ucast manager LFT dump function
4. Stopped using global variable 'osm'
5. Improved logging
6. Some cosmetics

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |  439 +++++++++++++++++++++++++++---------------
 1 files changed, 281 insertions(+), 158 deletions(-)
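
A note on items 1 and 2 above: the diff tracks the largest switch LID in
host byte order and copies only as much of the LFT as is needed, rounded up
to a multiple of 64 entries (the block length named in the code comment).
The sketch below shows that arithmetic on its own. It is an illustration
under stated assumptions, not the OpenSM code: the sketch_* names are
hypothetical, ntohs() from <arpa/inet.h> stands in for cl_ntoh16(), and the
length is held in a uint32_t here (the diff uses uint16_t) so that the
result cannot wrap for any 16-bit LID.

```c
#include <assert.h>
#include <arpa/inet.h>	/* ntohs(); stands in for cl_ntoh16() */
#include <stdint.h>

#define SKETCH_LFT_BLOCK_LEN 64	/* entries per LFT block */

/* Return the larger of the current max (host order) and a new base LID
 * that arrives in network order. */
static uint16_t
sketch_track_max_lid_ho(uint16_t cur_max_ho, uint16_t base_lid_no)
{
	uint16_t lid_ho = ntohs(base_lid_no);

	return lid_ho > cur_max_ho ? lid_ho : cur_max_ho;
}

/* Number of LFT entries to copy: LIDs 0..max_lid_ho inclusive, rounded
 * up to a whole number of 64-entry blocks. */
static uint32_t
sketch_lft_len(uint16_t max_lid_ho)
{
	uint32_t entries = (uint32_t)max_lid_ho + 1;
	uint32_t len = SKETCH_LFT_BLOCK_LEN *
	    ((entries + SKETCH_LFT_BLOCK_LEN - 1) / SKETCH_LFT_BLOCK_LEN);

	assert(len >= entries && len % SKETCH_LFT_BLOCK_LEN == 0);
	return len;
}
```

For example, a max LID of 63 needs exactly one block (64 entries), while a
max LID of 64 needs two blocks (128 entries).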

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 15e4cd0..0d7188a 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -57,9 +57,6 @@
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_switch.h>
 
-/* This var is predefined and initialized */
-extern osm_opensm_t osm;
-
 /*
  * FatTree rank is bounded between 2 and 8:
  *  - Tree of rank 1 has only trivial routing pathes,
@@ -211,14 +208,16 @@ typedef struct ftree_hca_t_ {
 
 typedef struct ftree_fabric_t_ 
 {
-   cl_qmap_t     hca_tbl;
-   cl_qmap_t     sw_tbl;
-   cl_qmap_t     sw_by_tuple_tbl;
-   uint32_t      tree_rank;
-   ftree_sw_t ** leaf_switches;
-   uint32_t      leaf_switches_num;
-   uint16_t      max_hcas_per_leaf;
-   cl_pool_t     sw_fwd_tbl_pool;
+   osm_opensm_t  * p_osm;
+   cl_qmap_t       hca_tbl;
+   cl_qmap_t       sw_tbl;
+   cl_qmap_t       sw_by_tuple_tbl;
+   uint32_t        tree_rank;
+   ftree_sw_t   ** leaf_switches;
+   uint32_t        leaf_switches_num;
+   uint16_t        max_hcas_per_leaf;
+   cl_pool_t       sw_fwd_tbl_pool;
+   uint16_t        lft_max_lid_ho;
 } ftree_fabric_t;
 
 /***************************************************
@@ -506,6 +505,7 @@ __osm_ftree_port_group_destroy(
 
 static void 
 __osm_ftree_port_group_dump(
+   IN  ftree_fabric_t *p_ftree,
    IN  ftree_port_group_t * p_group,
    IN  ftree_direction_t direction)
 {
@@ -517,7 +517,7 @@ __osm_ftree_port_group_dump(
    if (!p_group)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
       return;
 
    size = cl_ptr_vector_get_size(&p_group->ports);
@@ -533,7 +533,7 @@ __osm_ftree_port_group_dump(
       sprintf(buff + strlen(buff), "%u", p_port->port_num);
    }
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_port_group_dump:"
            "    Port Group of size %u, port(s): %s, direction: %s\n" 
            "                  Local <--> Remote GUID (LID):"
@@ -648,16 +648,17 @@ __osm_ftree_sw_destroy(
 
 static void 
 __osm_ftree_sw_dump(
+   IN  ftree_fabric_t * p_ftree,
    IN  ftree_sw_t * p_sw)
 {
    uint32_t i;
    if (!p_sw)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_sw_dump: "
            "Switch index: %s, GUID: 0x%016" PRIx64 ", Ports: %u DOWN, %u UP\n",
           __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -665,10 +666,14 @@ __osm_ftree_sw_dump(
           p_sw->down_port_groups_num, 
           p_sw->up_port_groups_num);
 
-   for( i = 0; i < p_sw->down_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_sw->down_port_groups[i], FTREE_DIRECTION_DOWN);
-   for( i = 0; i < p_sw->up_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_sw->up_port_groups[i], FTREE_DIRECTION_UP);
+   for( i = 0; i < p_sw->down_port_groups_num; i++ )
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_sw->down_port_groups[i],
+                                  FTREE_DIRECTION_DOWN);
+   for( i = 0; i < p_sw->up_port_groups_num; i++ )
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_sw->up_port_groups[i],
+                                  FTREE_DIRECTION_UP);
 
 } /* __osm_ftree_sw_dump() */
 
@@ -823,23 +828,26 @@ __osm_ftree_hca_destroy(
 
 static void 
 __osm_ftree_hca_dump(
+   IN  ftree_fabric_t * p_ftree,
    IN  ftree_hca_t * p_hca)
 {
    uint32_t i;
    if (!p_hca)
       return;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_hca_dump: "
            "HCA GUID: 0x%016" PRIx64 ", Ports: %u UP\n",
           cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)), 
           p_hca->up_port_groups_num);
 
    for( i = 0; i < p_hca->up_port_groups_num; i++ ) 
-      __osm_ftree_port_group_dump(p_hca->up_port_groups[i],FTREE_DIRECTION_UP);
+      __osm_ftree_port_group_dump(p_ftree,
+                                  p_hca->up_port_groups[i],
+                                  FTREE_DIRECTION_UP);
 }
 
 /***************************************************/
@@ -1050,6 +1058,10 @@ __osm_ftree_fabric_add_sw(ftree_fabric_t
    cl_qmap_insert(&p_ftree->sw_tbl,
                   p_osm_sw->p_node->node_info.node_guid,
                   &p_sw->map_item);
+
+   /* track the max lid (in host order) that exists in the fabric */
+   if (cl_ntoh16(p_sw->base_lid) > p_ftree->lft_max_lid_ho)
+      p_ftree->lft_max_lid_ho = cl_ntoh16(p_sw->base_lid);
 }
 
 /***************************************************/
@@ -1096,38 +1108,38 @@ __osm_ftree_fabric_dump(ftree_fabric_t *
    ftree_hca_t * p_hca;
    ftree_sw_t * p_sw;
 
-   if (!osm_log_is_active(&osm.log,OSM_LOG_DEBUG))
+   if (!osm_log_is_active(&p_ftree->p_osm->log,OSM_LOG_DEBUG))
       return;
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
            "                       |-------------------------------|\n"
            "                       |-  Full fabric topology dump  -|\n"
            "                       |-------------------------------|\n\n");
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
            "__osm_ftree_fabric_dump: -- HCAs:\n");
 
    for ( p_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
          p_hca != (ftree_hca_t *)cl_qmap_end(&p_ftree->hca_tbl);
          p_hca = (ftree_hca_t *)cl_qmap_next(&p_hca->map_item) )
    {
-      __osm_ftree_hca_dump(p_hca);
+      __osm_ftree_hca_dump(p_ftree, p_hca);
    }
 
    for (i = 0; i < __osm_ftree_fabric_get_rank(p_ftree); i++)
    {
-      osm_log(&osm.log, OSM_LOG_DEBUG,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
               "__osm_ftree_fabric_dump: -- Rank %u switches\n", i);
       for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
             p_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl);
             p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
       {
          if (p_sw->rank == i)
-            __osm_ftree_sw_dump(p_sw);
+            __osm_ftree_sw_dump(p_ftree, p_sw);
       }
    }
 
-   osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_dump: \n"
            "                       |---------------------------------------|\n"
            "                       |- Full fabric topology dump completed -|\n"
            "                       |---------------------------------------|\n\n");
@@ -1143,16 +1155,18 @@ __osm_ftree_fabric_dump_general_info(
    ftree_sw_t * p_sw;
    char * addition_str;
 
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info:\n");
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "General fabric topology info\n");
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
            "============================\n");
 
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "  - FatTree rank (switches only): %u\n",
           p_ftree->tree_rank);
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_dump_general_info: "
            "  - Fabric has %u HCAs, %u switches\n",
           cl_qmap_count(&p_ftree->hca_tbl),
           cl_qmap_count(&p_ftree->sw_tbl));
@@ -1174,13 +1188,15 @@ __osm_ftree_fabric_dump_general_info(
             addition_str = " (leaf) ";
          else
             addition_str = " ";
-         osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_dump_general_info: "
-                 "  - Fabric has %u rank %u%sswitches\n",j,i,addition_str);
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+                 "__osm_ftree_fabric_dump_general_info: "
+                 "  - Fabric has %u rank %u%sswitches\n",
+                 j,i,addition_str);
    }
 
-   if (osm_log_is_active(&osm.log,OSM_LOG_VERBOSE))
+   if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_VERBOSE))
    {
-      osm_log(&osm.log, OSM_LOG_VERBOSE,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
               "__osm_ftree_fabric_dump_general_info: "
               "  - Root switches:\n");
       for ( p_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
@@ -1188,7 +1204,7 @@ __osm_ftree_fabric_dump_general_info(
             p_sw = (ftree_sw_t *)cl_qmap_next(&p_sw->map_item) )
       {
          if (p_sw->rank == 0)
-               osm_log(&osm.log, OSM_LOG_VERBOSE,
+               osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                        "__osm_ftree_fabric_dump_general_info: "
                        "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                        cl_ntoh64(osm_node_get_node_guid(osm_switch_get_node_ptr(p_sw->p_osm_sw))),
@@ -1196,15 +1212,17 @@ __osm_ftree_fabric_dump_general_info(
                        __osm_ftree_tuple_to_str(p_sw->tuple));
       }
 
-      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_dump_general_info: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_fabric_dump_general_info: "
               "  - Leaf switches (sorted by index):\n");
       for (i = 0; i < p_ftree->leaf_switches_num; i++)
       {
-            osm_log(&osm.log, OSM_LOG_VERBOSE,
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
                     "__osm_ftree_fabric_dump_general_info: "
                     "      GUID: 0x%016" PRIx64 ", LID: 0x%x, Index %s\n",
                     cl_ntoh64(osm_node_get_node_guid(
-                                 osm_switch_get_node_ptr(p_ftree->leaf_switches[i]->p_osm_sw))),
+                                 osm_switch_get_node_ptr(
+                                    p_ftree->leaf_switches[i]->p_osm_sw))),
                     cl_ntoh16(p_ftree->leaf_switches[i]->base_lid),
                     __osm_ftree_tuple_to_str(p_ftree->leaf_switches[i]->tuple));
       }
@@ -1229,15 +1247,15 @@ __osm_ftree_fabric_dump_hca_ordering(
    char * filename = "osm-ftree-ca-order.dump";
 
    snprintf(path, sizeof(path), "%s/%s", 
-            osm.subn.opt.dump_files_dir, filename);
+            p_ftree->p_osm->subn.opt.dump_files_dir, filename);
    p_hca_ordering_file = fopen(path, "w");
    if (!p_hca_ordering_file) 
    {
-      osm_log(&osm.log, OSM_LOG_ERROR,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
               "__osm_ftree_fabric_dump_hca_ordering: ERR AB01: "
               "cannot open file \'%s\': %s\n",
                filename, strerror(errno));
-      OSM_LOG_EXIT(&(osm.log));
+      OSM_LOG_EXIT(&p_ftree->p_osm->log);
       return;
    }
    
@@ -1383,9 +1401,9 @@ __osm_ftree_fabric_make_indexing(
    cl_list_t            bfs_list;
    ftree_sw_tbl_element_t * p_sw_tbl_element;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_make_indexing);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_make_indexing);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
            "Starting FatTree indexing\n");
 
    /* create array of leaf switches */
@@ -1411,8 +1429,8 @@ __osm_ftree_fabric_make_indexing(
       This fuction also adds the switch it into the switch_by_tuple table. */
    __osm_ftree_fabric_assign_first_tuple(p_ftree,p_sw);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_make_indexing: "
-           "Indexing starting point:\n"
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_fabric_make_indexing: Indexing starting point:\n"
            "                                            - Switch rank  : %u\n"
            "                                            - Switch index : %s\n"
            "                                            - Node LID     : 0x%x\n"
@@ -1537,7 +1555,7 @@ __osm_ftree_fabric_make_indexing(
          sizeof(ftree_sw_t *),       /* size of each element */
          __osm_ftree_compare_switches_by_index); /* comparator */
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_make_indexing() */
 
 /***************************************************/
@@ -1555,15 +1573,17 @@ __osm_ftree_fabric_validate_topology(
    boolean_t            res = TRUE;
    uint8_t              i;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_validate_topology);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_validate_topology);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE, "__osm_ftree_fabric_validate_topology: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_fabric_validate_topology: "
            "Validating fabric topology\n");
 
    reference_sw_arr = (ftree_sw_t **)malloc(tree_rank * sizeof(ftree_sw_t *));
    if ( reference_sw_arr == NULL )
    {
-      osm_log(&osm.log, OSM_LOG_SYS,"Fat-tree routing: Memory allocation failed\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+              "Fat-tree routing: Memory allocation failed\n");
       return FALSE;
    }
    memset(reference_sw_arr, 0, tree_rank * sizeof(ftree_sw_t *));
@@ -1587,7 +1607,8 @@ __osm_ftree_fabric_validate_topology(
 
          if ( reference_sw_arr[p_sw->rank]->up_port_groups_num != p_sw->up_port_groups_num )
          {
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_validate_topology: "
                     "ERR AB09: Different number of upward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u groups\n",
@@ -1607,7 +1628,8 @@ __osm_ftree_fabric_validate_topology(
               reference_sw_arr[p_sw->rank]->down_port_groups_num != p_sw->down_port_groups_num )
          {
             /* we're allowing some hca's to be missing */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_validate_topology: "
                     "ERR AB0A: Different number of downward port groups on switches:\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n"
                     "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u port groups\n",
@@ -1631,7 +1653,8 @@ __osm_ftree_fabric_validate_topology(
                 p_group = p_sw->up_port_groups[i];
                 if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
                 {
-                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                   osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                           "__osm_ftree_fabric_validate_topology: "
                            "ERR AB0B: Different number of ports in an upward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
@@ -1658,7 +1681,8 @@ __osm_ftree_fabric_validate_topology(
                 p_group = p_sw->down_port_groups[0];
                 if (cl_ptr_vector_get_size(&p_ref_group->ports) != cl_ptr_vector_get_size(&p_group->ports))
                 {
-                   osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
+                   osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                           "__osm_ftree_fabric_validate_topology: "
                            "ERR AB0C: Different number of ports in an downward port group on switches:\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n"
                            "       GUID 0x%016" PRIx64 ", LID 0x%x, Index %s - %u ports\n",
@@ -1679,14 +1703,16 @@ __osm_ftree_fabric_validate_topology(
    } /* end of while */
 
    if (res == TRUE)
-      osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_fabric_validate_topology: "
-                    "Fabric topology has been identified as FatTree\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_fabric_validate_topology: "
+              "Fabric topology has been identified as FatTree\n");
    else
-      osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_validate_topology: "
-                    "ERR AB0D: Fabric topology hasn't been identified as FatTree\n");
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+              "__osm_ftree_fabric_validate_topology: "
+              "ERR AB0D: Fabric topology hasn't been identified as FatTree\n");
 
    free(reference_sw_arr);
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_validate_topology() */
 
@@ -1699,8 +1725,17 @@ __osm_ftree_set_sw_fwd_table(
    IN  void *context)
 {
    ftree_sw_t * p_sw = (ftree_sw_t * const) p_map_item;
-   memcpy(osm.sm.ucast_mgr.lft_buf, p_sw->lft_buf, FTREE_FWD_TBL_LEN);
-   osm_ucast_mgr_set_fwd_table(&osm.sm.ucast_mgr,p_sw->p_osm_sw);
+   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+
+   /* calculate lft length rounded up to a multiple of 64 (block length) */ 
+   uint16_t lft_len = 64 * ((p_ftree->lft_max_lid_ho + 1 + 63) / 64);
+
+   p_sw->p_osm_sw->max_lid_ho = p_ftree->lft_max_lid_ho;
+
+   memcpy(p_ftree->p_osm->sm.ucast_mgr.lft_buf, 
+          p_sw->lft_buf, 
+          lft_len);
+   osm_ucast_mgr_set_fwd_table(&p_ftree->p_osm->sm.ucast_mgr, p_sw->p_osm_sw);
 }
 
 /***************************************************
@@ -1746,8 +1781,6 @@ __osm_ftree_fabric_route_upgoing_by_goin
    if (p_sw->down_port_groups_num == 0) 
        return;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_upgoing_by_going_down);
-
    /* foreach down-going port group (in indexing order) */
    for (i = 0; i < p_sw->down_port_groups_num; i++)
    {
@@ -1823,7 +1856,7 @@ __osm_ftree_fabric_route_upgoing_by_goin
          __osm_ftree_sw_set_fwd_table_block(p_remote_sw,
                                             cl_ntoh16(target_lid),
                                             p_min_port->remote_port_num);
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_upgoing_by_going_down: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -1855,7 +1888,6 @@ __osm_ftree_fabric_route_upgoing_by_goin
    }
    /* done scanning all the down-going port groups */
 
-   OSM_LOG_EXIT(&(osm.log));
 } /* __osm_ftree_fabric_route_upgoing_by_going_down() */
 
 /***************************************************/
@@ -1892,8 +1924,6 @@ __osm_ftree_fabric_route_downgoing_by_go
    /* we shouldn't enter here if both real_lid and main_path are false */
    CL_ASSERT(is_real_lid || is_main_path);
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_downgoing_by_going_up);
-
    /* If this switch isn't a leaf switch:
       Assign upgoing ports by stepping down, starting on THIS switch. */
    if (p_sw->rank != (__osm_ftree_fabric_get_rank(p_ftree) - 1))
@@ -1909,10 +1939,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
    /* recursion stop condition - if it's a root switch, */
    if (p_sw->rank == 0)
-   {
-      OSM_LOG_EXIT(&(osm.log));
       return;
-   }
 
    /* Find the least loaded port of all the upgoing port groups
       (in indexing order of the remote switches). */
@@ -1982,7 +2009,7 @@ __osm_ftree_fabric_route_downgoing_by_go
    {
       if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  " - Routing MAIN path for %s HCA LID 0x%x: %s --> %s\n",
                  (is_real_lid)? "real" : "DUMMY",
@@ -2000,7 +2027,7 @@ __osm_ftree_fabric_route_downgoing_by_go
                                             cl_ntoh16(target_lid),
                                             p_min_port->remote_port_num);
          p_remote_sw->lft_buf[cl_ntoh16(target_lid)] = p_min_port->remote_port_num;
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_remote_sw->tuple),
@@ -2020,10 +2047,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
    /* we're done for the third case */
    if (!is_real_lid)
-   {
-      OSM_LOG_EXIT(&(osm.log));
       return;
-   }
 
    /* What's left to do at this point:
     *
@@ -2064,7 +2088,7 @@ __osm_ftree_fabric_route_downgoing_by_go
 
       if (p_sw->rank == (__osm_ftree_fabric_get_rank(p_ftree) - 1))
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_downgoing_by_going_up: "
                  " - Routing SECONDARY path for LID 0x%x: %s --> %s\n",
                 cl_ntoh16(target_lid),
@@ -2087,7 +2111,6 @@ __osm_ftree_fabric_route_downgoing_by_go
             FALSE);      /* whether this is path to HCA that should by tracked by counters */
    }
 
-   OSM_LOG_EXIT(&(osm.log));
 } /* ftree_fabric_route_downgoing_by_going_up() */
 
 /***************************************************/
@@ -2114,7 +2137,7 @@ __osm_ftree_fabric_route_to_hcas(
    uint32_t             j;
    ib_net16_t           remote_lid;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_hcas);
 
    /* for each leaf switch (in indexing order) */
    for(i = 0; i < p_ftree->leaf_switches_num; i++)
@@ -2133,7 +2156,7 @@ __osm_ftree_fabric_route_to_hcas(
          __osm_ftree_sw_set_fwd_table_block(p_sw,
                                             cl_ntoh16(remote_lid),
                                             p_port->port_num);
-         osm_log(&osm.log, OSM_LOG_DEBUG,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
                  "__osm_ftree_fabric_route_to_hcas: "
                  "Switch %s: set path to HCA LID 0x%x through port %u\n",
                  __osm_ftree_tuple_to_str(p_sw->tuple),
@@ -2154,7 +2177,7 @@ __osm_ftree_fabric_route_to_hcas(
 
       if (p_ftree->max_hcas_per_leaf > p_sw->down_port_groups_num)
       {
-         osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
                  "Routing %u dummy HCAs\n",
                  p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
          for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++)
@@ -2171,7 +2194,7 @@ __osm_ftree_fabric_route_to_hcas(
       }
    }
    /* done going through all the leaf switches */
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_route_to_hcas() */
 
 /***************************************************/
@@ -2195,7 +2218,7 @@ __osm_ftree_fabric_route_to_switches(
    ftree_sw_t         * p_sw;
    ftree_sw_t         * p_next_sw;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_route_to_switches);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_route_to_switches);
 
    p_next_sw = (ftree_sw_t *)cl_qmap_head(&p_ftree->sw_tbl);
    while( p_next_sw != (ftree_sw_t *)cl_qmap_end(&p_ftree->sw_tbl) )
@@ -2208,7 +2231,8 @@ __osm_ftree_fabric_route_to_switches(
                                          cl_ntoh16(p_sw->base_lid),
                                          0);
 
-      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_switches: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_fabric_route_to_switches: "
               "Switch %s (LID 0x%x): routing switch-to-switch pathes\n",
               __osm_ftree_tuple_to_str(p_sw->tuple),
               cl_ntoh16(p_sw->base_lid));
@@ -2222,7 +2246,7 @@ __osm_ftree_fabric_route_to_switches(
             FALSE);         /* whether this path should by tracked by counters */
    }
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
 } /* __osm_ftree_fabric_route_to_switches() */
 
 /***************************************************
@@ -2234,18 +2258,17 @@ __osm_ftree_fabric_populate_switches(
 {
    osm_switch_t * p_osm_sw;
    osm_switch_t * p_next_osm_sw;
-   osm_opensm_t * p_osm = &osm;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_switches);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_switches);
 
-   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_osm->subn.sw_guid_tbl);
-   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_osm->subn.sw_guid_tbl) )
+   p_next_osm_sw = (osm_switch_t *)cl_qmap_head(&p_ftree->p_osm->subn.sw_guid_tbl);
+   while( p_next_osm_sw != (osm_switch_t *)cl_qmap_end(&p_ftree->p_osm->subn.sw_guid_tbl) )
    {
       p_osm_sw = p_next_osm_sw;
       p_next_osm_sw = (osm_switch_t *)cl_qmap_next(&p_osm_sw->map_item );
       __osm_ftree_fabric_add_sw(p_ftree,p_osm_sw);
    }
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return 0;
 } /* __osm_ftree_fabric_populate_switches() */
 
@@ -2258,12 +2281,11 @@ __osm_ftree_fabric_populate_hcas(
 {
    osm_node_t   * p_osm_node;
    osm_node_t   * p_next_osm_node;
-   osm_opensm_t * p_osm = &osm;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_hcas);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_hcas);
 
-   p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_osm->subn.node_guid_tbl);
-   while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_osm->subn.node_guid_tbl) )
+   p_next_osm_node = (osm_node_t *)cl_qmap_head(&p_ftree->p_osm->subn.node_guid_tbl);
+   while( p_next_osm_node != (osm_node_t *)cl_qmap_end(&p_ftree->p_osm->subn.node_guid_tbl) )
    {
       p_osm_node = p_next_osm_node;
       p_next_osm_node = (osm_node_t *)cl_qmap_next(&p_osm_node->map_item);
@@ -2278,16 +2300,17 @@ __osm_ftree_fabric_populate_hcas(
             /* all the switches added separately */
             break;
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_populate_hcas: ERR AB0E: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_populate_hcas: ERR AB0E: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_osm_node)));
-            OSM_LOG_EXIT(&(osm.log));
+            OSM_LOG_EXIT(&p_ftree->p_osm->log);
             return -1;
       }
    }
 
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return 0;
 } /* __osm_ftree_fabric_populate_hcas() */
 
@@ -2372,7 +2395,7 @@ __osm_ftree_rank_switches_from_hca(
    static uint16_t i = 0;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_rank_switches_from_hca);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca);
 
    for (i = 0; i < osm_node_get_num_physp(p_osm_node); i++)
    {
@@ -2388,7 +2411,8 @@ __osm_ftree_rank_switches_from_hca(
       {
          case IB_NODE_TYPE_CA:
             /* HCA connected directly to another HCA - not FatTree */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB0F: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_rank_switches_from_hca: ERR AB0F: "
                     "HCA conected directly to another HCA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_hca->p_osm_node)),
@@ -2405,7 +2429,8 @@ __osm_ftree_rank_switches_from_hca(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_rank_switches_from_hca: ERR AB10: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_rank_switches_from_hca: ERR AB10: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(osm_node_get_node_guid(p_remote_osm_node)),
                     ib_get_node_type_str(osm_node_get_type(p_remote_osm_node)));
@@ -2423,7 +2448,8 @@ __osm_ftree_rank_switches_from_hca(
       if (__osm_ftree_sw_ranked(p_sw) && p_sw->rank == 0)
          continue;
 
-      osm_log(&osm.log, OSM_LOG_DEBUG,"__osm_ftree_rank_switches_from_hca: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,
+              "__osm_ftree_rank_switches_from_hca: "
               "Marking rank of switch that is directly connected to HCA:\n"
               "                                            - HCA guid   : 0x%016" PRIx64 "\n"
               "                                            - Switch guid: 0x%016" PRIx64 "\n"
@@ -2435,7 +2461,7 @@ __osm_ftree_rank_switches_from_hca(
    }
 
  Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_rank_switches_from_hca() */
 
@@ -2495,7 +2521,8 @@ __osm_ftree_fabric_construct_hca_ports(
 
          case IB_NODE_TYPE_CA:
             /* HCA connected directly to another HCA - not FatTree */
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_hca_ports: ERR AB11: "
                     "HCA conected directly to another HCA: "
                     "0x%016" PRIx64 " <---> 0x%016" PRIx64 "\n",
                     cl_ntoh64(osm_node_get_node_guid(p_node)),
@@ -2508,7 +2535,8 @@ __osm_ftree_fabric_construct_hca_ports(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_hca_ports: ERR AB12: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_hca_ports: ERR AB12: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(remote_node_guid),
                     ib_get_node_type_str(remote_node_type));
@@ -2625,7 +2653,8 @@ __osm_ftree_fabric_construct_sw_ports(
             break;
 
          default:
-            osm_log(&osm.log, OSM_LOG_ERROR,"__osm_ftree_fabric_construct_sw_ports: ERR AB13: "
+            osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                    "__osm_ftree_fabric_construct_sw_ports: ERR AB13: "
                     "Node GUID 0x%016" PRIx64 " - Unknown node type: %s\n",
                     cl_ntoh64(remote_node_guid),
                     ib_get_node_type_str(remote_node_type));
@@ -2646,6 +2675,10 @@ __osm_ftree_fabric_construct_sw_ports(
             remote_node_type,                           /* remote node type */           
             p_remote_hca_or_sw,                         /* remote ftree_hca/sw object */ 
             direction);                                 /* port direction (up or down) */
+
+      /* Track the max lid (in host order) that exists in the fabric */
+      if (cl_ntoh16(remote_base_lid) > p_ftree->lft_max_lid_ho)
+         p_ftree->lft_max_lid_ho = cl_ntoh16(remote_base_lid);
    }
 
  Exit:
@@ -2665,7 +2698,7 @@ __osm_ftree_fabric_perform_ranking(
    ftree_hca_t * p_next_hca;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_perform_ranking);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_perform_ranking);
 
    /* Mark REVERSED rank of all the switches in the subnet. 
       Start from switches that are connected to hca's, and 
@@ -2678,7 +2711,8 @@ __osm_ftree_fabric_perform_ranking(
       if (__osm_ftree_rank_switches_from_hca(p_ftree,p_hca) != 0)
       {
          res = -1;
-         osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB14: "
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR,
+                 "__osm_ftree_fabric_perform_ranking: ERR AB14: "
                  "Subnet ranking failed - subnet is not FatTree");
          goto Exit;
       }
@@ -2686,7 +2720,8 @@ __osm_ftree_fabric_perform_ranking(
 
    /* calculate and set FatTree rank */
    __osm_ftree_fabric_calculate_rank(p_ftree);
-   osm_log(&osm.log, OSM_LOG_INFO,"__osm_ftree_fabric_perform_ranking: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_INFO,
+           "__osm_ftree_fabric_perform_ranking: "
            "FatTree rank is %u\n", __osm_ftree_fabric_get_rank(p_ftree));
    
    /* fix ranking of the switches by reversing the ranking direction */
@@ -2695,7 +2730,8 @@ __osm_ftree_fabric_perform_ranking(
    if ( __osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK ||
         __osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK )
    {
-      osm_log(&osm.log, OSM_LOG_ERROR, "__osm_ftree_fabric_perform_ranking: ERR AB15: "
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_ERROR, 
+              "__osm_ftree_fabric_perform_ranking: ERR AB15: "
               "Tree rank is %u (should be between %u and %u)\n",
               __osm_ftree_fabric_get_rank(p_ftree),
               FAT_TREE_MIN_RANK,
@@ -2705,7 +2741,7 @@ __osm_ftree_fabric_perform_ranking(
    }
 
   Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_perform_ranking() */
 
@@ -2722,7 +2758,7 @@ __osm_ftree_fabric_populate_ports(
    ftree_sw_t * p_next_sw;
    int res = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_fabric_populate_ports);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_fabric_populate_ports);
 
    p_next_hca = (ftree_hca_t *)cl_qmap_head(&p_ftree->hca_tbl);
    while( p_next_hca != (ftree_hca_t *)cl_qmap_end( &p_ftree->hca_tbl ) )
@@ -2748,7 +2784,7 @@ __osm_ftree_fabric_populate_ports(
       }
    }
  Exit:
-   OSM_LOG_EXIT(&(osm.log));
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
    return res;
 } /* __osm_ftree_fabric_populate_ports() */
 
@@ -2756,58 +2792,61 @@ __osm_ftree_fabric_populate_ports(
  ***************************************************/
 
 static int 
-__osm_ftree_do_routing(void *context)
+__osm_ftree_construct_fabric(
+   IN  void * context)
 {
    ftree_fabric_t * p_ftree = context;
    int status = 0;
 
-   OSM_LOG_ENTER(&(osm.log), __osm_ftree_do_routing);
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_construct_fabric);
 
-   if ( cl_qmap_count(&osm.subn.sw_guid_tbl) < 2 )
+   if ( cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl) < 2 )
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u switches - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
-              cl_qmap_count(&osm.subn.sw_guid_tbl));
+              cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl));
       status = -1;
       goto Exit;
    }
 
-   if ( (cl_qmap_count(&osm.subn.node_guid_tbl) - 
-         cl_qmap_count(&osm.subn.sw_guid_tbl)) < 2)
+   if ( (cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl) - 
+         cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl)) < 2)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u nodes (%u switches) - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
-              cl_qmap_count(&osm.subn.node_guid_tbl),
-              cl_qmap_count(&osm.subn.sw_guid_tbl));
+              cl_qmap_count(&p_ftree->p_osm->subn.node_guid_tbl),
+              cl_qmap_count(&p_ftree->p_osm->subn.sw_guid_tbl));
       status = -1;
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
-           "                       |------------------------------|\n"
-           "                       |-  Starting FatTree Routing  -|\n"
-           "                       |------------------------------|\n\n");
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_construct_fabric: \n"
+           "                       |----------------------------------------|\n"
+           "                       |- Starting FatTree fabric construction -|\n"
+           "                       |----------------------------------------|\n\n");
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating FatTree switch table\n");
    /* ToDo: now that the pointer from node to switch exists,  
       no need to fill the switch table in a separate loop */
    if (__osm_ftree_fabric_populate_switches(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not fat-tree - "
               "falling back to default routing\n");
       status = -1;
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating FatTree HCA table\n");
    if (__osm_ftree_fabric_populate_hcas(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not fat-tree - "
               "falling back to default routing\n");
       status = -1;
@@ -2816,7 +2855,7 @@ __osm_ftree_do_routing(void *context)
 
    if (cl_qmap_count(&p_ftree->hca_tbl) < 2)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric has %u HCAa - topology is not fat-tree.\n"
               "Falling back to default routing.\n",
               cl_qmap_count(&p_ftree->hca_tbl));
@@ -2825,12 +2864,13 @@ __osm_ftree_do_routing(void *context)
    }
 
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Ranking FatTree\n");
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: Ranking FatTree\n");
+
    if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
    {
       if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
-         osm_log(&osm.log, OSM_LOG_SYS,
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
                  "Fabric rank is %u (>%u) - "
                  "fat-tree routing falls back to default routing\n",
                  __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK);
@@ -2841,11 +2881,12 @@ __osm_ftree_do_routing(void *context)
    /* For each hca and switch, construct array of ports.
       This is done after the whole FatTree data structure is ready, because
       we want the ports to have pointers to ftree_{sw,hca}_t objects.*/
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
            "Populating HCA & switch ports\n");
    if (__osm_ftree_fabric_populate_ports(p_ftree) != 0)
    {
-      osm_log(&osm.log, OSM_LOG_SYS,
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
               "Fabric topology is not a fat-tree - "
               "routing falls back to default routing\n");
       status = -1;
@@ -2863,7 +2904,7 @@ __osm_ftree_do_routing(void *context)
    __osm_ftree_fabric_dump_general_info(p_ftree);
 
    /* dump full tree topology */
-   if (osm_log_is_active(&osm.log, OSM_LOG_DEBUG))
+   if (osm_log_is_active(&p_ftree->p_osm->log, OSM_LOG_DEBUG))
        __osm_ftree_fabric_dump(p_ftree);
 
    if (! __osm_ftree_fabric_validate_topology(p_ftree))
@@ -2872,46 +2913,118 @@ __osm_ftree_do_routing(void *context)
       goto Exit;
    }
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: "
+           "Max LID in switch LFTs (in host order): 0x%x\n",
+           p_ftree->lft_max_lid_ho);
+
+ Exit:
+   if (status != 0)
+   {
+      osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+              "__osm_ftree_construct_fabric: "
+             "Clearing FatTree Fabric data structures\n");
+     __osm_ftree_fabric_clear(p_ftree);
+   }
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,
+           "__osm_ftree_construct_fabric: \n"
+           "                       |--------------------------------------------------|\n"
+           "                       |- Done constructing FatTree fabric (status = %d) -|\n"
+           "                       |--------------------------------------------------|\n\n",
+           status);
+
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return status;
+}
+
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_do_routing(
+   IN  void * context)
+{
+   ftree_fabric_t * p_ftree = context;
+
+   OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_do_routing);
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "Starting FatTree routing\n");
+
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
            "Filling switch forwarding tables for routes to HCAs\n");
    __osm_ftree_fabric_route_to_hcas(p_ftree);
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
            "Filling switch forwarding tables for switch-to-switch pathes\n");
    __osm_ftree_fabric_route_to_switches(p_ftree);
 
    /* for each switch, set its fwd table */
-   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, NULL);
+   cl_qmap_apply_func(&p_ftree->sw_tbl, __osm_ftree_set_sw_fwd_table, (void *)p_ftree);
 
    /* write out hca ordering file */
    __osm_ftree_fabric_dump_hca_ordering(p_ftree);
 
- Exit:
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
-           "Clearing FatTree Fabric data structures\n");
-   __osm_ftree_fabric_clear(p_ftree);
+   osm_log(&p_ftree->p_osm->log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: "
+           "FatTree routing is done\n");
 
-   osm_log(&osm.log, OSM_LOG_VERBOSE,"__osm_ftree_do_routing: \n"
-           "                       |---------------------------------------|\n"
-           "                       |-  Done FatTree Routing (status = %d)  -|\n"
-           "                       |---------------------------------------|\n\n", status);
+   OSM_LOG_EXIT(&p_ftree->p_osm->log);
+   return 0;
+}
 
-   OSM_LOG_EXIT(&(osm.log));
-   return status;
+/***************************************************
+ ***************************************************/
+
+static int 
+__osm_ftree_routing(
+   IN  void * context)
+{
+   int status = __osm_ftree_construct_fabric(context);
+   if (status != 0)
+      return status;
+
+   __osm_ftree_do_routing(context);
+   return 0;
 }
 
 /***************************************************
  ***************************************************/
 
+void
+ucast_mgr_dump_to_file(
+   IN  osm_ucast_mgr_t *p_mgr,
+   IN  const char *file_name,
+   IN  void (*func)(cl_map_item_t *, void *));
+
+void
+ucast_mgr_dump_lfts(
+   IN  cl_map_item_t *p_map_item,
+   void *cxt);
+
 static void 
-__osm_ftree_delete(void * context)
+__osm_ftree_dump_tables(
+   IN  void * context)
 {
-   ftree_fabric_t * p_ftree = (ftree_fabric_t *)context;
+   ftree_fabric_t * p_ftree = context;
    if (!p_ftree)
       return;
 
-   __osm_ftree_fabric_destroy(p_ftree);
+   ucast_mgr_dump_to_file(&p_ftree->p_osm->sm.ucast_mgr,
+                          "opensm-lfts.dump",
+                          ucast_mgr_dump_lfts);
+}
 
+/***************************************************
+ ***************************************************/
+
+static void 
+__osm_ftree_delete(
+   IN  void * context)
+{
+   if (!context)
+      return;
+   __osm_ftree_fabric_destroy((ftree_fabric_t *)context);
 }
 
 /***************************************************
@@ -2923,11 +3036,21 @@ int osm_ucast_ftree_setup(osm_opensm_t *
    if (!p_ftree)
       return -1;
 
+   p_ftree->p_osm = p_osm;
+
    p_osm->routing_engine.context = (void *)p_ftree;
-   p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing;
+   p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_routing;
+   /* ToDo: Resolve multicast routing. 
+    *       Until then lid matrices are built, despite the
+    *       fact that FatTree routing doesn't need them.
+    *       When the multicast routing will be resolved,
+    *       __osm_ftree_routing() function should be removed,
+    *       and here's how the FatTree routing will be set:
+    *  p_osm->routing_engine.build_lid_matrices = __osm_ftree_construct_fabric;
+    *  p_osm->routing_engine.ucast_build_fwd_tables = __osm_ftree_do_routing;
+    */
+   p_osm->routing_engine.ucast_dump_tables = __osm_ftree_dump_tables;
    p_osm->routing_engine.delete = __osm_ftree_delete;
-   /* ToDo: fat-tree routing doesn't use min_hop tables, so we
-      shouldn't fill them (p_osm->routing_engine.build_lid_matrices) */
    return 0;
 }
 
-- 
1.4.4.1.GIT
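
The patch above copies only as much of the forwarding table as the
highest LID needs, rounded up to a whole 64-entry block (see
__osm_ftree_set_sw_fwd_table). A minimal standalone sketch of that
rounding follows. The names LFT_BLOCK_LEN and lft_len_for_max_lid are
hypothetical and not part of OpenSM. The sketch returns a 32-bit value,
which is an assumption made here so that the result cannot wrap for any
16-bit input; the patch itself stores the length in a uint16_t.

```c
#include <stdint.h>

/* Hypothetical name: number of entries in one LFT block. */
#define LFT_BLOCK_LEN 64u

/*
 * Round the number of LFT entries up to a multiple of the block length.
 * The table is indexed by LID starting at 0, so max_lid_ho + 1 entries
 * are needed to cover LIDs 0..max_lid_ho.
 */
static uint32_t lft_len_for_max_lid(uint16_t max_lid_ho)
{
    uint32_t entries = (uint32_t)max_lid_ho + 1u;

    return LFT_BLOCK_LEN * ((entries + LFT_BLOCK_LEN - 1u) / LFT_BLOCK_LEN);
}
```

For example, a max LID of 63 needs exactly one block (64 entries), while
a max LID of 64 needs two blocks (128 entries).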


From kliteyn at dev.mellanox.co.il  Wed Dec 20 01:09:09 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 11:09:09 +0200
Subject: [openib-general] Routing patches
Message-ID: <4588FDB5.6080909@dev.mellanox.co.il>

Hi Hal.

Just wanted to put some order in the routing-related patches.

There are four patches that are waiting to be reviewed and applied:

1. Added an option for providing dump function per routing engine
2. [v2] Adding max_lid_ho field to osm_switch_t
3. [v2] Improving FatTree routing engine
4. Added FatTree routing to the osm manual

Thanks.

-- Yevgeny


From ogerlitz at voltaire.com  Wed Dec 20 01:35:49 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 20 Dec 2006 11:35:49 +0200
Subject: [openib-general] Performance Degradation with OFED v.
	Voltaire(lustre)
In-Reply-To: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net>
References: <3F3894AC7A13B04E83CEBC95CFD3047E055B2069@idaexc03.emea.cpqcorp.net>
Message-ID: <458903F5.8030905@voltaire.com>

Bernadat, Philippe wrote:
> I checked. We apparently never go through this path (with lustre) 

Philippe,

Lustre's openib nld (o2ibnld) always goes through

rdma_resolve_route --> cma_resolve_ib_route --> cma_query_ib_route

!!!

Please add a sanity-check printk to the __init code of the rdma_cm 
module @ drivers/infiniband/core/cma.c to confirm that the code you are 
working on is actually loaded into the kernel.

But this will not help you: the Voltaire SM/SA that you are using will 
not return you a 1K MTU based on the fixed cma-tavor-quirk patch that 
Michael has sent. This also holds for the Open SM/SA when it does not 
apply a Tavor quirk of its own...

So basically, for the time being, please patch Lustre o2ibnal to set the 
MTU to 1K (either always or under some module parameter whose default is 
true), until the issue is discussed and decided on this list.
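
A minimal sketch of the interim cap described above, written as plain C
so it can be compiled on its own. The names mtu_code, cap_path_mtu,
mtu_code_to_bytes and the cap_to_1k flag are hypothetical and are not
taken from Lustre or rdma_cm; the numeric codes follow the InfiniBand
MTU encoding that the kernel's enum ib_mtu also uses (1 = 256 bytes up
to 5 = 4096 bytes). Where exactly such a cap would be applied in
o2ibnal is not shown here.

```c
#include <stdbool.h>

/* Same numeric encoding as the IB path MTU field (assumed names). */
enum mtu_code {
    MTU_256  = 1,
    MTU_512  = 2,
    MTU_1024 = 3,
    MTU_2048 = 4,
    MTU_4096 = 5
};

/*
 * Clamp the path MTU to 1K when the cap is enabled. The flag plays the
 * role of the module parameter mentioned above (default: true).
 */
static enum mtu_code cap_path_mtu(enum mtu_code path_mtu, bool cap_to_1k)
{
    if (cap_to_1k && path_mtu > MTU_1024)
        return MTU_1024;
    return path_mtu;
}

/* Convert an MTU code to bytes: 1 -> 256, 3 -> 1024, 5 -> 4096. */
static int mtu_code_to_bytes(enum mtu_code mtu)
{
    return 128 << mtu;
}
```

With the cap enabled, a 2K path MTU is reduced to 1K, and smaller MTUs
are left unchanged.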

To the best of my knowledge (Mellanox people, please correct me if I am 
wrong): basically, if you use 2K MTU for IB RC with MLX/Tavor you get a 
50% BW drop, and if you use 2K MTU for IB RC with MLX/Arbel or Sinai you 
get a 5% BW increase. And the BW drop problem holds if either of the 
parties is Tavor.

Thanks for pointing out the problem and raising the issue!

Or.


From halr at voltaire.com  Wed Dec 20 04:33:10 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 07:33:10 -0500
Subject: [openib-general] [PATCH] osm: added an option for providing
 dump function per routing engine
In-Reply-To: <45883F79.6090109@dev.mellanox.co.il>
References: <45883F79.6090109@dev.mellanox.co.il>
Message-ID: <1166617989.4519.40648.camel@hal.voltaire.com>

Hi Yevgeny,

On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> As you suggested, added an option for providing dump 
> function per routing engine.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks, Applied.

One minor question below:

>  osm/include/opensm/osm_opensm.h |    4 ++++
>  osm/opensm/osm_ucast_mgr.c      |   23 ++++++++++++++---------
>  2 files changed, 18 insertions(+), 9 deletions(-)

[snip...]

> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index e051c66..fcf6f72 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c

[snip...]

> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
>      build and download the switch forwarding tables.
>    */
>  
> -  if (!p_routing_eng->ucast_build_fwd_tables ||
> -      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
> -  {
> -    cl_qmap_apply_func( p_sw_guid_tbl,
> -                        __osm_ucast_mgr_process_tbl, p_mgr );
> -  }
> +  if ( p_routing_eng->ucast_build_fwd_tables && 
> +       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
> +     default_routing = FALSE;
> +  else
> +     cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
>  
>    /* dump fdb into file: */
>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
> -    __osm_ucast_mgr_dump_tables( p_mgr );
> +  {
> +     if ( !default_routing && p_routing_eng->ucast_dump_tables )
> +        p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> +     else
> +        __osm_ucast_mgr_dump_tables( p_mgr );
> +  }

Not sure if this is best going forward. Should it be like this:

	if ( default_routing )
	  __osm_ucast_mgr_dump_tables( p_mgr );
	else
	{
	  if ( p_routing_eng->ucast_dump_tables != 0 )
	    p_routing_eng->ucast_dump_tables(p_routing_eng->context);
	}
-- Hal


From halr at voltaire.com  Wed Dec 20 04:49:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 07:49:23 -0500
Subject: [openib-general] [PATCHv2] osm: adding max_lid_ho field to
	osm_switch_t
In-Reply-To: <4588F9AC.5040401@dev.mellanox.co.il>
References: <4588F9AC.5040401@dev.mellanox.co.il>
Message-ID: <1166618113.4519.40761.camel@hal.voltaire.com>

On Wed, 2006-12-20 at 03:51, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> [V2 of the patch - removed setter and unnecessary initialization]
> 
> Adding max_lid_ho field to osm_switch_t to allow routing
> engines that don't use lid matrices to explicitly set the
> max lid (in host order) that is reachable from the switch.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From kliteyn at dev.mellanox.co.il  Wed Dec 20 05:49:52 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 15:49:52 +0200
Subject: [openib-general] [PATCH] osm: added an option for providing
 dump function per routing engine
In-Reply-To: <1166617989.4519.40648.camel@hal.voltaire.com>
References: <45883F79.6090109@dev.mellanox.co.il>
	<1166617989.4519.40648.camel@hal.voltaire.com>
Message-ID: <45893F80.6060901@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> As you suggested, added an option for providing dump 
>> function per routing engine.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> Thanks, Applied.
> 
> One minor question below:
> 
>>  osm/include/opensm/osm_opensm.h |    4 ++++
>>  osm/opensm/osm_ucast_mgr.c      |   23 ++++++++++++++---------
>>  2 files changed, 18 insertions(+), 9 deletions(-)
> 
> [snip...]
> 
>> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
>> index e051c66..fcf6f72 100644
>> --- a/osm/opensm/osm_ucast_mgr.c
>> +++ b/osm/opensm/osm_ucast_mgr.c
> 
> [snip...]
> 
>> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
>>      build and download the switch forwarding tables.
>>    */
>>  
>> -  if (!p_routing_eng->ucast_build_fwd_tables ||
>> -      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
>> -  {
>> -    cl_qmap_apply_func( p_sw_guid_tbl,
>> -                        __osm_ucast_mgr_process_tbl, p_mgr );
>> -  }
>> +  if ( p_routing_eng->ucast_build_fwd_tables && 
>> +       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
>> +     default_routing = FALSE;
>> +  else
>> +     cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
>>  
>>    /* dump fdb into file: */
>>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
>> -    __osm_ucast_mgr_dump_tables( p_mgr );
>> +  {
>> +     if ( !default_routing && p_routing_eng->ucast_dump_tables )
>> +        p_routing_eng->ucast_dump_tables(p_routing_eng->context);
>> +     else
>> +        __osm_ucast_mgr_dump_tables( p_mgr );
>> +  }
> 
> Not sure if this is best going forward. Should it be like this:
> 
> 	if ( default_routing )
> 	  __osm_ucast_mgr_dump_tables( p_mgr );
> 	else
> 	{
> 	  if ( p_routing_eng->ucast_dump_tables != 0 )
> 	    p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> 	}

But then what if I have some routing engine that wants to use 
default dump functions, like updn?

So my approach is as follows:
 - If a routing engine wants to use default dump functions, 
   it should *not* define any dump function of its own.
 - If a routing engine does *not* want to dump anything, it 
   should define a dummy dump function of its own.

You're suggesting the following:
 - If a routing engine wants to use default dump functions, 
   it should define a dump function that calls the default function.
 - If a routing engine does *not* want to dump anything, it 
   should *not* define any dump function of its own.

I'm OK with both approaches - your call.
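
To make the difference between the two policies concrete, here is a minimal
sketch. It is not the OpenSM code: routing_engine_t below is a cut-down,
hypothetical stand-in that models only the ucast_dump_tables callback, and
dump_choice_t, dump_nothing, choose_dump_applied and choose_dump_alternative
are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the routing engine struct; only the
 * callback discussed in this thread is modeled. */
typedef struct routing_engine {
    void *context;
    void (*ucast_dump_tables)(void *context);
} routing_engine_t;

/* Which dump path would be taken. */
typedef enum dump_choice {
    DUMP_NONE,     /* nothing is dumped */
    DUMP_DEFAULT,  /* the ucast manager's default dump */
    DUMP_ENGINE    /* the routing engine's own dump function */
} dump_choice_t;

/* A dummy dump function: what an engine would define under the applied
 * patch if it does not want anything dumped. */
void dump_nothing(void *context) { (void)context; }

/* Policy of the applied patch: an engine without its own dump function
 * gets the default dump. */
dump_choice_t choose_dump_applied(int default_routing,
                                  const routing_engine_t *eng)
{
    assert(eng != NULL);
    if (!default_routing && eng->ucast_dump_tables != NULL)
        return DUMP_ENGINE;
    return DUMP_DEFAULT;
}

/* Alternative raised in review: an engine without its own dump function
 * dumps nothing. */
dump_choice_t choose_dump_alternative(int default_routing,
                                      const routing_engine_t *eng)
{
    assert(eng != NULL);
    if (default_routing)
        return DUMP_DEFAULT;
    return eng->ucast_dump_tables != NULL ? DUMP_ENGINE : DUMP_NONE;
}
```

The two functions differ only when the engine built the tables itself and
has no dump function of its own: the first picks the default dump, the
second picks nothing.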

-- Yevgeny.

> -- Hal
> 
 

From vlad at dev.mellanox.co.il  Wed Dec 20 06:22:06 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 20 Dec 2006 16:22:06 +0200
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B9A943@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B9A943@xmb-sjc-216.amer.cisco.com>
Message-ID: <4589470E.1020105@dev.mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
>> Meeting summary:
>> *1. Daily build update:*
>> Daily build is now based on kernel 2.6.20-rc1.
>>     
>
> Where is the daily build?
>
> Scott
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   
User: http://staging.openfabrics.org/builds/ofa_1_2_user/
Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/

But I see that the staging website is down for some reason...

Regards,
Vladimir


From halr at voltaire.com  Wed Dec 20 06:29:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:29:19 -0500
Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine
In-Reply-To: <4588FA5A.1070802@dev.mellanox.co.il>
References: <4588FA5A.1070802@dev.mellanox.co.il>
Message-ID: <1166624959.4519.46181.camel@hal.voltaire.com>

Hi Yevgeny,

On Wed, 2006-12-20 at 03:54, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> [V2 of the patch - not using max_lid_ho setter]
> 
> FatTree routing engine improvements:
> 1. Improved building of LFTs
> 2. Setting max lid on osm switches
> 3. Using ucast manager LFT dump function
> 4. Stopped using global variable 'osm'
> 5. Improved logging
> 6. Some cosmetics

In general, it should be one "thought" per patch but since this is so
new I will incorporate this all in one patch.

> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

One minor comment below.

> ---
>  osm/opensm/osm_ucast_ftree.c |  439 +++++++++++++++++++++++++++---------------
>  1 files changed, 281 insertions(+), 158 deletions(-)
> 
> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
> index 15e4cd0..0d7188a 100644
> --- a/osm/opensm/osm_ucast_ftree.c
> +++ b/osm/opensm/osm_ucast_ftree.c

[snip...]

> +void
> +ucast_mgr_dump_to_file(
> +   IN  osm_ucast_mgr_t *p_mgr,
> +   IN  const char *file_name,
> +   IN  void (*func)(cl_map_item_t *, void *));
> +
> +void
> +ucast_mgr_dump_lfts(
> +   IN  cl_map_item_t *p_map_item,
> +   void *cxt);
> +

Rather than declaring these here, should these go into osm_ucast_mgr.h ?

-- Hal


From halr at voltaire.com  Wed Dec 20 06:33:41 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:33:41 -0500
Subject: [openib-general] Routing patches
In-Reply-To: <4588FDB5.6080909@dev.mellanox.co.il>
References: <4588FDB5.6080909@dev.mellanox.co.il>
Message-ID: <1166624963.4519.46185.camel@hal.voltaire.com>

Hi Yevgeny,

On Wed, 2006-12-20 at 04:09, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Just wanted to put some order in the routing-related patches.
> 
> There are four patches that are waiting to be reviewed and applied:
> 
> 1. Added an option for providing dump function per routing engine
> 2. [v2] Adding max_lid_ho field to osm_switch_t
> 3. [v2] Improving FatTree routing engine
> 4. Added FatTree routing to the osm manual

All completed now. Thanks.

-- Hal

> Thanks.
> 
> -- Yevgeny


From halr at voltaire.com  Wed Dec 20 06:33:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:33:36 -0500
Subject: [openib-general] [PATCH] osm: Added FatTree routing to the osm
	manual
In-Reply-To: <45885D14.4090200@dev.mellanox.co.il>
References: <45885D14.4090200@dev.mellanox.co.il>
Message-ID: <1166624961.4519.46183.camel@hal.voltaire.com>

On Tue, 2006-12-19 at 16:43, Yevgeny Kliteynik wrote:
> Added FatTree routing to the osm manual
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

>  osm/man/opensm.8 |   8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/osm/man/opensm.8 b/osm/man/opensm.8
> index 316232d..225918d 100644
> --- a/osm/man/opensm.8
> +++ b/osm/man/opensm.8
> @@ -391,7 +391,7 @@ Examples:
>  
>  .SH ROUTING
>  .PP
> -OpenSM offers two routing engines:
> +OpenSM offers three routing engines:
>  
>  1.  Min Hop Algorithm - based on the minimum hops to each node where the
>  path length is optimized.
> @@ -401,6 +401,12 @@ node, but it is constrained to ranking r
>  if the subnet is not a pure Fat Tree, and deadlock may occur due to a
>  loop in the subnet.
>  
> +3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing 
> +for congestion-free "shift" communication pattern.
> +It should be chosen if a subnet is a symmetrical Fat Trees of various types,
> +not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio.
> +Similar to UPDN, Fat Tree routing is constrained to ranking rules.

Is there a reference or a more complete writeup of what it does ? See
the descriptions of the other algorithms for what I'm referring to here.

-- Hal

>  OpenSM also supports a file method which can load routes from a table. See
>  \'Modular Routing Engine\' for more information on this.


From halr at voltaire.com  Wed Dec 20 06:36:55 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:36:55 -0500
Subject: [openib-general] [PATCH] osm: added an option for providing
 dump function per routing engine
In-Reply-To: <45893F80.6060901@dev.mellanox.co.il>
References: <45883F79.6090109@dev.mellanox.co.il>
	<1166617989.4519.40648.camel@hal.voltaire.com>
	<45893F80.6060901@dev.mellanox.co.il>
Message-ID: <1166625414.4519.46593.camel@hal.voltaire.com>

Hi Yevgeny,

On Wed, 2006-12-20 at 08:49, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> > 
> > On Tue, 2006-12-19 at 14:37, Yevgeny Kliteynik wrote:
> >> Hi Hal
> >>
> >> As you suggested, added an option for providing dump 
> >> function per routing engine.
> >>
> >> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> > 
> > Thanks, Applied.
> > 
> > One minor question below:
> > 
> >>  osm/include/opensm/osm_opensm.h |    4 ++++
> >>  osm/opensm/osm_ucast_mgr.c      |   23 ++++++++++++++---------
> >>  2 files changed, 18 insertions(+), 9 deletions(-)
> > 
> > [snip...]
> > 
> >> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> >> index e051c66..fcf6f72 100644
> >> --- a/osm/opensm/osm_ucast_mgr.c
> >> +++ b/osm/opensm/osm_ucast_mgr.c
> > 
> > [snip...]
> > 
> >> @@ -1256,16 +1257,20 @@ osm_ucast_mgr_process(
> >>      build and download the switch forwarding tables.
> >>    */
> >>  
> >> -  if (!p_routing_eng->ucast_build_fwd_tables ||
> >> -      p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) != 0)
> >> -  {
> >> -    cl_qmap_apply_func( p_sw_guid_tbl,
> >> -                        __osm_ucast_mgr_process_tbl, p_mgr );
> >> -  }
> >> +  if ( p_routing_eng->ucast_build_fwd_tables && 
> >> +       (p_routing_eng->ucast_build_fwd_tables(p_routing_eng->context) == 0) )
> >> +     default_routing = FALSE;
> >> +  else
> >> +     cl_qmap_apply_func( p_sw_guid_tbl, __osm_ucast_mgr_process_tbl, p_mgr );
> >>  
> >>    /* dump fdb into file: */
> >>    if ( osm_log_is_active( p_mgr->p_log, OSM_LOG_ROUTING ) )
> >> -    __osm_ucast_mgr_dump_tables( p_mgr );
> >> +  {
> >> +     if ( !default_routing && p_routing_eng->ucast_dump_tables )
> >> +        p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> >> +     else
> >> +        __osm_ucast_mgr_dump_tables( p_mgr );
> >> +  }
> > 
> > Not sure if this is best going forward. Should it be like this:
> > 
> > 	if ( default_routing )
> > 	  __osm_ucast_mgr_dump_tables( p_mgr );
> > 	else
> > 	{
> > 	  if ( p_routing_eng->ucast_dump_tables != 0 )
> > 	    p_routing_eng->ucast_dump_tables(p_routing_eng->context);
> > 	}
> 
> But then what if I have some routing engine that wants to use 
> default dump functions, like updn?
> 
> So my approach is as follows:
>  - If a routing engine wants to use default dump functions, 
>    it should *not* define any dump function of its own.
>  - If a routing engine does *not* want to dump anything, it 
>    should define a dummy dump function of its own.
> 
> You're suggesting the following:
>  - If a routing engine wants to use default dump functions, 
>    it should define a dump function that calls the default function.
>  - If a routing engine does *not* want to dump anything, it 
>    should *not* define any dump function of its own.
> 
> I'm OK with both approaches - your call.

You're right. It's six of one, half a dozen of the other. Let's leave it
alone.

-- Hal

> -- Yevgeny.
> 
> > -- Hal
> > 
>  


From halr at voltaire.com  Wed Dec 20 06:38:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:38:18 -0500
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
In-Reply-To: <4589470E.1020105@dev.mellanox.co.il>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302B9A943@xmb-sjc-216.amer.cisco.com>
	<4589470E.1020105@dev.mellanox.co.il>
Message-ID: <1166625497.4519.46653.camel@hal.voltaire.com>

On Wed, 2006-12-20 at 09:22, Vladimir Sokolovsky wrote:
> But I see that staging website is down for some reason...

http/https appear not to be working. "staging" is up though.

-- Hal


From kliteyn at dev.mellanox.co.il  Wed Dec 20 06:42:29 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 16:42:29 +0200
Subject: [openib-general] [PATCHv2] osm: improving FatTree routing engine
In-Reply-To: <1166624959.4519.46181.camel@hal.voltaire.com>
References: <4588FA5A.1070802@dev.mellanox.co.il>
	<1166624959.4519.46181.camel@hal.voltaire.com>
Message-ID: <45894BD5.7040808@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> On Wed, 2006-12-20 at 03:54, Yevgeny Kliteynik wrote:
>> Hi Hal
>>
>> [V2 of the patch - not using max_lid_ho setter]
>>
>> FatTree routing engine improvements:
>> 1. Improved building of LFTs
>> 2. Setting max lid on osm switches
>> 3. Using ucast manager LFT dump function
>> 4. Stopped using global variable 'osm'
>> 5. Improved logging
>> 6. Some cosmetics
> 
> In general, it should be one "thought" per patch 
> but since this is so new
 
This is the reason why all these changes are in one patch - they 
are not appearing in the code in an incremental manner, but rather
as a bunch of changes all over the code.

> I will incorporate this all in one patch.
> 
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> Thanks. Applied.
> 
> One minor comment below.
> 
>> ---
>>  osm/opensm/osm_ucast_ftree.c |  439 +++++++++++++++++++++++++++---------------
>>  1 files changed, 281 insertions(+), 158 deletions(-)
>>
>> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
>> index 15e4cd0..0d7188a 100644
>> --- a/osm/opensm/osm_ucast_ftree.c
>> +++ b/osm/opensm/osm_ucast_ftree.c
> 
> [snip...]
> 
>> +void
>> +ucast_mgr_dump_to_file(
>> +   IN  osm_ucast_mgr_t *p_mgr,
>> +   IN  const char *file_name,
>> +   IN  void (*func)(cl_map_item_t *, void *));
>> +
>> +void
>> +ucast_mgr_dump_lfts(
>> +   IN  cl_map_item_t *p_map_item,
>> +   void *cxt);
>> +
> 
> Rather than declaring these here, should these go into osm_ucast_mgr.h ?
 
I thought about it, but was reluctant to do it because osm_ucast_mgr.h contains
only "important" functions. But now that the dump function is one of the routing
engine capabilities, I guess you're right - it's better to declare these functions
in the header file.

-- Yevgeny

> -- Hal
> 


From halr at voltaire.com  Wed Dec 20 06:47:31 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 09:47:31 -0500
Subject: [openib-general] Minor question on fat tree routing
Message-ID: <1166626050.4519.47111.camel@hal.voltaire.com>

Hi Yevgeny,

Minor question on fat tree routing:

osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:

   if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
   {
      if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)

Should < FAT_TREE_MIN_RANK also be checked there? Does it fall back
to default routing for this case too?

Thanks.

-- Hal


From kliteyn at dev.mellanox.co.il  Wed Dec 20 07:02:34 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 17:02:34 +0200
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <1166626050.4519.47111.camel@hal.voltaire.com>
References: <1166626050.4519.47111.camel@hal.voltaire.com>
Message-ID: <4589508A.30901@dev.mellanox.co.il>

Hi Hal,

Hal Rosenstock wrote:
> Hi Yevgeny,
> 
> Minor question on fat tree routing:
> 
> osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
> 
>    if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
>    {
>       if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
> 
> Should < FAT_TREE_MIN_RANK also be checked there? Does it fall back
> to default routing for this case too?

This is also checked, but as part of earlier checks in the same function:
FatTree routing will abort even before ranking the tree and fall back to the default
routing if a fabric has fewer than 2 switches.

-- Yevgeny
> 
> Thanks.
> 
> -- Hal
> 


From halr at voltaire.com  Wed Dec 20 07:15:31 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 10:15:31 -0500
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <4589508A.30901@dev.mellanox.co.il>
References: <1166626050.4519.47111.camel@hal.voltaire.com>
	<4589508A.30901@dev.mellanox.co.il>
Message-ID: <1166627731.4519.48417.camel@hal.voltaire.com>

On Wed, 2006-12-20 at 10:02, Yevgeny Kliteynik wrote:
> Hi Hal,
> 
> Hal Rosenstock wrote:
> > Hi Yevgeny,
> > 
> > Minor question on fat tree routing:
> > 
> > osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
> > 
> >    if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
> >    {
> >       if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
> > 
> > Should < FAT_TREE_MIN_RANK also be checked there? Does it fall back
> > to default routing for this case too?
> 
> This is also checked, but as part of earlier checks in the same function:
> FatTree routing will abort even before ranking the tree and fall back to the default
> routing if a fabric has fewer than 2 switches.

What about 2 or more switches but rank is 1 ? Isn't that possible too ?

-- Hal

> 
> -- Yevgeny
> > 
> > Thanks.
> > 
> > -- Hal
> > 


From kliteyn at dev.mellanox.co.il  Wed Dec 20 07:32:23 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 20 Dec 2006 17:32:23 +0200
Subject: [openib-general] Minor question on fat tree routing
In-Reply-To: <1166627731.4519.48417.camel@hal.voltaire.com>
References: <1166626050.4519.47111.camel@hal.voltaire.com>
	<4589508A.30901@dev.mellanox.co.il>
	<1166627731.4519.48417.camel@hal.voltaire.com>
Message-ID: <45895787.1080800@dev.mellanox.co.il>


Hal Rosenstock wrote:
> On Wed, 2006-12-20 at 10:02, Yevgeny Kliteynik wrote:
>> Hi Hal,
>>
>> Hal Rosenstock wrote:
>>> Hi Yevgeny,
>>>
>>> Minor question on fat tree routing:
>>>
>>> osm_ucast_ftree.c:__osm_ftree_construct_fabric has the following code:
>>>
>>>    if (__osm_ftree_fabric_perform_ranking(p_ftree) != 0)
>>>    {
>>>       if (__osm_ftree_fabric_get_rank(p_ftree) > FAT_TREE_MAX_RANK)
>>>
>>> Should < FAT_TREE_MIN_RANK also be checked there? Does it fall back
>>> to default routing for this case too?
>> This is also checked, but as part of earlier checks in the same function:
>> FatTree routing will abort even before ranking the tree and fall back to the default
>> routing if a fabric has fewer than 2 switches.
> 
> What about 2 or more switches but rank is 1 ? Isn't that possible too ?

2 or more switches and tree rank 1 means that all the switches are leaf switches,
which means that they are all connected directly to HCAs.
So either these switches are not connected to each other, which means that we
actually have several disconnected subnets, or they are connected to each other,
which means that they have connections within the same rank of the tree, which is illegal
and is discovered by indexing.

But I agree - adding the (< FAT_TREE_MIN_RANK) check will improve readability.
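
A minimal sketch of what the explicit range check could look like, combined
with the earlier switch-count check. The bounds are assumptions made for
illustration: the real FAT_TREE_MIN_RANK and FAT_TREE_MAX_RANK values are
not quoted in this thread, and ftree_topology_usable and the SKETCH_*
names are hypothetical, not functions or macros from osm_ucast_ftree.c.

```c
#include <assert.h>

/* Assumed values, for illustration only. */
#define SKETCH_MIN_SWITCHES 2
#define SKETCH_MIN_RANK     2
#define SKETCH_MAX_RANK     8

/* Returns 1 if fat-tree routing may proceed, 0 if the caller should
 * fall back to default routing. */
int ftree_topology_usable(unsigned num_switches, unsigned tree_rank)
{
    /* Sanity check on the assumed bounds themselves. */
    assert(SKETCH_MIN_RANK <= SKETCH_MAX_RANK);

    /* Early check: too few switches to form a tree. */
    if (num_switches < SKETCH_MIN_SWITCHES)
        return 0;

    /* Explicit lower and upper bound on the rank, so the reader does
     * not have to infer the lower bound from the indexing step. */
    if (tree_rank < SKETCH_MIN_RANK || tree_rank > SKETCH_MAX_RANK)
        return 0;

    return 1;
}
```

With these assumed bounds, the case asked about (2 or more switches, rank 1)
is rejected by the rank test directly.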

-- Yevgeny
 
> -- Hal
> 
>> -- Yevgeny
>>> Thanks.
>>>
>>> -- Hal
>>>
> 


From sweitzen at cisco.com  Wed Dec 20 08:29:14 2006
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed, 20 Dec 2006 08:29:14 -0800
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
Message-ID: <A15335FBE9BD2449AF2C9EF3D1EB8EA302BFD0ED@xmb-sjc-216.amer.cisco.com>

> > Where is the daily build?
>
> User: http://staging.openfabrics.org/builds/ofa_1_2_user/
> Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/

How do I compile this daily build?

Can I get a daily build that is packaged like the release candidates,
with install.sh?

Scott


From steve.apo at googlemail.com  Wed Dec 20 08:46:43 2006
From: steve.apo at googlemail.com (Steven Wooding)
Date: Wed, 20 Dec 2006 16:46:43 +0000
Subject: [openib-general] RDMA to shared memory causing corruption
Message-ID: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com>

Hi,

I need some advice on a problem I've got RDMAing some data into a shared
memory segment.

Everything works great until I try to transfer a message of 294 Kbytes or
larger. There is some management info in the top end of the shared
memory segment (we're using the Boost shm library). This management area gets
corrupted after the RDMA transfer has occurred.

I've tried various things to debug this: allocating more memory than
I need from the shared memory segment for the landing buffer, making the whole
shared memory segment larger, and making the management area smaller. But
I always hit this 294K limit. I don't know whether it's a problem with
Boost shmem or with RDMA writing to memory areas that it shouldn't.
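
One way to narrow this down is to check, before posting the transfer, that
the landing buffer lies entirely inside the segment and does not touch the
management area. The following is only a sketch of that bounds check in
plain C: byte_range_t, range_contains, ranges_overlap and
landing_buffer_is_safe are hypothetical helpers, not part of Boost or of
any verbs library, and the caller has to supply the real addresses and
sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A region of memory as a start address and a length in bytes. */
typedef struct byte_range {
    uintptr_t start;
    size_t    len;
} byte_range_t;

/* 1 if inner lies entirely within outer (overflow-safe). */
int range_contains(byte_range_t outer, byte_range_t inner)
{
    uintptr_t offset;

    if (inner.start < outer.start)
        return 0;
    offset = inner.start - outer.start;
    return offset <= outer.len && inner.len <= outer.len - offset;
}

/* 1 if the two ranges share at least one byte. */
int ranges_overlap(byte_range_t a, byte_range_t b)
{
    if (a.len == 0 || b.len == 0)
        return 0;
    if (a.start <= b.start)
        return b.start - a.start < a.len;
    return a.start - b.start < b.len;
}

/* 1 if an RDMA write into landing cannot touch the management area
 * or run past the end of the segment. */
int landing_buffer_is_safe(byte_range_t segment, byte_range_t mgmt,
                           byte_range_t landing)
{
    assert(range_contains(segment, mgmt));
    return range_contains(segment, landing) &&
           !ranges_overlap(landing, mgmt);
}
```

If this check passes for the address and length actually used in the work
request and the corruption still occurs, the address or length registered
for the transfer is the next thing to compare against what the allocator
returned.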

Any help or clues would be great.

Thanks a lot.

Steve.

From chrise at sgi.com  Wed Dec 20 08:56:24 2006
From: chrise at sgi.com (Chris Elmquist)
Date: Wed, 20 Dec 2006 10:56:24 -0600
Subject: [openib-general] building and running IBMgtsim?
Message-ID: <20061220165624.GL31149@sgi.com>

Folks,

I am trying to build and run IBMgtsim so that I can explore some different
topologies and system sizes.  But I am having a lot of trouble getting
OpenSM to work with the simulator.

I pulled down Eitan's ibutils git tree (to get the simulator) and
am otherwise using the OFED 1.1 tarball for the rest of the stuff.
I suspect I have a problem with OpenSM not being built correctly to use
the simulator.

Does anyone have a recipe on how to build and install all of these pieces
(ie, openib, openSM and ibmgtsim) so that they will work together?

I have been just trying to run one of the tests provided with the
simulator like this:

% cd ~/ibutils/ibmgtsim/tests
% RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm

but we get this sort of output:

-I- Using random seed:43204
-I- Simulation directory is: /tmp/ibmgtsim.29716
-I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.topo -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log
-I- Simulator Ready
-I- Connecting to the simulator control server:pcplod.americas.sgi.com port:37265
-I- Connected to the simulator control server
-I- Defined 51 guids
-I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a 2}
-I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
-I- Waiting for OpenSM subnet up ...
-I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 0x2c90000000009
-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
-I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5424: Unable to Open Port 0x2c90000000009
-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
-I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
-I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
-I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
-I- New 1 events of /tmp/ibmgtsim.29716/osm.log

Thank you.

Chris
SGI Network Engineering
-- 
Chris Elmquist          mailto:chrise at sgi.com      (651)683-3093
                        Silicon Graphics, Inc.     Eagan, MN


From vlad at dev.mellanox.co.il  Wed Dec 20 09:27:19 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 20 Dec 2006 19:27:19 +0200
Subject: [openib-general] OFED 1.2 18-Dec meeting summary
In-Reply-To: <A15335FBE9BD2449AF2C9EF3D1EB8EA302BFD0ED@xmb-sjc-216.amer.cisco.com>
References: <A15335FBE9BD2449AF2C9EF3D1EB8EA302BFD0ED@xmb-sjc-216.amer.cisco.com>
Message-ID: <45897277.50406@dev.mellanox.co.il>

Scott Weitzenkamp (sweitzen) wrote:
>>> Where is the daily build?
>>>       
>> User: http://staging.openfabrics.org/builds/ofa_1_2_user/
>> Kernel: http://staging.openfabrics.org/builds/ofa_1_2_kernel/
>>     
>
> How do I compile this daily build?
>
> Can I get a daily build that is packaged like the release candidates,
> with install.sh?
>
> Scott
>   
Download and unpack the tgz files for user and kernel, then execute:
    ./configure ... (see --help)
    make
    make install

Example for userspace:
    ./configure --with-libibverbs --with-libmthca --with-libipathverbs \
        --with-libibcm --with-libsdp --with-librdmacm --with-opensm \
        --with-openib-diags --with-perftest --with-mstflint --with-srptools \
        --with-ipoibtools
    make
    make install

Example for kernel:
    ./configure --with-ipoib-mod --with-sdp-mod --with-srp-mod \
        --with-user_mad-mod --with-user_access-mod --with-mthca-mod \
        --with-core-mod --with-addr_trans-mod
    make
    make install


Updated: 
https://openib.org/tiki/tiki-index.php?page=OFED+1.2+release+plan+and+features

Regards,
Vladimir


From Ashish.Batwara at lsi.com  Wed Dec 20 10:06:08 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Wed, 20 Dec 2006 11:06:08 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>

Hi,
Please see the information below

This is what I did:
/etc/init.d/openibd start
/etc/init.d/opensmd  start
modprobe ib_srp

Issued the command /usr/local/ofed/sbin/ibsrpdm -c to get the
information about the target, and used it in

echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

Yes, earlier I had a SilverStorm switch running an SM, but I have now
taken it out and am connecting the target and host directly.

I have only one port connected between the host and the target.
The link is not stable because I am repeatedly stopping and restarting,
since this does not seem to be working. I did not know what the issue was
until I looked at the console log, which showed "Got failed path rec
status -110". After seeing that, I searched on Google and found
https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html
which suggests this is a bug on 64-bit machines.
BTW, my Linux server is 64-bit.
When I hooked up a 32-bit server running OFED-1.1, my target was
discovered with the same procedure.

So the question is: what is the fix for the "Got failed path rec
status -110" issue on a 64-bit machine?
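
For reference, the quoted reply below points out that -110 is ETIMEDOUT,
i.e. the status is a negated errno value. A small sketch of decoding such a
status in user space follows; status_is_timeout and describe_status are
hypothetical helper names, and the numeric value 110 is what Linux on
x86/x86_64 uses for ETIMEDOUT (other platforms may differ, so the code
compares against the symbolic constant rather than the number).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Kernel-style status codes are negated errno values:
 * 0 means success, -ETIMEDOUT means the request timed out. */
int status_is_timeout(int status)
{
    return status == -ETIMEDOUT;
}

/* Writes a human-readable description of status into buf and returns
 * the value returned by snprintf. */
int describe_status(int status, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    if (status >= 0)
        return snprintf(buf, buflen, "status %d (success)", status);
    return snprintf(buf, buflen, "status %d (%s)",
                    status, strerror(-status));
}
```

A timeout here means the PathRecord query got no answer from the SA, which
fits the earlier question of whether the target port is ACTIVE on the
subnet.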

Thanks
Ashish

-----Original Message-----
From: Hal Rosenstock [mailto:halr at voltaire.com] 
Sent: Tuesday, December 19, 2006 10:35 PM
To: Batwara, Ashish
Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
Subject: RE: [openib-general] opensm

On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
> Hi,
> Please look towards the end of the attached file.

What options are you starting opensm with ? What is the command line ?

Also, it looks like (at least at one point) you have another SM on the
subnet. What is the make (vendor) for your switch ?

I see many SM port is DOWN. What is going on with this port ? Why is the
physical link not LinkUp and stable ? That is the main issue and is
likely why the SubnGet of NodeInfo is not being responded to.

-- Hal

> Thanks
> Ashish
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, December 19, 2006 5:06 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: Re: [openib-general] opensm
> 
> Ashish,
> 
> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> > Hi,
> > 
> > Here is the info that you have asked. I am seeing the Subnet manager
> > is up now having the port active. But server is not able to discover
> > the target. I am seeing the error "Got failed path rec status -110"
on
> > Linux console. 
> 
> That means the request for an SA PathRecord from the initiator to the
> target failed (-110 is ETIMEDOUT). Are you sure the target is up
> (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
> 
> -- Hal
> 
> > Below are the output of different commands. I am using following to
> > discover the target:
> > 
> >  
> > 
> > /etc/init.d/opensmd start
> > 
> > /etc/init.d/openibd start
> > 
> > modprobe ib_srp
> > 
> > echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
> > 
> >  
> > 
> >  
> > 
> > [root at p49 ~]# ibv_devinfo
> > 
> > hca_id: mthca0
> > 
> >         fw_ver:                         5.1.400
> > 
> >         node_guid:                      0002:c902:0022:cce0
> > 
> >         sys_image_guid:                 0002:c902:0022:cce3
> > 
> >         vendor_id:                      0x02c9
> > 
> >         vendor_part_id:                 25218
> > 
> >         hw_ver:                         0xA0
> > 
> >         board_id:                       MT_0370130002
> > 
> >         phys_port_cnt:                  2
> > 
> >                 port:   1
> > 
> >                         state:                  PORT_DOWN (1)
> > 
> >                         max_mtu:                2048 (4)
> > 
> >                         active_mtu:             512 (2)
> > 
> >                         sm_lid:                 0
> > 
> >                         port_lid:               0
> > 
> >                         port_lmc:               0x00
> > 
> >  
> > 
> >                 port:   2
> > 
> >                         state:                  PORT_ACTIVE (4)
> > 
> >                         max_mtu:                2048 (4)
> > 
> >                         active_mtu:             2048 (4)
> > 
> >                         sm_lid:                 1
> > 
> >                         port_lid:               1
> > 
> >                         port_lmc:               0x00
> > hca_id: mthca1
> > 
> >         fw_ver:                         5.1.400
> > 
> >         node_guid:                      0002:c902:0022:cd2c
> > 
> >         sys_image_guid:                 0002:c902:0022:cd2f
> > 
> >         vendor_id:                      0x02c9
> > 
> >         vendor_part_id:                 25218
> > 
> >         hw_ver:                         0xA0
> > 
> >         board_id:                       MT_0370130002
> > 
> >         phys_port_cnt:                  2
> > 
> >                 port:   1
> > 
> >                         state:                  PORT_DOWN (1)
> > 
> >                         max_mtu:                2048 (4)
> > 
> >                         active_mtu:             512 (2)
> > 
> >                         sm_lid:                 0
> > 
> >                         port_lid:               0
> > 
> >                         port_lmc:               0x00
> > 
> >  
> > 
> >                 port:   2
> > 
> >                         state:                  PORT_DOWN (1)
> > 
> >                         max_mtu:                2048 (4)
> > 
> >                         active_mtu:             512 (2)
> > 
> >                         sm_lid:                 0
> > 
> >                         port_lid:               0
> > 
> >                         port_lmc:               0x00
> > 
> >  
> > 
> >  
> > 
> > [root at p49 ~]# uname -a
> > 
> > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> > EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> > 
> >  
> > 
> > [root at p49 ~]# cat /etc/infiniband/info
> > 
> > #!/bin/bash
> > 
> >  
> > 
> > echo prefix=/usr/local/ofed
> > 
> > echo Kernel=2.6.9-42.0.3.ELsmp
> > 
> > echo
> > 
> > echo "Configure options: --with-dapl --with-ipoibtools
--with-libibcm
> > --with-libibcommon --with-libibmad --with-libibumad
--with-libibverbs
> > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> > --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> > 
> > echo
> > 
> >  
> > 
> > OFED Version: OFED-1.1
> 
> 
> 
> > 
> > Thanks
> > Ashish
> >
> > -----Original Message-----
> > From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> > Sent: Tuesday, December 19, 2006 5:18 AM
> > To: Batwara, Ashish
> > Cc: ishai at mellanox.co.il; openib-general at openib.org
> > Subject: Re: [openib-general] opensm
> >
> > Hi Ashish,
> >
> > SRP people say they have no such error message.
> > OpenSM does. So I take it back.
> >
> > Ashish,
> > Please provide more info:
> >
> > 1. ibv_devinfo
> > 2. Version of code you are using
> > 3. Command line you use for starting opensm
> > 4. /var/log/osm.log
> >
> > Thanks and sorry for the confusion.
> >
> > EZ
> >
> > Eitan Zahavi wrote:
> > > This is not an OpenSM issue.
> > > Forwarded to the SRP people.
> > >
> > > EZ
> > > Batwara, Ashish wrote:
> > >
> > >> Hi,
> > >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> > >> connected to IB Switch. ibnodes command displays the information about
> > >> the Switch ports and HCA ports.
> > >> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> > >> for all the 4 ports and immediately after I see "failed srp_daemon" for
> > >> all the ports and the displays "SM Port is down".
> > >>
> > >> I tried several times and even rebooted the server few times but no
> > >> luck.
> > >>
> > >> Does anybody know what this problem is?
> > >>
> > >> Thanks
> > >> Ashish
> > >>
> > 
> > >> _______________________________________________
> > 
> > >> openib-general mailing list
> > 
> > >> openib-general at openib.org
> > 
> > >> http://openib.org/mailman/listinfo/openib-general
> > 
> > >> 
> > 
> > >> To unsubscribe, please visit
> > http://openib.org/mailman/listinfo/openib-general


From halr at voltaire.com  Wed Dec 20 10:21:49 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 13:21:49 -0500
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>
Message-ID: <1166638908.4519.57147.camel@hal.voltaire.com>

Hi,

On Wed, 2006-12-20 at 13:06, Batwara, Ashish wrote:
> Hi,
> Please see the information below
> 
> This is what I did:
> /etc/init.d/openibd start
> /etc/init.d/opensmd  start

Where does OpenSM get its parameters from? What are they?

> modprobe ib_srp
> 
> Issued the command /usr/local/ofed/sbin/ibsrpdm -c    to get the
> information about target and used them in 
> 
> echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
> 
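The add_target string has to reach sysfs as one comma-separated line, with
no breaks introduced by mail wrapping. A minimal sketch of assembling it in
C follows. The helper names srp_format_target and is_hex_of_len are
hypothetical, not part of srptools or OFED, and the check that dgid is 32
hex digits reflects my assumption that it is a full 128-bit GID.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if s consists of exactly n hexadecimal digits, else 0. */
static int is_hex_of_len(const char *s, size_t n)
{
    size_t i;

    if (strlen(s) != n)
        return 0;
    for (i = 0; i < n; i++)
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    return 1;
}

/*
 * Hypothetical helper: build the single-line string that is echoed into
 * .../infiniband_srp/<port>/add_target. Returns the string length, or -1
 * if the dgid is malformed or the buffer is too small.
 */
int srp_format_target(char *buf, size_t len,
                      const char *id_ext, const char *ioc_guid,
                      const char *dgid, const char *pkey,
                      const char *service_id)
{
    int n;

    assert(buf && id_ext && ioc_guid && dgid && pkey && service_id);

    /* Assumption: dgid is a 128-bit GID written as 32 hex digits. */
    if (!is_hex_of_len(dgid, 32))
        return -1;

    n = snprintf(buf, len,
                 "id_ext=%s,ioc_guid=%s,dgid=%s,pkey=%s,service_id=%s",
                 id_ext, ioc_guid, dgid, pkey, service_id);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Passing the five values reported by ibsrpdm -c to srp_format_target gives a
string that can be written to add_target in one write; a dgid that lost
digits to line wrapping is rejected before it reaches the kernel.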
> Yes, earlier I had silverstorm switch which was running SM but now I
> have taken that out and directly connecting the target and host.
> 
> I have only one port connected between the host and the target.
> The link is not stable because I keep stopping and restarting things:
> this did not seem to be working, and I did not know what the issue was
> until I looked at the console log, which showed "Got failed path rec
> status -110". After seeing that I searched on Google and found
> https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html

That's a pretty old email.

> it seems to be a bug with 64-bit machine.
> BTW, my linux server is 64-bit.
> When I hooked up 32-bit server running OFED-1.1, I see my target
> discovered with the same procedure.

OpenSM has run successfully on 64 bit servers (as part of OFED 1.1).

> So the whole question is: what is the fix for "Got failed path rec
> status -110" on a 64-bit machine?

I'm not sure what the problem is and I'm not sufficiently familiar with
building it from the OFED distribution on a 64 bit machine.
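
For reference, the -110 in that message is a negated Linux errno value. A
minimal sketch for decoding such a status in user space follows. The
function name describe_kernel_status is hypothetical, and the mapping of
110 to ETIMEDOUT holds on Linux x86 and x86_64 but is not guaranteed on
every architecture.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of OFED): describe a kernel-style status
 * such as -110. Returns 0 if it decoded a negative errno, -1 otherwise.
 */
int describe_kernel_status(int status, char *buf, size_t len)
{
    int err;

    assert(buf != NULL && len > 0);

    if (status >= 0) {
        snprintf(buf, len, "%d is not a negative errno", status);
        return -1;
    }

    err = -status;
    /* strerror() supplies the text; tag the timeout case explicitly. */
    snprintf(buf, len, "%d: %s%s", status, strerror(err),
             err == ETIMEDOUT ? " (ETIMEDOUT)" : "");
    return 0;
}
```

A timeout here only says that no PathRecord response came back in time. It
does not by itself say whether the SA, the link, or the target is at fault.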

-- Hal

> Thanks
> Ashish
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, December 19, 2006 10:35 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: RE: [openib-general] opensm
> 
> On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
> > Hi,
> > Please look towards the end of the attached file.
> 
> What options are you starting opensm with ? What is the command line ?
> 
> Also, it looks like (at least at one point) you have another SM on the
> subnet. What is the make (vendor) for your switch ?
> 
> I see many SM port is DOWN. What is going on with this port ? Why is the
> physical link not LinkUp and stable ? That is the main issue and is
> likely why the SubnGet of NodeInfo is not being responded to.
> 
> -- Hal
> 
> > Thanks
> > Ashish
> > 
> > -----Original Message-----
> > From: Hal Rosenstock [mailto:halr at voltaire.com] 
> > Sent: Tuesday, December 19, 2006 5:06 PM
> > To: Batwara, Ashish
> > Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> > Subject: Re: [openib-general] opensm
> > 
> > Ashish,
> > 
> > On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
> > > Hi,
> > > 
> > > Here is the info that you have asked. I am seeing the Subnet manager
> > > is up now having the port active. But server is not able to discover
> > > the target. I am seeing the error "Got failed path rec status -110"
> > > on Linux console.
> > 
> > That means the request for an SA PathRecord from the initiator to the
> > target failed (-110 is ETIMEDOUT). Are you sure the target is up
> > (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
> > 
> > -- Hal
> > 
> > > Below are the output of different commands. I am using following to
> > > discover the target:
> > >
> > > /etc/init.d/opensmd start
> > > /etc/init.d/openibd start
> > > modprobe ib_srp
> > > echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
> > >
> > > [root at p49 ~]# ibv_devinfo
> > > hca_id: mthca0
> > >         fw_ver:                         5.1.400
> > >         node_guid:                      0002:c902:0022:cce0
> > >         sys_image_guid:                 0002:c902:0022:cce3
> > >         vendor_id:                      0x02c9
> > >         vendor_part_id:                 25218
> > >         hw_ver:                         0xA0
> > >         board_id:                       MT_0370130002
> > >         phys_port_cnt:                  2
> > >                 port:   1
> > >                         state:                  PORT_DOWN (1)
> > >                         max_mtu:                2048 (4)
> > >                         active_mtu:             512 (2)
> > >                         sm_lid:                 0
> > >                         port_lid:               0
> > >                         port_lmc:               0x00
> > >
> > >                 port:   2
> > >                         state:                  PORT_ACTIVE (4)
> > >                         max_mtu:                2048 (4)
> > >                         active_mtu:             2048 (4)
> > >                         sm_lid:                 1
> > >                         port_lid:               1
> > >                         port_lmc:               0x00
> > > hca_id: mthca1
> > >         fw_ver:                         5.1.400
> > >         node_guid:                      0002:c902:0022:cd2c
> > >         sys_image_guid:                 0002:c902:0022:cd2f
> > >         vendor_id:                      0x02c9
> > >         vendor_part_id:                 25218
> > >         hw_ver:                         0xA0
> > >         board_id:                       MT_0370130002
> > >         phys_port_cnt:                  2
> > >                 port:   1
> > >                         state:                  PORT_DOWN (1)
> > >                         max_mtu:                2048 (4)
> > >                         active_mtu:             512 (2)
> > >                         sm_lid:                 0
> > >                         port_lid:               0
> > >                         port_lmc:               0x00
> > >
> > >                 port:   2
> > >                         state:                  PORT_DOWN (1)
> > >                         max_mtu:                2048 (4)
> > >                         active_mtu:             512 (2)
> > >                         sm_lid:                 0
> > >                         port_lid:               0
> > >                         port_lmc:               0x00
> > >
> > > [root at p49 ~]# uname -a
> > > Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
> > > EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > [root at p49 ~]# cat /etc/infiniband/info
> > > #!/bin/bash
> > >
> > > echo prefix=/usr/local/ofed
> > > echo Kernel=2.6.9-42.0.3.ELsmp
> > > echo
> > > echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
> > > --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
> > > --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
> > > --with-libsdp --with-openib-diags --with-srptools --with-mstflint
> > > --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
> > > --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
> > > --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
> > > echo
> > >
> > > OFED Version: OFED-1.1
> > 
> > 
> > 
> > > 
> > > Thanks
> > > Ashish
> > >
> > > -----Original Message-----
> > > From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> > > Sent: Tuesday, December 19, 2006 5:18 AM
> > > To: Batwara, Ashish
> > > Cc: ishai at mellanox.co.il; openib-general at openib.org
> > > Subject: Re: [openib-general] opensm
> > >
> > > Hi Ashish,
> > >
> > > SRP people say they have no such error message.
> > > OpenSM does. So I take it back.
> > >
> > > Ashish,
> > > Please provide more info:
> > >
> > > 1. ibv_devinfo
> > > 2. Version of code you are using
> > > 3. Command line you use for starting opensm
> > > 4. /var/log/osm.log
> > >
> > > Thanks and sorry for the confusion.
> > >
> > > EZ
> > >
> > > Eitan Zahavi wrote:
> > > > This is not an OpenSM issue.
> > > > Forwarded to the SRP people.
> > > >
> > > > EZ
> > > > Batwara, Ashish wrote:
> > > >
> > > >> Hi,
> > > >> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
> > > >> connected to IB Switch. ibnodes command displays the information about
> > > >> the Switch ports and HCA ports.
> > > >> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
> > > >> for all the 4 ports and immediately after I see "failed srp_daemon" for
> > > >> all the ports and the displays "SM Port is down".
> > > >>
> > > >> I tried several times and even rebooted the server few times but no
> > > >> luck.
> > > >>
> > > >> Does anybody know what this problem is?
> > > >>
> > > >> Thanks
> > > >> Ashish
> > > >>


From halr at voltaire.com  Wed Dec 20 11:03:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 14:03:30 -0500
Subject: [openib-general] [PATCH 1/2]: OpenSM/osm_sa_informinfo.c: Fix
 InformInfoRecord searches
Message-ID: <1166641409.4519.59078.camel@hal.voltaire.com>

OpenSM/osm_sa_informinfo.c: Fix InformInfoRecord searches

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sa_informinfo.c b/osm/opensm/osm_sa_informinfo.c
index 06ea90c..374b61d 100644
--- a/osm/opensm/osm_sa_informinfo.c
+++ b/osm/opensm/osm_sa_informinfo.c
@@ -368,8 +368,6 @@ __osm_sa_inform_info_rec_by_comp_mask(
   osm_port_t *             p_subscriber_port;
   osm_physp_t *            p_subscriber_physp;
   const osm_physp_t*       p_req_physp;
-  osm_infr_t*              p_infr_rec = NULL;
-  ib_inform_info_record_t  inform_info_rec;
   osm_iir_item_t*          p_rec_item;
 
   OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_inform_info_rec_by_comp_mask );
@@ -378,72 +376,58 @@ __osm_sa_inform_info_rec_by_comp_mask(
   comp_mask = p_ctxt->comp_mask;
   p_req_physp = p_ctxt->p_req_physp;
 
-  /* Both subscriber GID and enum specified */
-  if ((comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID) &&
-      (comp_mask & IB_IIR_COMPMASK_ENUM))
-  {
-    inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid;
-    inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum;
-    p_infr_rec = osm_infr_get_by_rid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
-    goto Done;
-  }
-
   if (comp_mask & IB_IIR_COMPMASK_SUBSCRIBERGID)
   {
-    inform_info_rec.subscriber_gid = p_ctxt->subscriber_gid;
-    p_infr_rec = osm_infr_get_by_gid(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
-    goto Done;
+    if (memcmp(&p_infr->inform_record.subscriber_gid,
+	       &p_ctxt->subscriber_gid,
+	       sizeof(p_infr->inform_record.subscriber_gid)))
+      goto Exit; 
   }
 
   if (comp_mask & IB_IIR_COMPMASK_ENUM)
   {
-    inform_info_rec.subscriber_enum = p_ctxt->subscriber_enum;
-    p_infr_rec = osm_infr_get_by_enum(p_rcv->p_subn, p_rcv->p_log, &inform_info_rec);
-    goto Done;
+    if (p_infr->inform_record.subscriber_enum != p_ctxt->subscriber_enum)
+      goto Exit;
   }
 
   /* Implement any other needed search cases */
 
-Done:
-  if (p_infr_rec)
+  /* Ensure pkey is shared before returning any records */
+  portguid = p_infr->inform_record.subscriber_gid.unicast.interface_id;
+  p_subscriber_port = osm_get_port_by_guid( p_rcv->p_subn, portguid );
+  if ( p_subscriber_port == NULL )
   {
-    /* Ensure pkey is shared before returning any records */
-    portguid = p_infr_rec->inform_record.subscriber_gid.unicast.interface_id;
-    p_subscriber_port = osm_get_port_by_guid( p_rcv->p_subn, portguid );
-    if ( p_subscriber_port == NULL )
-    {
-      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
-               "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: "
-               "Invalid subscriber port guid: 0x%016" PRIx64 "\n",
-               cl_ntoh64(portguid) );
-      goto Exit;
-    }
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_sa_inform_info_rec_by_comp_mask: ERR 430D: "
+             "Invalid subscriber port guid: 0x%016" PRIx64 "\n",
+             cl_ntoh64(portguid) );
+    goto Exit;
+  }
 
-    /* get the subscriber InformInfo physical port */
-    p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port);
-    /* make sure that the requester and subscriber port can access each other 
-       according to the current partitioning. */
-    if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp ))
-    {
-      osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
-               "__osm_sa_inform_info_rec_by_comp_mask: "
-               "requester and subscriber ports don't share pkey\n" );
-      goto Exit;
-    }
+  /* get the subscriber InformInfo physical port */
+  p_subscriber_physp = osm_port_get_default_phys_ptr(p_subscriber_port);
+  /* make sure that the requester and subscriber port can access each other 
+     according to the current partitioning. */
+  if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_subscriber_physp ))
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_sa_inform_info_rec_by_comp_mask: "
+             "requester and subscriber ports don't share pkey\n" );
+    goto Exit;
+  }
  
-    p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool );
-    if( p_rec_item == NULL )
-    {
-      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
-               "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: "
-               "cl_qlock_pool_get failed\n" );
-      goto Exit;
-    }
-
-    memcpy((void *)&p_rec_item->rec, (void *)&p_infr_rec->inform_record, sizeof(ib_inform_info_record_t));
-    cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+  p_rec_item = (osm_iir_item_t*)cl_qlock_pool_get( &p_rcv->pool );
+  if( p_rec_item == NULL )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_sa_inform_info_rec_by_comp_mask: ERR 430E: "
+             "cl_qlock_pool_get failed\n" );
+    goto Exit;
   }
 
+  memcpy((void *)&p_rec_item->rec, (void *)&p_infr->inform_record, sizeof(ib_inform_info_record_t));
+  cl_qlist_insert_tail( p_ctxt->p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+
 Exit:
   OSM_LOG_EXIT( p_rcv->p_log );
 }
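
For readers following the change: the rewritten callback tests one stored
record at a time against only those fields that the request's component
mask selects, instead of doing a separate list lookup per field. A minimal
standalone sketch of that filter follows. The demo_* names and DEMO_MASK_*
bits are hypothetical stand-ins, not the real OpenSM types or the
IB_IIR_COMPMASK_* values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mask bits; the real IB_IIR_COMPMASK_* values differ. */
#define DEMO_MASK_GID  (1u << 0)
#define DEMO_MASK_ENUM (1u << 1)

/* Simplified stand-in for the RID part of an InformInfoRecord. */
struct demo_rec {
    uint8_t  subscriber_gid[16];
    uint16_t subscriber_enum;
};

/*
 * Returns 1 if rec matches want on every field selected by comp_mask,
 * 0 otherwise. Fields whose bit is clear are wildcards, so an empty
 * mask matches every record.
 */
int demo_rec_matches(const struct demo_rec *rec,
                     const struct demo_rec *want,
                     uint32_t comp_mask)
{
    assert(rec != NULL && want != NULL);

    if ((comp_mask & DEMO_MASK_GID) &&
        memcmp(rec->subscriber_gid, want->subscriber_gid,
               sizeof(rec->subscriber_gid)) != 0)
        return 0;

    if ((comp_mask & DEMO_MASK_ENUM) &&
        rec->subscriber_enum != want->subscriber_enum)
        return 0;

    return 1;
}
```

Because each selected field can only reject a record, handling both bits
together needs no special case, which is why the combined GID-plus-enum
branch could be dropped.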


From halr at voltaire.com  Wed Dec 20 11:03:45 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 14:03:45 -0500
Subject: [openib-general] [PATCH 2/2] OpenSM: Eliminate no longer needed
 routines in osm_inform.c
Message-ID: <1166641411.4519.59080.camel@hal.voltaire.com>

OpenSM: Eliminate no longer needed routines in osm_inform.c

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_inform.h b/osm/include/opensm/osm_inform.h
index 0bc8810..3e8e122 100644
--- a/osm/include/opensm/osm_inform.h
+++ b/osm/include/opensm/osm_inform.h
@@ -223,103 +223,6 @@ osm_infr_destroy(
 *	Inform Record, osm_infr_construct, osm_infr_destroy
 *********/
 
-/****f* OpenSM: Inform Record/osm_infr_get_by_rid
-* NAME
-*	osm_infr_get_by_rid
-*
-* DESCRIPTION
-*	Find a matching osm_infr_t in the subnet DB by inform_info_record RID 
-*
-* SYNOPSIS
-*/
-osm_infr_t*
-osm_infr_get_by_rid(
-	IN osm_subn_t	const	*p_subn,
-	IN osm_log_t	*p_log,
-	IN ib_inform_info_record_t* const p_inf_rec );
-/*
-* PARAMETERS
-*	p_subn 
-*		[in] Pointer to the subnet object
-*
-*	p_log
-*		[in] Pointer to the log object
-*
-*	p_inf_rec
-*		[in] Pointer to an inform_info record with the search RID
-*
-* RETURN
-*	The matching osm_infr_t
-* SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
-*********/
-
-/****f* OpenSM: Inform Record/osm_infr_get_by_gid
-* NAME
-*	osm_infr_get_by_gid
-*
-* DESCRIPTION
-*	Find a matching osm_infr_t in the subnet DB by inform_info_record
-*	subscriber GID
-*
-* SYNOPSIS
-*/
-osm_infr_t*
-osm_infr_get_by_gid(
-	IN osm_subn_t	const	*p_subn,
-	IN osm_log_t	*p_log,
-	IN ib_inform_info_record_t* const p_inf_rec );
-/*
-* PARAMETERS
-*	p_subn 
-*		[in] Pointer to the subnet object
-*
-*	p_log
-*		[in] Pointer to the log object
-*
-*	p_inf_rec
-*		[in] Pointer to an inform_info record with the search
-*		     subscriber GID
-*
-* RETURN
-*	The matching osm_infr_t
-* SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
-*********/
-
-/****f* OpenSM: Inform Record/osm_infr_get_by_enum
-* NAME
-*       osm_infr_get_by_enum
-*
-* DESCRIPTION
-*       Find a matching osm_infr_t in the subnet DB by inform_info_record
-*       subscriber enum 
-*
-* SYNOPSIS
-*/
-osm_infr_t*
-osm_infr_get_by_enum(
-	IN osm_subn_t	const	*p_subn,
-	IN osm_log_t	*p_log,
-	IN ib_inform_info_record_t* const p_inf_rec );
-/*
-* PARAMETERS
-*	p_subn 
-*		[in] Pointer to the subnet object
-*
-*	p_log
-*		[in] Pointer to the log object
-*
-*	p_inf_rec
-*		[in] Pointer to an inform_info record with the search
-*		     subscriber enum 
-*
-* RETURN
-*	The matching osm_infr_t
-* SEE ALSO
-*	Inform Record, osm_infr_construct, osm_infr_destroy
-*********/
-
 /****f* OpenSM: Inform Record/osm_infr_get_by_rec
 * NAME
 *	osm_infr_get_by_rec
diff --git a/osm/opensm/osm_inform.c b/osm/opensm/osm_inform.c
index 074a3f9..98b7ec4 100644
--- a/osm/opensm/osm_inform.c
+++ b/osm/opensm/osm_inform.c
@@ -117,148 +117,6 @@ osm_infr_new(
 }
 
 /**********************************************************************
- * Match an infr by the RID of the stored inform_info_record
- **********************************************************************/
-static cl_status_t
-__match_rid_of_inf_rec(
-  IN  const cl_list_item_t* const p_list_item,
-  IN  void*                       context )
-{
-  ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t  *)context;
-  osm_infr_t* p_infr = (osm_infr_t*)p_list_item;
-  int32_t count;
-
-  count = memcmp(
-    &p_infr->inform_record,
-    p_infr_rec,
-    sizeof(p_infr_rec->subscriber_gid) +
-    sizeof(p_infr_rec->subscriber_enum) );
-
-  if(count == 0)
-    return CL_SUCCESS;
-  else
-    return CL_NOT_FOUND;
-}
-
-/**********************************************************************
- * Match an infr by the subscriber GID of the stored inform_info_record
- **********************************************************************/
-static cl_status_t
-__match_gid_of_inf_rec(
-  IN  const cl_list_item_t* const p_list_item,
-  IN  void*                       context )
-{
-  ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t  *)context;
-  osm_infr_t* p_infr = (osm_infr_t*)p_list_item;
-  int32_t count;
-
-  count = memcmp(
-    &p_infr->inform_record,
-    p_infr_rec,
-    sizeof(p_infr_rec->subscriber_gid) );
-
-  if(count == 0)
-    return CL_SUCCESS;
-  else
-    return CL_NOT_FOUND;
-}
-
-/**********************************************************************
- * Match an infr by the subscriber enum of the stored inform_info_record
- **********************************************************************/
-static cl_status_t
-__match_enum_of_inf_rec(
-  IN  const cl_list_item_t* const p_list_item,
-  IN  void*                       context )
-{
-  ib_inform_info_record_t* p_infr_rec = (ib_inform_info_record_t  *)context;
-  osm_infr_t* p_infr = (osm_infr_t*)p_list_item;
-  int32_t count;
-
-  count = memcmp(
-    &p_infr->inform_record.subscriber_enum,
-    &p_infr_rec->subscriber_enum,
-    sizeof(p_infr_rec->subscriber_enum) );
-
-  if(count == 0)
-    return CL_SUCCESS;
-  else
-    return CL_NOT_FOUND;
-}
-
-/**********************************************************************
- **********************************************************************/
-osm_infr_t*
-osm_infr_get_by_rid(
-  IN osm_subn_t const *p_subn,
-  IN osm_log_t *p_log,
-  IN ib_inform_info_record_t* const p_infr_rec )
-{
-  cl_list_item_t* p_list_item;
-
-  OSM_LOG_ENTER( p_log, osm_infr_get_by_rid );
-
-  p_list_item = cl_qlist_find_from_head(
-    &p_subn->sa_infr_list,
-    __match_rid_of_inf_rec,
-    p_infr_rec );
-
-  if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) )
-    p_list_item = NULL;
-
-  OSM_LOG_EXIT( p_log );
-  return (osm_infr_t*)p_list_item;
-}
-
-/**********************************************************************
- **********************************************************************/
-osm_infr_t*
-osm_infr_get_by_gid(
-  IN osm_subn_t const *p_subn,
-  IN osm_log_t *p_log,
-  IN ib_inform_info_record_t* const p_infr_rec )
-{
-  cl_list_item_t* p_list_item;
-
-  OSM_LOG_ENTER( p_log, osm_infr_get_by_gid );
-
-  p_list_item = cl_qlist_find_from_head(
-    &p_subn->sa_infr_list,
-    __match_gid_of_inf_rec,
-    p_infr_rec );
-
-  if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) )
-    p_list_item = NULL;
-
-  OSM_LOG_EXIT( p_log );
-  return (osm_infr_t*)p_list_item;
-}
-
-/**********************************************************************
- **********************************************************************/
-osm_infr_t*
-osm_infr_get_by_enum(
-  IN osm_subn_t const *p_subn,
-  IN osm_log_t *p_log,
-  IN ib_inform_info_record_t* const p_infr_rec )
-{
-  cl_list_item_t* p_list_item;
-
-  OSM_LOG_ENTER( p_log, osm_infr_get_by_enum );
-
-  p_list_item = cl_qlist_find_from_head(
-    &p_subn->sa_infr_list,
-    __match_enum_of_inf_rec,
-    p_infr_rec );
-
-  if( p_list_item == cl_qlist_end( &p_subn->sa_infr_list ) )
-    p_list_item = NULL;
-
-  OSM_LOG_EXIT( p_log );
-  return (osm_infr_t*)p_list_item;
-}
-
-/**********************************************************************
  **********************************************************************/
 void
 __dump_all_informs(


From swise at opengridcomputing.com  Wed Dec 20 11:17:54 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:17:54 -0600
Subject: [openib-general] [PATCH v5 00/13] iw_cxgb3 - Chelsio T3 RDMA Driver
Message-ID: <20061220191754.19316.4914.stgit@dell3.ogc.int>


Roland, 

I think this is ready to go once the ethernet driver is pulled in by Jeff.  

Also: I'm gone after today returning Wednesday Jan 3rd.  I'll address any
new issues when I return.  

Cheers!

Steve.

----

Version 5 changes:

- BugFix: fixed broken endpoint state serialization
- Merged up to linus's tree as of 12/18/2006 (2.6.20-rc1)
- Removed all blank characters at the end of lines

The following series implements the Chelsio T3 iWARP/RDMA Driver to
be considered for inclusion in 2.6.20.  It depends on the Chelsio T3
Ethernet driver which is also under review now for 2.6.20. 

The latest Chelsio T3 Ethernet driver patch can be pulled from:

  http://service.chelsio.com/kernel.org/cxgb3.patch.bz2

This T3 iWARP/RDMA Driver patch series can be pulled from:

  http://www.opengridcomputing.com/downloads/iw_cxgb3_patches_v5.tar.bz2

A complete GIT kernel tree with all the T3 drivers can be pulled from:

  git://staging.openfabrics.org/~swise/cxgb3.git


From swise at opengridcomputing.com  Wed Dec 20 11:18:24 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:18:24 -0600
Subject: [openib-general] [PATCH v5 01/13] iw_cxgb3 Linux RDMA Core Changes
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220191824.19316.93248.stgit@dell3.ogc.int>


Support provider-specific data in ib_uverbs_cmd_req_notify_cq().
The Chelsio iwarp provider library needs to pass information to the
kernel verb for re-arming the CQ.
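
As a rough illustration of the calling convention only: the uverbs layer
wraps whatever bytes follow the fixed command in a udata descriptor and
hands it to the provider's req_notify_cq method, and providers that need
nothing extra ignore it. The sketch below uses hypothetical demo_* types
and a made-up hw_hint field; it is not the kernel's ib_udata API or the
Chelsio driver's actual payload.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum demo_cq_notify { DEMO_CQ_SOLICITED, DEMO_CQ_NEXT_COMP };

/* Hypothetical stand-in for ib_udata: the bytes after the fixed command. */
struct demo_udata {
    const void *inbuf;
    size_t      inlen;
};

struct demo_cq;

/* The provider method gains a third, provider-specific argument. */
struct demo_device {
    int (*req_notify_cq)(struct demo_cq *cq, enum demo_cq_notify notify,
                         struct demo_udata *udata);
};

struct demo_cq {
    struct demo_device *device;
    unsigned            hw_hint;   /* made-up provider-specific value */
};

/* A provider that needs no extra data simply ignores udata. */
int demo_plain_arm(struct demo_cq *cq, enum demo_cq_notify notify,
                   struct demo_udata *udata)
{
    (void)cq; (void)notify; (void)udata;
    return 0;
}

/* A provider that consumes the extra bytes passed from user space. */
int demo_hinted_arm(struct demo_cq *cq, enum demo_cq_notify notify,
                    struct demo_udata *udata)
{
    (void)notify;
    if (udata && udata->inbuf && udata->inlen >= sizeof(cq->hw_hint))
        memcpy(&cq->hw_hint, udata->inbuf, sizeof(cq->hw_hint));
    return 0;
}

/* Core-side dispatch: build the descriptor and call the provider. */
int demo_req_notify(struct demo_cq *cq, int solicited_only,
                    const void *extra, size_t extra_len)
{
    struct demo_udata udata = { extra, extra_len };

    assert(cq && cq->device && cq->device->req_notify_cq);
    return cq->device->req_notify_cq(cq,
                                     solicited_only ? DEMO_CQ_SOLICITED
                                                    : DEMO_CQ_NEXT_COMP,
                                     &udata);
}
```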

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/core/uverbs_cmd.c      |    9 +++++++--
 drivers/infiniband/hw/amso1100/c2.h       |    2 +-
 drivers/infiniband/hw/amso1100/c2_cq.c    |    3 ++-
 drivers/infiniband/hw/ehca/ehca_iverbs.h  |    3 ++-
 drivers/infiniband/hw/ehca/ehca_reqs.c    |    3 ++-
 drivers/infiniband/hw/ipath/ipath_cq.c    |    4 +++-
 drivers/infiniband/hw/ipath/ipath_verbs.h |    3 ++-
 drivers/infiniband/hw/mthca/mthca_cq.c    |    6 ++++--
 drivers/infiniband/hw/mthca/mthca_dev.h   |    4 ++--
 include/rdma/ib_verbs.h                   |    5 +++--
 10 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..5dd1de9 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -959,6 +959,7 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 				int out_len)
 {
 	struct ib_uverbs_req_notify_cq cmd;
+	struct ib_udata		      udata;
 	struct ib_cq                  *cq;
 
 	if (copy_from_user(&cmd, buf, sizeof cmd))
@@ -968,8 +969,12 @@ ssize_t ib_uverbs_req_notify_cq(struct i
 	if (!cq)
 		return -EINVAL;
 
-	ib_req_notify_cq(cq, cmd.solicited_only ?
-			 IB_CQ_SOLICITED : IB_CQ_NEXT_COMP);
+	INIT_UDATA(&udata, buf + sizeof cmd, 0,
+		   in_len - sizeof cmd, 0);
+
+	cq->device->req_notify_cq(cq, cmd.solicited_only ?
+				  IB_CQ_SOLICITED : IB_CQ_NEXT_COMP,
+				  &udata);
 
 	put_cq_read(cq);
 
diff --git a/drivers/infiniband/hw/amso1100/c2.h b/drivers/infiniband/hw/amso1100/c2.h
index 04a9db5..9a76869 100644
--- a/drivers/infiniband/hw/amso1100/c2.h
+++ b/drivers/infiniband/hw/amso1100/c2.h
@@ -519,7 +519,7 @@ extern void c2_free_cq(struct c2_dev *c2
 extern void c2_cq_event(struct c2_dev *c2dev, u32 mq_index);
 extern void c2_cq_clean(struct c2_dev *c2dev, struct c2_qp *qp, u32 mq_index);
 extern int c2_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *entry);
-extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+extern int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify, struct ib_udata *udata);
 
 /* CM */
 extern int c2_llp_connect(struct iw_cm_id *cm_id,
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..7ce8bca 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -217,7 +217,8 @@ int c2_poll_cq(struct ib_cq *ibcq, int n
 	return npolled;
 }
 
-int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int c2_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+	      struct ib_udata *udata)
 {
 	struct c2_mq_shared __iomem *shared;
 	struct c2_cq *cq;
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 3720e30..566b30c 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -135,7 +135,8 @@ int ehca_poll_cq(struct ib_cq *cq, int n
 
 int ehca_peek_cq(struct ib_cq *cq, int wc_cnt);
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify);
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata);
 
 struct ib_qp *ehca_create_qp(struct ib_pd *pd,
 			     struct ib_qp_init_attr *init_attr,
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..3ed6992 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -634,7 +634,8 @@ poll_cq_exit0:
 	return ret;
 }
 
-int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
+int ehca_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify,
+		       struct ib_udata *udata)
 {
 	struct ehca_cq *my_cq = container_of(cq, struct ehca_cq, ib_cq);
 
diff --git a/drivers/infiniband/hw/ipath/ipath_cq.c b/drivers/infiniband/hw/ipath/ipath_cq.c
index 87462e0..27ba4db 100644
--- a/drivers/infiniband/hw/ipath/ipath_cq.c
+++ b/drivers/infiniband/hw/ipath/ipath_cq.c
@@ -307,13 +307,15 @@ int ipath_destroy_cq(struct ib_cq *ibcq)
  * ipath_req_notify_cq - change the notification type for a completion queue
  * @ibcq: the completion queue
  * @notify: the type of notification to request
+ * @udata: provider-specific user data (unused here; NULL for kernel callers)
  *
  * Returns 0 for success.
  *
  * This may be called from interrupt context.  Also called by
  * ib_req_notify_cq() in the generic verbs code.
  */
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata)
 {
 	struct ipath_cq *cq = to_icq(ibcq);
 	unsigned long flags;
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.h b/drivers/infiniband/hw/ipath/ipath_verbs.h
index c0c8d5b..7db01ae 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.h
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.h
@@ -716,7 +716,8 @@ struct ib_cq *ipath_create_cq(struct ib_
 
 int ipath_destroy_cq(struct ib_cq *ibcq);
 
-int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify);
+int ipath_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+			struct ib_udata *udata);
 
 int ipath_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..15cbd49 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -722,7 +722,8 @@ repoll:
 	return err == 0 || err == -EAGAIN ? npolled : err;
 }
 
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	__be32 doorbell[2];
 
@@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
 	return 0;
 }
 
-int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
+int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
 {
 	struct mthca_cq *cq = to_mcq(ibcq);
 	__be32 doorbell[2];
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index fe5cecf..6b9ccf6 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
 
 int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
 		  struct ib_wc *entry);
-int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
-int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
+int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
+int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
 int mthca_init_cq(struct mthca_dev *dev, int nent,
 		  struct mthca_ucontext *ctx, u32 pdn,
 		  struct mthca_cq *cq);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0bfa332..4dc771f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -986,7 +986,8 @@ struct ib_device {
 					      struct ib_wc *wc);
 	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
 	int                        (*req_notify_cq)(struct ib_cq *cq,
-						    enum ib_cq_notify cq_notify);
+						    enum ib_cq_notify cq_notify,
+						    struct ib_udata *udata);
 	int                        (*req_ncomp_notif)(struct ib_cq *cq,
 						      int wc_cnt);
 	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
@@ -1420,7 +1421,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
 static inline int ib_req_notify_cq(struct ib_cq *cq,
 				   enum ib_cq_notify cq_notify)
 {
-	return cq->device->req_notify_cq(cq, cq_notify);
+	return cq->device->req_notify_cq(cq, cq_notify, NULL);
 }
 
 /**


From swise at opengridcomputing.com  Wed Dec 20 11:18:54 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:18:54 -0600
Subject: [openib-general] [PATCH v5 02/13] iw_cxgb3 Device Discovery and
	ULLD Linkage
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220191854.19316.18353.stgit@dell3.ogc.int>


Add code to discover all T3 devices and register them
with the T3 RDMA core and the Linux RDMA core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
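
Reviewer note: rnic_init() sets mem_pgsizes_bitmask to 0x7FFF. The
sketch below decodes such a mask under the assumption that bit i stands
for a page size of 4KB << i. That reading is an assumption made for
illustration, and the helper names are hypothetical; the header comment
in iwch.h words the rule differently.

```c
#include <stdint.h>

#define BASE_PAGE_SHIFT 12	/* assumed base page size of 4KB */

/* Page size represented by a given bit position (assumed encoding). */
static uint64_t pgsize_for_bit(unsigned int bit)
{
	return UINT64_C(1) << (BASE_PAGE_SHIFT + bit);
}

/* Returns 1 if 'size' is one of the page sizes enabled in 'mask'. */
static int pgsize_supported(uint32_t mask, uint64_t size)
{
	unsigned int bit;

	for (bit = 0; bit < 32; bit++)
		if (pgsize_for_bit(bit) == size)
			return (mask >> bit) & 1;
	return 0;	/* not a representable page size at all */
}

/* Largest page size enabled in 'mask', or 0 for an empty mask. */
static uint64_t largest_pgsize(uint32_t mask)
{
	uint64_t best = 0;
	unsigned int bit;

	for (bit = 0; bit < 32; bit++)
		if (mask & (UINT32_C(1) << bit))
			best = pgsize_for_bit(bit);
	return best;
}
```

Under this reading 0x7FFF has bits 0 through 14 set, so the largest
size is 4KB << 14 = 64MB. That differs from the "4KB-128MB" comment in
rnic_init(), which may be using a different convention.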

 drivers/infiniband/hw/cxgb3/iwch.c |  189 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch.h |  175 +++++++++++++++++++++++++++++++++
 2 files changed, 364 insertions(+), 0 deletions(-)
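
Reviewer note: iwch_init_module() brings up two layers in sequence and
iwch_exit_module() tears them down in reverse. A common way to keep the
two symmetric on partial failure is goto-based unwinding. The sketch
below shows that pattern with hypothetical hal_*/cm_* stand-ins; it is
not the driver's code, and whether the real cxio_hal_init() needs
undoing when iwch_cm_init() fails depends on code not shown in this
patch.

```c
#include <errno.h>

/* Hypothetical subsystems standing in for the HAL and CM layers. */
static int hal_up, cm_up;
static int fail_cm;	/* set non-zero to simulate a CM init failure */

static int hal_init(void) { hal_up = 1; return 0; }
static void hal_exit(void) { hal_up = 0; }

static int cm_init(void)
{
	if (fail_cm)
		return -ENOMEM;
	cm_up = 1;
	return 0;
}

static void cm_term(void) { cm_up = 0; }

/* Bring layers up in order; on failure undo only what already succeeded. */
static int demo_init(void)
{
	int err;

	err = hal_init();
	if (err)
		return err;
	err = cm_init();
	if (err)
		goto err_hal;
	return 0;

err_hal:
	hal_exit();
	return err;
}

/* Tear down in the reverse order of demo_init(). */
static void demo_exit(void)
{
	cm_term();
	hal_exit();
}
```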

diff --git a/drivers/infiniband/hw/cxgb3/iwch.c b/drivers/infiniband/hw/cxgb3/iwch.c
new file mode 100644
index 0000000..0c95f2c
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.c
@@ -0,0 +1,189 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+#include "iwch_user.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+
+#define DRV_VERSION "1.1"
+
+MODULE_AUTHOR("Boyd Faulkner, Steve Wise");
+MODULE_DESCRIPTION("Chelsio T3 RDMA Driver");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION(DRV_VERSION);
+
+cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+
+static void open_rnic_dev(struct t3cdev *);
+static void close_rnic_dev(struct t3cdev *);
+
+struct cxgb3_client t3c_client = {
+	.name = "iw_cxgb3",
+	.add = open_rnic_dev,
+	.remove = close_rnic_dev,
+	.handlers = t3c_handlers,
+	.redirect = iwch_ep_redirect
+};
+
+static LIST_HEAD(dev_list);
+static DEFINE_MUTEX(dev_mutex);
+
+static void rnic_init(struct iwch_dev *rnicp)
+{
+	PDBG("%s iwch_dev %p\n", __FUNCTION__,  rnicp);
+	idr_init(&rnicp->cqidr);
+	idr_init(&rnicp->qpidr);
+	idr_init(&rnicp->mmidr);
+	spin_lock_init(&rnicp->lock);
+
+	rnicp->attr.vendor_id = 0x168;
+	rnicp->attr.vendor_part_id = 7;
+	rnicp->attr.max_qps = T3_MAX_NUM_QP - 32;
+	rnicp->attr.max_wrs = (1UL << 24) - 1;
+	rnicp->attr.max_sge_per_wr = T3_MAX_SGE;
+	rnicp->attr.max_sge_per_rdma_write_wr = T3_MAX_SGE;
+	rnicp->attr.max_cqs = T3_MAX_NUM_CQ - 1;
+	rnicp->attr.max_cqes_per_cq = (1UL << 24) - 1;
+	rnicp->attr.max_mem_regs = cxio_num_stags(&rnicp->rdev);
+	rnicp->attr.max_phys_buf_entries = T3_MAX_PBL_SIZE;
+	rnicp->attr.max_pds = T3_MAX_NUM_PD - 1;
+	rnicp->attr.mem_pgsizes_bitmask = 0x7FFF;	/* 4KB-128MB */
+	rnicp->attr.can_resize_wq = 0;
+	rnicp->attr.max_rdma_reads_per_qp = 8;
+	rnicp->attr.max_rdma_read_resources =
+	    rnicp->attr.max_rdma_reads_per_qp * rnicp->attr.max_qps;
+	rnicp->attr.max_rdma_read_qp_depth = 8;	/* IRD */
+	rnicp->attr.max_rdma_read_depth =
+	    rnicp->attr.max_rdma_read_qp_depth * rnicp->attr.max_qps;
+	rnicp->attr.rq_overflow_handled = 0;
+	rnicp->attr.can_modify_ird = 0;
+	rnicp->attr.can_modify_ord = 0;
+	rnicp->attr.max_mem_windows = rnicp->attr.max_mem_regs - 1;
+	rnicp->attr.stag0_value = 1;
+	rnicp->attr.zbva_support = 1;
+	rnicp->attr.local_invalidate_fence = 1;
+	rnicp->attr.cq_overflow_detection = 1;
+	return;
+}
+
+static void open_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *rnicp;
+	static int vers_printed;
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	if (!vers_printed++)
+		printk(KERN_INFO MOD "Chelsio T3 RDMA Driver - version %s\n",
+		       DRV_VERSION);
+	rnicp = (struct iwch_dev *)ib_alloc_device(sizeof(*rnicp));
+	if (!rnicp) {
+		printk(KERN_ERR MOD "Cannot allocate ib device\n");
+		return;
+	}
+	rnicp->rdev.ulp = rnicp;
+	rnicp->rdev.t3cdev_p = tdev;
+
+	if (cxio_rdev_open(&rnicp->rdev)) {
+		printk(KERN_ERR MOD "Unable to open CXIO rdev\n");
+		ib_dealloc_device(&rnicp->ibdev);
+		return;
+	}
+
+	rnic_init(rnicp);
+
+	mutex_lock(&dev_mutex);
+	list_add_tail(&rnicp->entry, &dev_list);
+	mutex_unlock(&dev_mutex);
+
+	if (iwch_register_device(rnicp)) {
+		printk(KERN_ERR MOD "Unable to register device\n");
+		close_rnic_dev(tdev);
+		return;	/* rnicp was freed by close_rnic_dev() */
+	}
+	printk(KERN_INFO MOD "Initialized device %s\n",
+	       pci_name(rnicp->rdev.rnic_info.pdev));
+}
+
+static void close_rnic_dev(struct t3cdev *tdev)
+{
+	struct iwch_dev *dev, *tmp;
+	PDBG("%s t3cdev %p\n", __FUNCTION__,  tdev);
+	mutex_lock(&dev_mutex);
+	list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
+		if (dev->rdev.t3cdev_p == tdev) {
+			list_del(&dev->entry);
+			iwch_unregister_device(dev);
+			cxio_rdev_close(&dev->rdev);
+			idr_destroy(&dev->cqidr);
+			idr_destroy(&dev->qpidr);
+			idr_destroy(&dev->mmidr);
+			ib_dealloc_device(&dev->ibdev);
+			break;
+		}
+	}
+	mutex_unlock(&dev_mutex);
+}
+
+extern void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb);
+
+static int __init iwch_init_module(void)
+{
+	int err;
+
+	err = cxio_hal_init();
+	if (err)
+		return err;
+	err = iwch_cm_init();
+	if (err)
+		return err;
+	cxio_register_ev_cb(iwch_ev_dispatch);
+	cxgb3_register_client(&t3c_client);
+	return 0;
+}
+
+static void __exit iwch_exit_module(void)
+{
+	cxgb3_unregister_client(&t3c_client);
+	cxio_unregister_ev_cb(iwch_ev_dispatch);
+	iwch_cm_term();
+	cxio_hal_exit();
+}
+
+module_init(iwch_init_module);
+module_exit(iwch_exit_module);
diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
new file mode 100644
index 0000000..8b11198
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch.h
@@ -0,0 +1,175 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_H__
+#define __IWCH_H__
+
+#include <linux/mutex.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/idr.h>
+
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+
+struct iwch_pd;
+struct iwch_cq;
+struct iwch_qp;
+struct iwch_mr;
+
+struct iwch_rnic_attributes {
+	u32 vendor_id;
+	u32 vendor_part_id;
+	u32 max_qps;
+	u32 max_wrs;				/* Max for any SQ/RQ */
+	u32 max_sge_per_wr;
+	u32 max_sge_per_rdma_write_wr;	/* for RDMA Write WR */
+	u32 max_cqs;
+	u32 max_cqes_per_cq;
+	u32 max_mem_regs;
+	u32 max_phys_buf_entries;		/* for phys buf list */
+	u32 max_pds;
+
+	/*
+	 * The memory page sizes supported by this RNIC.
+	 * Bit position i in bitmap indicates page of
+	 * size 4KB << i.  Phys block list mode unsupported.
+	 */
+	u32 mem_pgsizes_bitmask;
+	u8 can_resize_wq;
+
+	/*
+	 * The maximum number of RDMA Reads that can be outstanding
+	 * per QP with this RNIC as the target.
+	 */
+	u32 max_rdma_reads_per_qp;
+
+	/*
+	 * The maximum number of resources used for RDMA Reads
+	 * by this RNIC with this RNIC as the target.
+	 */
+	u32 max_rdma_read_resources;
+
+	/*
+	 * The max depth per QP for initiation of RDMA Read
+	 * by this RNIC.
+	 */
+	u32 max_rdma_read_qp_depth;
+
+	/*
+	 * The maximum depth for initiation of RDMA Read
+	 * operations by this RNIC on all QPs
+	 */
+	u32 max_rdma_read_depth;
+	u8 rq_overflow_handled;
+	u32 can_modify_ird;
+	u32 can_modify_ord;
+	u32 max_mem_windows;
+	u32 stag0_value;
+	u8 zbva_support;
+	u8 local_invalidate_fence;
+	u32 cq_overflow_detection;
+};
+
+struct iwch_dev {
+	struct ib_device ibdev;
+	struct cxio_rdev rdev;
+	u32 device_cap_flags;
+	struct iwch_rnic_attributes attr;
+	struct idr cqidr;
+	struct idr qpidr;
+	struct idr mmidr;
+	spinlock_t lock;
+	struct list_head entry;
+};
+
+static inline struct iwch_dev *to_iwch_dev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct iwch_dev, ibdev);
+}
+
+static inline int t3b_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3B);
+}
+
+static inline int t3a_device(const struct iwch_dev *rhp)
+{
+	return (rhp->rdev.t3cdev_p->type == T3A);
+}
+
+static inline struct iwch_cq *get_chp(struct iwch_dev *rhp, u32 cqid)
+{
+	return idr_find(&rhp->cqidr, cqid);
+}
+
+static inline struct iwch_qp *get_qhp(struct iwch_dev *rhp, u32 qpid)
+{
+	return idr_find(&rhp->qpidr, qpid);
+}
+
+static inline struct iwch_mr *get_mhp(struct iwch_dev *rhp, u32 mmid)
+{
+	return idr_find(&rhp->mmidr, mmid);
+}
+
+static inline int insert_handle(struct iwch_dev *rhp, struct idr *idr,
+				void *handle, u32 id)
+{
+	int ret;
+	u32 newid;
+
+	do {
+		if (!idr_pre_get(idr, GFP_KERNEL)) {
+			return -ENOMEM;
+		}
+		spin_lock_irq(&rhp->lock);
+		ret = idr_get_new_above(idr, handle, id, &newid);
+		BUG_ON(newid != id);
+		spin_unlock_irq(&rhp->lock);
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+static inline void remove_handle(struct iwch_dev *rhp, struct idr *idr, u32 id)
+{
+	spin_lock_irq(&rhp->lock);
+	idr_remove(idr, id);
+	spin_unlock_irq(&rhp->lock);
+}
+
+extern struct cxgb3_client t3c_client;
+extern cxgb3_cpl_handler_func t3c_handlers[NUM_CPL_CMDS];
+#endif


From swise at opengridcomputing.com  Wed Dec 20 11:19:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:19:25 -0600
Subject: [openib-general] [PATCH v5 03/13] iw_cxgb3 Provider Methods and
	Data Structures
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220191925.19316.38974.stgit@dell3.ogc.int>


Provider methods to support the Linux RDMA verbs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
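
Reviewer note: the CQ sizing arithmetic in iwch_create_cq() is easy to
check in isolation. The sketch below mirrors it in userspace C:
optionally pad by 16 entries for T3A, round up to a power of two, store
the log2, and report (size - 1) as the usable CQE count. roundup_pow2()
and log2_floor() are simple stand-ins for the kernel's
roundup_pow_of_two() and ilog2(); struct cq_size and size_cq() are
hypothetical names.

```c
struct cq_size {
	unsigned int size_log2;	/* log2 of the allocated queue depth */
	unsigned int cqe;	/* value reported back as ibcq.cqe */
};

/* Smallest power of two >= n; assumes 1 <= n <= 2^31. */
static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* floor(log2(n)) for n >= 1. */
static unsigned int log2_floor(unsigned int n)
{
	unsigned int l = 0;

	while (n >>= 1)
		l++;
	return l;
}

/* Same steps as the create path: pad (T3A only), round up, derive cqe. */
static struct cq_size size_cq(unsigned int entries, int is_t3a)
{
	struct cq_size s;

	if (is_t3a)
		entries += 16;	/* headroom for extra error CQEs */
	entries = roundup_pow2(entries);
	s.size_log2 = log2_floor(entries);
	s.cqe = (1u << s.size_log2) - 1;
	return s;
}
```

For example, a request for 100 entries on a non-T3A device yields a
128-entry queue (size_log2 = 7) and a reported cqe of 127; a request
for 120 entries on T3A becomes 136, then 256 (size_log2 = 8, cqe 255).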

 drivers/infiniband/hw/cxgb3/iwch_provider.c | 1171 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  363 ++++++++
 drivers/infiniband/hw/cxgb3/iwch_user.h     |   68 ++
 3 files changed, 1602 insertions(+), 0 deletions(-)
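
Reviewer note: iwch_register_phys_mem() builds the TPT permission field
with four shift-and-mask lines. Taken together they reverse the order
of the low four access bits. The sketch below isolates that arithmetic
so it can be checked; tpt_perms_from_acc() is a hypothetical helper
name, and no meaning is assigned to the individual bits here.

```c
/* Reverse the low four bits of 'acc': bit 0 <-> bit 3, bit 1 <-> bit 2. */
static unsigned int tpt_perms_from_acc(unsigned int acc)
{
	unsigned int perms;

	perms  = (acc & 0x1) << 3;
	perms |= (acc & 0x2) << 1;
	perms |= (acc & 0x4) >> 1;
	perms |= (acc & 0x8) >> 3;
	return perms;
}
```

Because the mapping is a pure bit reversal, applying it twice returns
the original value, so the same helper converts in either direction.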

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
new file mode 100644
index 0000000..ab99202
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -0,0 +1,1171 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/ethtool.h>
+
+#include <asm/io.h>
+#include <asm/irq.h>
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include <cxio_hal.h>
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+#include "iwch_user.h"
+
+static int iwch_modify_port(struct ib_device *ibdev,
+			    u8 port, int port_modify_mask,
+			    struct ib_port_modify *props)
+{
+	return -ENOSYS;
+}
+
+static struct ib_ah *iwch_ah_create(struct ib_pd *pd,
+				    struct ib_ah_attr *ah_attr)
+{
+	return ERR_PTR(-ENOSYS);
+}
+
+static int iwch_ah_destroy(struct ib_ah *ah)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_multicast_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
+{
+	return -ENOSYS;
+}
+
+static int iwch_process_mad(struct ib_device *ibdev,
+			    int mad_flags,
+			    u8 port_num,
+			    struct ib_wc *in_wc,
+			    struct ib_grh *in_grh,
+			    struct ib_mad *in_mad, struct ib_mad *out_mad)
+{
+	return -ENOSYS;
+}
+
+static int iwch_dealloc_ucontext(struct ib_ucontext *context)
+{
+	struct iwch_dev *rhp = to_iwch_dev(context->device);
+	struct iwch_ucontext *ucontext = to_iwch_ucontext(context);
+	PDBG("%s context %p\n", __FUNCTION__, context);
+	cxio_release_ucontext(&rhp->rdev, &ucontext->uctx);
+	kfree(ucontext);
+	return 0;
+}
+
+static struct ib_ucontext *iwch_alloc_ucontext(struct ib_device *ibdev,
+					struct ib_udata *udata)
+{
+	struct iwch_ucontext *context;
+	struct iwch_dev *rhp = to_iwch_dev(ibdev);
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	context = kmalloc(sizeof(*context), GFP_KERNEL);
+	if (!context)
+		return ERR_PTR(-ENOMEM);
+	cxio_init_ucontext(&rhp->rdev, &context->uctx);
+	INIT_LIST_HEAD(&context->mmaps);
+	spin_lock_init(&context->mmap_lock);
+	return &context->ibucontext;
+}
+
+static int iwch_destroy_cq(struct ib_cq *ib_cq)
+{
+	struct iwch_cq *chp;
+
+	PDBG("%s ib_cq %p\n", __FUNCTION__, ib_cq);
+	chp = to_iwch_cq(ib_cq);
+
+	remove_handle(chp->rhp, &chp->rhp->cqidr, chp->cq.cqid);
+	atomic_dec(&chp->refcnt);
+	wait_event(chp->wait, !atomic_read(&chp->refcnt));
+
+	cxio_destroy_cq(&chp->rhp->rdev, &chp->cq);
+	kfree(chp);
+	return 0;
+}
+
+static struct ib_cq *iwch_create_cq(struct ib_device *ibdev, int entries,
+			     struct ib_ucontext *context,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	struct iwch_create_cq_resp uresp;
+
+	PDBG("%s ib_dev %p entries %d\n", __FUNCTION__, ibdev, entries);
+	rhp = to_iwch_dev(ibdev);
+	chp = kzalloc(sizeof(*chp), GFP_KERNEL);
+	if (!chp)
+		return ERR_PTR(-ENOMEM);
+
+	if (t3a_device(rhp)) {
+
+		/*
+		 * T3A: Add some fluff to handle extra CQEs inserted
+	 	 * for various errors.
+		 * Additional CQE possibilities:
+		 *      TERMINATE,
+		 *      incoming RDMA WRITE Failures
+		 *      incoming RDMA READ REQUEST FAILUREs
+		 * NOTE: We cannot ensure the CQ won't overflow.
+		 */
+		entries += 16;
+	}
+	entries = roundup_pow_of_two(entries);
+	chp->cq.size_log2 = ilog2(entries);
+
+	if (cxio_create_cq(&rhp->rdev, &chp->cq)) {
+		kfree(chp);
+		return ERR_PTR(-ENOMEM);
+	}
+	chp->rhp = rhp;
+	chp->ibcq.cqe = (1 << chp->cq.size_log2) - 1;
+	spin_lock_init(&chp->lock);
+	atomic_set(&chp->refcnt, 1);
+	init_waitqueue_head(&chp->wait);
+	insert_handle(rhp, &rhp->cqidr, chp, chp->cq.cqid);
+
+	if (context) {
+		struct iwch_mm_entry *mm;
+
+		mm = kmalloc(sizeof *mm, GFP_KERNEL);
+		if (!mm) {
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-ENOMEM);
+		}
+		uresp.cqid = chp->cq.cqid;
+		uresp.size_log2 = chp->cq.size_log2;
+		uresp.physaddr = virt_to_phys(chp->cq.queue);
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm);
+			iwch_destroy_cq(&chp->ibcq);
+			return ERR_PTR(-EFAULT);
+		}
+		mm->addr = uresp.physaddr;
+		mm->len = PAGE_ALIGN((1UL << uresp.size_log2) *
+					     sizeof (struct t3_cqe));
+		insert_mmap(to_iwch_ucontext(context), mm);
+	}
+	PDBG("created cqid 0x%0x chp %p size 0x%0x, dma_addr 0x%0llx\n",
+	     chp->cq.cqid, chp, (1 << chp->cq.size_log2),
+	     (u64)chp->cq.dma_addr);
+	return &chp->ibcq;
+}
+
+static int iwch_resize_cq(struct ib_cq *cq, int cqe, struct ib_udata *udata)
+{
+	struct iwch_cq *chp = to_iwch_cq(cq);
+	struct t3_cq oldcq, newcq;
+	int ret;
+
+	PDBG("%s ib_cq %p cqe %d\n", __FUNCTION__, cq, cqe);
+
+	/* We don't downsize... */
+	if (cqe <= cq->cqe)
+		return 0;
+
+	/* create new t3_cq with new size */
+	cqe = roundup_pow_of_two(cqe+1);
+	newcq.size_log2 = ilog2(cqe);
+
+	/* Don't allow resize to less than the current CQE count */
+	if (cqe < Q_COUNT(chp->cq.rptr, chp->cq.wptr)) {
+		return -ENOMEM;
+	}
+
+	/* Quiesce all QPs using this CQ */
+	ret = iwch_quiesce_qps(chp);
+	if (ret) {
+		return ret;
+	}
+
+	ret = cxio_create_cq(&chp->rhp->rdev, &newcq);
+	if (ret) {
+		/* chp is still in use by the caller; do not free it here */
+		return ret;
+	}
+	
+	/* copy CQEs */
+	memcpy(newcq.queue, chp->cq.queue, (1 << chp->cq.size_log2) *
+				        sizeof(struct t3_cqe));
+
+	/* old iwch_cq gets new t3_cq but keeps old cqid */
+	oldcq = chp->cq;
+	chp->cq = newcq;
+	chp->cq.cqid = oldcq.cqid;
+
+	/* resize new t3_cq to update the HW context */
+	ret = cxio_resize_cq(&chp->rhp->rdev, &chp->cq);
+	if (ret) {
+		chp->cq = oldcq;
+		return ret;
+	}
+	chp->ibcq.cqe = (1<<chp->cq.size_log2) - 1;
+
+	/* destroy old t3_cq */
+	oldcq.cqid = newcq.cqid;
+	ret = cxio_destroy_cq(&chp->rhp->rdev, &oldcq);
+	if (ret) {
+		printk(KERN_ERR MOD "%s - cxio_destroy_cq failed %d\n",
+			__FUNCTION__, ret);
+	}
+	
+	/* add user hooks here */
+
+	/* resume qps */
+	ret = iwch_resume_qps(chp);
+	return ret;
+}
+
+static int iwch_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
+		       struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	enum t3_cq_opcode cq_op;
+	int err;
+	unsigned long flag;
+	struct iwch_req_notify_cq ucmd;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+	if (notify == IB_CQ_SOLICITED)
+		cq_op = CQ_ARM_SE;
+	else
+		cq_op = CQ_ARM_AN;
+	if (udata && t3b_device(rhp)) {
+		if (ib_copy_from_udata(&ucmd, udata, sizeof ucmd))
+			return -EFAULT;
+		spin_lock_irqsave(&chp->lock, flag);
+		chp->cq.rptr = ucmd.rptr;
+	} else
+		spin_lock_irqsave(&chp->lock, flag);
+	PDBG("%s rptr 0x%x\n", __FUNCTION__, chp->cq.rptr);
+	err = cxio_hal_cq_op(&rhp->rdev, &chp->cq, cq_op, 0);
+	spin_unlock_irqrestore(&chp->lock, flag);
+	if (err)
+		printk(KERN_ERR MOD "Error %d rearming CQID 0x%x\n", err,
+		       chp->cq.cqid);
+	return err;
+}
+
+static int iwch_mmap(struct ib_ucontext *context, struct vm_area_struct *vma)
+{
+	int len = vma->vm_end - vma->vm_start;
+	u64 pgaddr = vma->vm_pgoff << PAGE_SHIFT;
+	struct cxio_rdev *rdev_p;
+	int ret = 0;
+	struct iwch_mm_entry *mm;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s off 0x%lx addr 0x%llx len %d\n", __FUNCTION__, vma->vm_pgoff,
+	     pgaddr, len);
+
+	if (vma->vm_start & (PAGE_SIZE-1)) {
+		return -EINVAL;
+	}
+
+	rdev_p = &(to_iwch_dev(context->device)->rdev);
+	ucontext = to_iwch_ucontext(context);
+
+	mm = remove_mmap(ucontext, pgaddr, len);
+	if (!mm)
+		return -EINVAL;
+	kfree(mm);
+
+	if ((pgaddr >= rdev_p->rnic_info.udbell_physbase) &&
+	    (pgaddr < (rdev_p->rnic_info.udbell_physbase +
+		       rdev_p->rnic_info.udbell_len))) {
+
+		/*
+		 * Map T3 DB register.
+		 */
+		if (vma->vm_flags & VM_READ) {
+			return -EPERM;
+		}
+
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND;
+		vma->vm_flags &= ~VM_MAYREAD;
+		ret = io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	} else {
+
+		/*
+		 * Map WQ or CQ contig dma memory...
+		 */
+		ret = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+				       len, vma->vm_page_prot);
+	}
+	
+	return ret;
+}
+
+static int iwch_deallocate_pd(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	PDBG("%s ibpd %p pdid 0x%x\n", __FUNCTION__, pd, php->pdid);
+	cxio_hal_put_pdid(rhp->rdev.rscp, php->pdid);
+	kfree(php);
+	return 0;
+}
+
+static struct ib_pd *iwch_allocate_pd(struct ib_device *ibdev,
+			       struct ib_ucontext *context,
+			       struct ib_udata *udata)
+{
+	struct iwch_pd *php;
+	u32 pdid;
+	struct iwch_dev *rhp;
+
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	rhp = (struct iwch_dev *) ibdev;
+	pdid = cxio_hal_get_pdid(rhp->rdev.rscp);
+	if (!pdid)
+		return ERR_PTR(-EINVAL);
+	php = kzalloc(sizeof(*php), GFP_KERNEL);
+	if (!php) {
+		cxio_hal_put_pdid(rhp->rdev.rscp, pdid);
+		return ERR_PTR(-ENOMEM);
+	}
+	php->pdid = pdid;
+	php->rhp = rhp;
+	if (context) {
+		if (ib_copy_to_udata(udata, &php->pdid, sizeof (__u32))) {
+			iwch_deallocate_pd(&php->ibpd);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+	PDBG("%s pdid 0x%0x ptr 0x%p\n", __FUNCTION__, pdid, php);
+	return &php->ibpd;
+}
+
+static int iwch_dereg_mr(struct ib_mr *ib_mr)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mr *mhp;
+	u32 mmid;
+
+	PDBG("%s ib_mr %p\n", __FUNCTION__, ib_mr);
+	/* There can be no memory windows */
+	if (atomic_read(&ib_mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(ib_mr);
+	rhp = mhp->rhp;
+	mmid = mhp->attr.stag >> 8;
+	cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
+		       mhp->attr.pbl_addr);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	if (mhp->kva)
+		kfree((void *) (unsigned long) mhp->kva);
+	PDBG("%s mmid 0x%x ptr %p\n", __FUNCTION__, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static struct ib_mr *iwch_register_phys_mem(struct ib_pd *pd,
+					struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					int acc,
+					u64 *iova_start)
+{
+	__be64 *page_list;
+	int shift;
+	u64 total_size;
+	int npages;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	int ret;
+		
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+
+	acc = iwch_convert_access(acc);
+
+	
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	/* First check that we have enough alignment */
+	if ((*iova_start & ~PAGE_MASK) != (buffer_list[0].addr & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	if (num_phys_buf > 1 &&
+	    ((buffer_list[0].addr + buffer_list[0].size) & ~PAGE_MASK)) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ret = build_phys_page_list(buffer_list, num_phys_buf, iova_start,
+			 	   &total_size, &npages, &shift, &page_list);
+	if (ret)
+		goto err;
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+
+	/* NOTE: TPT perms are backwards from BIND WR perms! */
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+
+	mhp->attr.va_fbo = *iova_start;
+	mhp->attr.page_size = shift - 12;
+
+	mhp->attr.len = (u32) total_size;
+	mhp->attr.pbl_size = npages;
+	ret = iwch_register_mem(rhp, php, mhp, shift, page_list);
+	kfree(page_list);
+	if (ret)
+		goto err;
+
+	return &mhp->ibmr;
+
+err:
+	kfree(mhp);
+	return ERR_PTR(ret);
+}
+
+static int iwch_reregister_phys_mem(struct ib_mr *mr,
+				     int mr_rereg_mask,
+				     struct ib_pd *pd,
+				     struct ib_phys_buf *buffer_list,
+				     int num_phys_buf,
+				     int acc, u64 *iova_start)
+{
+
+	struct iwch_mr mh, *mhp;
+	struct iwch_pd *php;
+	struct iwch_dev *rhp;
+	int new_acc;
+	__be64 *page_list = NULL;
+	int shift = 0;
+	u64 total_size;
+	int npages = 0;
+	int ret;
+
+	PDBG("%s ib_mr %p ib_pd %p\n", __FUNCTION__, mr, pd);
+
+	/* There can be no memory windows */
+	if (atomic_read(&mr->usecnt))
+		return -EINVAL;
+
+	mhp = to_iwch_mr(mr);
+	rhp = mhp->rhp;
+	php = to_iwch_pd(mr->pd);
+
+	/* make sure we are on the same adapter */
+	if (rhp != php->rhp)
+		return -EINVAL;
+
+	new_acc = mhp->attr.perms;
+
+	memcpy(&mh, mhp, sizeof *mhp);
+
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		php = to_iwch_pd(pd);
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mh.attr.perms = iwch_convert_access(acc);
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		ret = build_phys_page_list(buffer_list, num_phys_buf,
+					   iova_start, &total_size,
+					   &npages, &shift, &page_list);
+		if (ret)
+			return ret;
+	}
+	ret = iwch_reregister_mem(rhp, php, &mh, shift, page_list, npages);
+	kfree(page_list);
+	if (ret)
+		return ret;
+	if (mr_rereg_mask & IB_MR_REREG_PD)
+		mhp->attr.pdid = php->pdid;
+	if (mr_rereg_mask & IB_MR_REREG_ACCESS)
+		mhp->attr.perms = iwch_convert_access(acc);
+	if (mr_rereg_mask & IB_MR_REREG_TRANS) {
+		mhp->attr.zbva = 0;
+		mhp->attr.va_fbo = *iova_start;
+		mhp->attr.page_size = shift - 12;
+		mhp->attr.len = (u32) total_size;
+		mhp->attr.pbl_size = npages;
+	}
+
+	return 0;
+}
+
+
+struct ib_mr *iwch_reg_user_mr(struct ib_pd *pd, struct ib_umem *region,
+				      int acc, struct ib_udata *udata)
+{
+	__be64 *pages;
+	int shift, n, len;
+	int i, j, k;
+	int err = 0;
+	struct ib_umem_chunk *chunk;
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mr *mhp;
+	struct iwch_reg_user_mr_resp uresp;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	shift = ffs(region->page_size) - 1;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+
+	n = 0;
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		n += chunk->nents;
+
+	pages = kmalloc(n * sizeof(u64), GFP_KERNEL);
+	if (!pages) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	acc = iwch_convert_access(acc);
+
+	i = n = 0;
+
+	list_for_each_entry(chunk, &region->chunk_list, list)
+		for (j = 0; j < chunk->nmap; ++j) {
+			len = sg_dma_len(&chunk->page_list[j]) >> shift;
+			for (k = 0; k < len; ++k) {
+				pages[i++] = cpu_to_be64(sg_dma_address(
+					&chunk->page_list[j]) +
+					region->page_size * k);
+			}
+		}
+
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.zbva = 0;
+	mhp->attr.perms = (acc & 0x1) << 3;
+	mhp->attr.perms |= (acc & 0x2) << 1;
+	mhp->attr.perms |= (acc & 0x4) >> 1;
+	mhp->attr.perms |= (acc & 0x8) >> 3;
+	mhp->attr.va_fbo = region->virt_base;
+	mhp->attr.page_size = shift - 12;
+	mhp->attr.len = (u32) region->length;
+	mhp->attr.pbl_size = i;
+	err = iwch_register_mem(rhp, php, mhp, shift, pages);
+	kfree(pages);
+	if (err)
+		goto err;
+
+	if (udata && t3b_device(rhp)) {
+		uresp.pbl_addr = (mhp->attr.pbl_addr -
+				  rhp->rdev.rnic_info.pbl_base) >> 3;
+		PDBG("%s user resp pbl_addr 0x%x\n", __FUNCTION__,
+		     uresp.pbl_addr);
+
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			/* iwch_dereg_mr() frees mhp; do not free it again */
+			iwch_dereg_mr(&mhp->ibmr);
+			return ERR_PTR(-EFAULT);
+		}
+	}
+
+	return &mhp->ibmr;
+
+err:
+	kfree(mhp);
+	return ERR_PTR(err);
+}
+
+struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int acc)
+{
+	struct ib_phys_buf bl;
+	u64 kva;
+	struct ib_mr *ibmr;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+
+	/*
+	 * T3 only supports 32 bits of size.
+	 */
+	bl.size = 0xffffffff;
+	bl.addr = 0;
+	kva = 0;
+	ibmr = iwch_register_phys_mem(pd, &bl, 1, acc, &kva);
+	return ibmr;
+}
+
+struct ib_mw *iwch_alloc_mw(struct ib_pd *pd)
+{
+	struct iwch_dev *rhp;
+	struct iwch_pd *php;
+	struct iwch_mw *mhp;
+	u32 mmid;
+	u32 stag = 0;
+	int ret;
+
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	mhp = kzalloc(sizeof(*mhp), GFP_KERNEL);
+	if (!mhp)
+		return ERR_PTR(-ENOMEM);
+	ret = cxio_allocate_window(&rhp->rdev, &stag, php->pdid);
+	if (ret) {
+		kfree(mhp);
+		return ERR_PTR(ret);
+	}
+	mhp->rhp = rhp;
+	mhp->attr.pdid = php->pdid;
+	mhp->attr.type = TPT_MW;
+	mhp->attr.stag = stag;
+	mmid = (stag) >> 8;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p stag 0x%x\n", __FUNCTION__, mmid, mhp, stag);
+	return &(mhp->ibmw);
+}
+
+int iwch_dealloc_mw(struct ib_mw *mw)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	u32 mmid;
+
+	mhp = to_iwch_mw(mw);
+	rhp = mhp->rhp;
+	mmid = (mw->rkey) >> 8;
+	cxio_deallocate_window(&rhp->rdev, mhp->attr.stag);
+	remove_handle(rhp, &rhp->mmidr, mmid);
+	PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __FUNCTION__, mw, mmid, mhp);
+	kfree(mhp);
+	return 0;
+}
+
+static int iwch_destroy_qp(struct ib_qp *ib_qp)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_qp_attributes attrs;
+	struct iwch_ucontext *ucontext;
+
+	qhp = to_iwch_qp(ib_qp);
+	rhp = qhp->rhp;
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(rhp, qhp, IWCH_QP_ATTR_NEXT_STATE, &attrs, 0);
+	}
+	wait_event(qhp->wait, !qhp->ep);
+
+	remove_handle(rhp, &rhp->qpidr, qhp->wq.qpid);
+
+	atomic_dec(&qhp->refcnt);
+	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
+
+	ucontext = ib_qp->uobject ? to_iwch_ucontext(ib_qp->uobject->context)
+				  : NULL;
+	cxio_destroy_qp(&rhp->rdev, &qhp->wq,
+			ucontext ? &ucontext->uctx : &rhp->rdev.uctx);
+
+	PDBG("%s ib_qp %p qpid 0x%0x qhp %p\n", __FUNCTION__,
+	     ib_qp, qhp->wq.qpid, qhp);
+	kfree(qhp);
+	return 0;
+}
+
+static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
+			     struct ib_qp_init_attr *attrs,
+			     struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	struct iwch_pd *php;
+	struct iwch_cq *schp;
+	struct iwch_cq *rchp;
+	struct iwch_create_qp_resp uresp;
+	int wqsize, sqsize, rqsize;
+	struct iwch_ucontext *ucontext;
+
+	PDBG("%s ib_pd %p\n", __FUNCTION__, pd);
+	if (attrs->qp_type != IB_QPT_RC)
+		return ERR_PTR(-EINVAL);
+	php = to_iwch_pd(pd);
+	rhp = php->rhp;
+	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
+	rchp = get_chp(rhp, ((struct iwch_cq *) attrs->recv_cq)->cq.cqid);
+	if (!schp || !rchp)
+		return ERR_PTR(-EINVAL);
+
+	/* The RQT size must be # of entries + 1 rounded up to a power of two */
+	rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr);
+	if (rqsize == attrs->cap.max_recv_wr)
+		rqsize = roundup_pow_of_two(attrs->cap.max_recv_wr+1);
+
+	/* T3 doesn't support RQT depth < 16 */
+	if (rqsize < 16)
+		rqsize = 16;
+
+	if (rqsize > T3_MAX_RQ_SIZE)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * NOTE: The SQ and total WQ sizes don't need to be
+	 * a power of two.  However, all the code assumes
+	 * they are. EG: Q_FREECNT() and friends.
+	 */
+	sqsize = roundup_pow_of_two(attrs->cap.max_send_wr);
+	wqsize = roundup_pow_of_two(rqsize + sqsize);
+	PDBG("%s wqsize %d sqsize %d rqsize %d\n", __FUNCTION__,
+	     wqsize, sqsize, rqsize);
+	qhp = kzalloc(sizeof(*qhp), GFP_KERNEL);
+	if (!qhp)
+		return ERR_PTR(-ENOMEM);
+	qhp->wq.size_log2 = ilog2(wqsize);
+	qhp->wq.rq_size_log2 = ilog2(rqsize);
+	qhp->wq.sq_size_log2 = ilog2(sqsize);
+	ucontext = pd->uobject ? to_iwch_ucontext(pd->uobject->context) : NULL;
+	if (cxio_create_qp(&rhp->rdev, !udata, &qhp->wq,
+			   ucontext ? &ucontext->uctx : &rhp->rdev.uctx)) {
+		kfree(qhp);
+		return ERR_PTR(-ENOMEM);
+	}
+	attrs->cap.max_recv_wr = rqsize - 1;
+	attrs->cap.max_send_wr = sqsize;
+	qhp->rhp = rhp;
+	qhp->attr.pd = php->pdid;
+	qhp->attr.scq = ((struct iwch_cq *) attrs->send_cq)->cq.cqid;
+	qhp->attr.rcq = ((struct iwch_cq *) attrs->recv_cq)->cq.cqid;
+	qhp->attr.sq_num_entries = attrs->cap.max_send_wr;
+	qhp->attr.rq_num_entries = attrs->cap.max_recv_wr;
+	qhp->attr.sq_max_sges = attrs->cap.max_send_sge;
+	qhp->attr.sq_max_sges_rdma_write = attrs->cap.max_send_sge;
+	qhp->attr.rq_max_sges = attrs->cap.max_recv_sge;
+	qhp->attr.state = IWCH_QP_STATE_IDLE;
+	qhp->attr.next_state = IWCH_QP_STATE_IDLE;
+
+	/*
+	 * XXX - These don't get passed in from the openib user
+	 * at create time.  The CM sets them via a QP modify.
+	 * Need to fix...  I think the CM should
+	 */
+	qhp->attr.enable_rdma_read = 1;
+	qhp->attr.enable_rdma_write = 1;
+	qhp->attr.enable_bind = 1;
+	qhp->attr.max_ord = 1;
+	qhp->attr.max_ird = 1;
+
+	spin_lock_init(&qhp->lock);
+	init_waitqueue_head(&qhp->wait);
+	atomic_set(&qhp->refcnt, 1);
+	insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.qpid);
+
+	if (udata) {
+
+		struct iwch_mm_entry *mm1, *mm2;
+
+		mm1 = kmalloc(sizeof *mm1, GFP_KERNEL);
+		if (!mm1) {
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		mm2 = kmalloc(sizeof *mm2, GFP_KERNEL);
+		if (!mm2) {
+			kfree(mm1);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-ENOMEM);
+		}
+
+		uresp.qpid = qhp->wq.qpid;
+		uresp.size_log2 = qhp->wq.size_log2;
+		uresp.sq_size_log2 = qhp->wq.sq_size_log2;
+		uresp.rq_size_log2 = qhp->wq.rq_size_log2;
+		uresp.physaddr = virt_to_phys(qhp->wq.queue);
+		uresp.doorbell = qhp->wq.udb;
+		if (ib_copy_to_udata(udata, &uresp, sizeof (uresp))) {
+			kfree(mm1);
+			kfree(mm2);
+			iwch_destroy_qp(&qhp->ibqp);
+			return ERR_PTR(-EFAULT);
+		}
+		mm1->addr = uresp.physaddr;
+		mm1->len = PAGE_ALIGN(wqsize * sizeof (union t3_wr));
+		insert_mmap(ucontext, mm1);
+		mm2->addr = uresp.doorbell & PAGE_MASK;
+		mm2->len = PAGE_SIZE;
+		insert_mmap(ucontext, mm2);
+	}
+	qhp->ibqp.qp_num = qhp->wq.qpid;
+	init_timer(&(qhp->timer));
+	PDBG("%s sq_num_entries %d, rq_num_entries %d "
+	     "qpid 0x%0x qhp %p dma_addr 0x%llx size %d\n",
+	     __FUNCTION__, qhp->attr.sq_num_entries, qhp->attr.rq_num_entries,
+	     qhp->wq.qpid, qhp, (u64)qhp->wq.dma_addr, 1 << qhp->wq.size_log2);
+	return (&qhp->ibqp);
+}
+
+static int iwch_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		      int attr_mask, struct ib_udata *udata)
+{
+	struct iwch_dev *rhp;
+	struct iwch_qp *qhp;
+	enum iwch_qp_attr_mask mask = 0;
+	struct iwch_qp_attributes attrs;
+
+	PDBG("%s ib_qp %p\n", __FUNCTION__, ibqp);
+
+	/* iwarp does not support the RTR state */
+	if ((attr_mask & IB_QP_STATE) && (attr->qp_state == IB_QPS_RTR))
+		attr_mask &= ~IB_QP_STATE;
+
+	/* Make sure we still have something left to do */
+	if (!attr_mask)
+		return 0;
+
+	memset(&attrs, 0, sizeof attrs);
+	qhp = to_iwch_qp(ibqp);
+	rhp = qhp->rhp;
+
+	attrs.next_state = iwch_convert_state(attr->qp_state);
+	attrs.enable_rdma_read = (attr->qp_access_flags &
+			       IB_ACCESS_REMOTE_READ) ?  1 : 0;
+	attrs.enable_rdma_write = (attr->qp_access_flags &
+				IB_ACCESS_REMOTE_WRITE) ? 1 : 0;
+	attrs.enable_bind = (attr->qp_access_flags & IB_ACCESS_MW_BIND) ? 1 : 0;
+
+
+	mask |= (attr_mask & IB_QP_STATE) ? IWCH_QP_ATTR_NEXT_STATE : 0;
+	mask |= (attr_mask & IB_QP_ACCESS_FLAGS) ?
+			(IWCH_QP_ATTR_ENABLE_RDMA_READ |
+			 IWCH_QP_ATTR_ENABLE_RDMA_WRITE |
+			 IWCH_QP_ATTR_ENABLE_RDMA_BIND) : 0;
+
+	return iwch_modify_qp(rhp, qhp, mask, &attrs, 0);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	atomic_inc(&(to_iwch_qp(qp)->refcnt));
+}
+
+void iwch_qp_rem_ref(struct ib_qp *qp)
+{
+	PDBG("%s ib_qp %p\n", __FUNCTION__, qp);
+	if (atomic_dec_and_test(&(to_iwch_qp(qp)->refcnt)))
+		wake_up(&(to_iwch_qp(qp)->wait));
+}
+
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn)
+{
+	PDBG("%s ib_dev %p qpn 0x%x\n", __FUNCTION__, dev, qpn);
+	return (struct ib_qp *)get_qhp(to_iwch_dev(dev), qpn);
+}
+
+
+static int iwch_query_pkey(struct ib_device *ibdev,
+			   u8 port, u16 index, u16 * pkey)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	*pkey = 0;
+	return 0;
+}
+
+static int iwch_query_gid(struct ib_device *ibdev, u8 port,
+			  int index, union ib_gid *gid)
+{
+	struct iwch_dev *dev;
+
+	PDBG("%s ibdev %p, port %d, index %d, gid %p\n",
+	       __FUNCTION__, ibdev, port, index, gid);
+	dev = to_iwch_dev(ibdev);
+	BUG_ON(port == 0 || port > 2);
+	memset(&(gid->raw[0]), 0, sizeof(gid->raw));
+	memcpy(&(gid->raw[0]), dev->rdev.port_info.lldevs[port-1]->dev_addr, 6);
+	return 0;
+}
+
+static int iwch_query_device(struct ib_device *ibdev,
+			     struct ib_device_attr *props)
+{
+
+	struct iwch_dev *dev;
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+
+	dev = to_iwch_dev(ibdev);
+	memset(props, 0, sizeof *props);
+	memcpy(&props->sys_image_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	props->device_cap_flags = dev->device_cap_flags;
+	props->vendor_id = (u32)dev->rdev.rnic_info.pdev->vendor;
+	props->vendor_part_id = (u32)dev->rdev.rnic_info.pdev->device;
+	props->max_mr_size = ~0ull;
+	props->max_qp = dev->attr.max_qps;
+	props->max_qp_wr = dev->attr.max_wrs;
+	props->max_sge = dev->attr.max_sge_per_wr;
+	props->max_sge_rd = 1;
+	props->max_qp_rd_atom = dev->attr.max_rdma_reads_per_qp;
+	props->max_cq = dev->attr.max_cqs;
+	props->max_cqe = dev->attr.max_cqes_per_cq;
+	props->max_mr = dev->attr.max_mem_regs;
+	props->max_pd = dev->attr.max_pds;
+	props->local_ca_ack_delay = 0;
+
+	return 0;
+}
+
+static int iwch_query_port(struct ib_device *ibdev,
+			   u8 port, struct ib_port_attr *props)
+{
+	PDBG("%s ibdev %p\n", __FUNCTION__, ibdev);
+	props->max_mtu = IB_MTU_4096;
+	props->lid = 0;
+	props->lmc = 0;
+	props->sm_lid = 0;
+	props->sm_sl = 0;
+	props->state = IB_PORT_ACTIVE;
+	props->phys_state = 0;
+	props->port_cap_flags =
+	    IB_PORT_CM_SUP |
+	    IB_PORT_SNMP_TUNNEL_SUP |
+	    IB_PORT_REINIT_SUP |
+	    IB_PORT_DEVICE_MGMT_SUP |
+	    IB_PORT_VENDOR_CLASS_SUP | IB_PORT_BOOT_MGMT_SUP;
+	props->gid_tbl_len = 1;
+	props->pkey_tbl_len = 1;
+	props->qkey_viol_cntr = 0;
+	props->active_width = 2;
+	props->active_speed = 2;
+	props->max_msg_sz = -1;
+
+	return 0;
+}
+
+static ssize_t show_rev(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	return sprintf(buf, "%d\n", dev->rdev.t3cdev_p->type);
+}
+
+static ssize_t show_fw_ver(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.fw_version);
+}
+
+static ssize_t show_hca(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	struct ethtool_drvinfo info;
+	struct net_device *lldev = dev->rdev.t3cdev_p->lldev;
+
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, cdev);
+	lldev->ethtool_ops->get_drvinfo(lldev, &info);
+	return sprintf(buf, "%s\n", info.driver);
+}
+
+static ssize_t show_board(struct class_device *cdev, char *buf)
+{
+	struct iwch_dev *dev = container_of(cdev, struct iwch_dev,
+					    ibdev.class_dev);
+	PDBG("%s class dev 0x%p\n", __FUNCTION__, dev);
+	return sprintf(buf, "%x.%x\n", dev->rdev.rnic_info.pdev->vendor,
+		                       dev->rdev.rnic_info.pdev->device);
+}
+
+static CLASS_DEVICE_ATTR(hw_rev, S_IRUGO, show_rev, NULL);
+static CLASS_DEVICE_ATTR(fw_ver, S_IRUGO, show_fw_ver, NULL);
+static CLASS_DEVICE_ATTR(hca_type, S_IRUGO, show_hca, NULL);
+static CLASS_DEVICE_ATTR(board_id, S_IRUGO, show_board, NULL);
+
+static struct class_device_attribute *iwch_class_attributes[] = {
+	&class_device_attr_hw_rev,
+	&class_device_attr_fw_ver,
+	&class_device_attr_hca_type,
+	&class_device_attr_board_id
+};
+
+int iwch_register_device(struct iwch_dev *dev)
+{
+	int ret;
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
+	memset(&dev->ibdev.node_guid, 0, sizeof(dev->ibdev.node_guid));
+	memcpy(&dev->ibdev.node_guid, dev->rdev.t3cdev_p->lldev->dev_addr, 6);
+	dev->ibdev.owner = THIS_MODULE;
+	dev->device_cap_flags =
+	    (IB_DEVICE_ZERO_STAG |
+	     IB_DEVICE_SEND_W_INV | IB_DEVICE_MEM_WINDOW);
+
+	dev->ibdev.uverbs_cmd_mask =
+	    (1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+	    (1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+	    (1ull << IB_USER_VERBS_CMD_REG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
+	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
+	    (1ull << IB_USER_VERBS_CMD_POST_RECV);
+	dev->ibdev.node_type = RDMA_NODE_RNIC;
+	memcpy(dev->ibdev.node_desc, IWCH_NODE_DESC, sizeof(IWCH_NODE_DESC));
+	dev->ibdev.phys_port_cnt = dev->rdev.port_info.nports;
+	dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.class_dev.dev = &(dev->rdev.rnic_info.pdev->dev);
+	dev->ibdev.query_device = iwch_query_device;
+	dev->ibdev.query_port = iwch_query_port;
+	dev->ibdev.modify_port = iwch_modify_port;
+	dev->ibdev.query_pkey = iwch_query_pkey;
+	dev->ibdev.query_gid = iwch_query_gid;
+	dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
+	dev->ibdev.dealloc_ucontext = iwch_dealloc_ucontext;
+	dev->ibdev.mmap = iwch_mmap;
+	dev->ibdev.alloc_pd = iwch_allocate_pd;
+	dev->ibdev.dealloc_pd = iwch_deallocate_pd;
+	dev->ibdev.create_ah = iwch_ah_create;
+	dev->ibdev.destroy_ah = iwch_ah_destroy;
+	dev->ibdev.create_qp = iwch_create_qp;
+	dev->ibdev.modify_qp = iwch_ib_modify_qp;
+	dev->ibdev.destroy_qp = iwch_destroy_qp;
+	dev->ibdev.create_cq = iwch_create_cq;
+	dev->ibdev.destroy_cq = iwch_destroy_cq;
+	dev->ibdev.resize_cq = iwch_resize_cq;
+	dev->ibdev.poll_cq = iwch_poll_cq;
+	dev->ibdev.get_dma_mr = iwch_get_dma_mr;
+	dev->ibdev.reg_phys_mr = iwch_register_phys_mem;
+	dev->ibdev.rereg_phys_mr = iwch_reregister_phys_mem;
+	dev->ibdev.reg_user_mr = iwch_reg_user_mr;
+	dev->ibdev.dereg_mr = iwch_dereg_mr;
+	dev->ibdev.alloc_mw = iwch_alloc_mw;
+	dev->ibdev.bind_mw = iwch_bind_mw;
+	dev->ibdev.dealloc_mw = iwch_dealloc_mw;
+
+	dev->ibdev.attach_mcast = iwch_multicast_attach;
+	dev->ibdev.detach_mcast = iwch_multicast_detach;
+	dev->ibdev.process_mad = iwch_process_mad;
+
+	dev->ibdev.req_notify_cq = iwch_arm_cq;
+	dev->ibdev.post_send = iwch_post_send;
+	dev->ibdev.post_recv = iwch_post_receive;
+
+	dev->ibdev.iwcm = kmalloc(sizeof(struct iw_cm_verbs), GFP_KERNEL);
+	if (!dev->ibdev.iwcm)
+		return -ENOMEM;
+
+	dev->ibdev.iwcm->connect = iwch_connect;
+	dev->ibdev.iwcm->accept = iwch_accept_cr;
+	dev->ibdev.iwcm->reject = iwch_reject_cr;
+	dev->ibdev.iwcm->create_listen = iwch_create_listen;
+	dev->ibdev.iwcm->destroy_listen = iwch_destroy_listen;
+	dev->ibdev.iwcm->add_ref = iwch_qp_add_ref;
+	dev->ibdev.iwcm->rem_ref = iwch_qp_rem_ref;
+	dev->ibdev.iwcm->get_qp = iwch_get_qp;
+
+	ret = ib_register_device(&dev->ibdev);
+	if (ret)
+		goto bail1;
+
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i) {
+		ret = class_device_create_file(&dev->ibdev.class_dev,
+					       iwch_class_attributes[i]);
+		if (ret)
+			goto bail2;
+	}
+	return 0;
+bail2:
+	ib_unregister_device(&dev->ibdev);
+bail1:
+	kfree(dev->ibdev.iwcm);
+	return ret;
+}
+
+void iwch_unregister_device(struct iwch_dev *dev)
+{
+	int i;
+
+	PDBG("%s iwch_dev %p\n", __FUNCTION__, dev);
+	for (i = 0; i < ARRAY_SIZE(iwch_class_attributes); ++i)
+		class_device_remove_file(&dev->ibdev.class_dev,
+					 iwch_class_attributes[i]);
+	ib_unregister_device(&dev->ibdev);
+	kfree(dev->ibdev.iwcm);
+}
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h b/drivers/infiniband/hw/cxgb3/iwch_provider.h
new file mode 100644
index 0000000..f339427
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h
@@ -0,0 +1,363 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_PROVIDER_H__
+#define __IWCH_PROVIDER_H__
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <rdma/ib_verbs.h>
+#include <asm/types.h>
+#include "t3cdev.h"
+#include "iwch.h"
+#include "cxio_wr.h"
+#include "cxio_hal.h"
+
+struct iwch_pd {
+	struct ib_pd ibpd;
+	u32 pdid;
+	struct iwch_dev *rhp;
+};
+
+static inline struct iwch_pd *to_iwch_pd(struct ib_pd *ibpd)
+{
+	return container_of(ibpd, struct iwch_pd, ibpd);
+}
+
+struct tpt_attributes {
+	u32 stag;
+	u32 state:1;
+	u32 type:2;
+	u32 rsvd:1;
+	enum tpt_mem_perm perms;
+	u32 remote_invaliate_disable:1;
+	u32 zbva:1;
+	u32 mw_bind_enable:1;
+	u32 page_size:5;
+
+	u32 pdid;
+	u32 qpid;
+	u32 pbl_addr;
+	u32 len;
+	u64 va_fbo;
+	u32 pbl_size;
+};
+
+struct iwch_mr {
+	struct ib_mr ibmr;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+typedef struct iwch_mw iwch_mw_handle;
+
+static inline struct iwch_mr *to_iwch_mr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct iwch_mr, ibmr);
+}
+
+struct iwch_mw {
+	struct ib_mw ibmw;
+	struct iwch_dev *rhp;
+	u64 kva;
+	struct tpt_attributes attr;
+};
+
+static inline struct iwch_mw *to_iwch_mw(struct ib_mw *ibmw)
+{
+	return container_of(ibmw, struct iwch_mw, ibmw);
+}
+
+struct iwch_cq {
+	struct ib_cq ibcq;
+	struct iwch_dev *rhp;
+	struct t3_cq cq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+};
+
+static inline struct iwch_cq *to_iwch_cq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct iwch_cq, ibcq);
+}
+
+enum IWCH_QP_FLAGS {
+	QP_QUIESCED = 0x01
+};
+
+struct iwch_mpa_attributes {
+	u8 recv_marker_enabled;
+	u8 xmit_marker_enabled;	/* iWARP: enable inbound Read Resp. */
+	u8 crc_enabled;
+	u8 version;	/* 0 or 1 */
+};
+
+struct iwch_qp_attributes {
+	u32 scq;
+	u32 rcq;
+	u32 sq_num_entries;
+	u32 rq_num_entries;
+	u32 sq_max_sges;
+	u32 sq_max_sges_rdma_write;
+	u32 rq_max_sges;
+	u32 state;
+	u8 enable_rdma_read;
+	u8 enable_rdma_write;	/* enable inbound Read Resp. */
+	u8 enable_bind;
+	u8 enable_mmid0_fastreg;	/* Enable STAG0 + Fast-register */
+	/*
+	 * Next QP state (see next_state below). If the current state is
+	 * specified, only the QP attributes will be modified.
+	 */
+	u32 max_ord;
+	u32 max_ird;
+	u32 pd;	/* IN */
+	u32 next_state;
+	char terminate_buffer[52];
+	u32 terminate_msg_len;
+	u8 is_terminate_local;
+	struct iwch_mpa_attributes mpa_attr;	/* IN-OUT */
+	struct iwch_ep *llp_stream_handle;
+	char *stream_msg_buf;	/* Last stream msg. before Idle -> RTS */
+	u32 stream_msg_buf_len;	/* Only on Idle -> RTS */
+};
+
+struct iwch_qp {
+	struct ib_qp ibqp;
+	struct iwch_dev *rhp;
+	struct iwch_ep *ep;
+	struct iwch_qp_attributes attr;
+	struct t3_wq wq;
+	spinlock_t lock;
+	atomic_t refcnt;
+	wait_queue_head_t wait;
+	enum IWCH_QP_FLAGS flags;
+	struct timer_list timer;
+};
+
+static inline int qp_quiesced(struct iwch_qp *qhp)
+{
+	return (qhp->flags & QP_QUIESCED);
+}
+
+static inline struct iwch_qp *to_iwch_qp(struct ib_qp *ibqp)
+{
+	return container_of(ibqp, struct iwch_qp, ibqp);
+}
+
+void iwch_qp_add_ref(struct ib_qp *qp);
+void iwch_qp_rem_ref(struct ib_qp *qp);
+struct ib_qp *iwch_get_qp(struct ib_device *dev, int qpn);
+
+struct iwch_ucontext {
+	struct ib_ucontext ibucontext;
+	struct cxio_ucontext uctx;
+	spinlock_t mmap_lock;
+	struct list_head mmaps;
+};
+
+static inline struct iwch_ucontext *to_iwch_ucontext(struct ib_ucontext *c)
+{
+	return container_of(c, struct iwch_ucontext, ibucontext);
+}
+
+struct iwch_mm_entry {
+	struct list_head entry;
+	u64 addr;
+	unsigned len;
+};
+
+static inline struct iwch_mm_entry *remove_mmap(struct iwch_ucontext *ucontext,
+						u64 addr, unsigned len)
+{
+	struct list_head *pos, *nxt;
+	struct iwch_mm_entry *mm;
+
+	spin_lock_irq(&ucontext->mmap_lock);
+	list_for_each_safe(pos, nxt, &ucontext->mmaps) {
+
+		mm = list_entry(pos, struct iwch_mm_entry, entry);
+		if (mm->addr == addr && mm->len == len) {
+			list_del_init(&mm->entry);
+			spin_unlock_irq(&ucontext->mmap_lock);
+			PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr,
+			     mm->len);
+			return mm;
+		}
+	}
+	spin_unlock_irq(&ucontext->mmap_lock);
+	return NULL;
+}
+
+static inline void insert_mmap(struct iwch_ucontext *ucontext,
+			       struct iwch_mm_entry *mm)
+{
+	spin_lock_irq(&ucontext->mmap_lock);
+	PDBG("%s addr 0x%llx len %d\n", __FUNCTION__, mm->addr, mm->len);
+	list_add_tail(&mm->entry, &ucontext->mmaps);
+	spin_unlock_irq(&ucontext->mmap_lock);
+}
+
+enum iwch_qp_attr_mask {
+	IWCH_QP_ATTR_NEXT_STATE = 1 << 0,
+	IWCH_QP_ATTR_ENABLE_RDMA_READ = 1 << 7,
+	IWCH_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8,
+	IWCH_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9,
+	IWCH_QP_ATTR_MAX_ORD = 1 << 11,
+	IWCH_QP_ATTR_MAX_IRD = 1 << 12,
+	IWCH_QP_ATTR_LLP_STREAM_HANDLE = 1 << 22,
+	IWCH_QP_ATTR_STREAM_MSG_BUFFER = 1 << 23,
+	IWCH_QP_ATTR_MPA_ATTR = 1 << 24,
+	IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE = 1 << 25,
+	IWCH_QP_ATTR_VALID_MODIFY = (IWCH_QP_ATTR_ENABLE_RDMA_READ |
+				     IWCH_QP_ATTR_ENABLE_RDMA_WRITE |
+				     IWCH_QP_ATTR_MAX_ORD |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+				     IWCH_QP_ATTR_STREAM_MSG_BUFFER |
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_QP_CONTEXT_ACTIVATE)
+};
+
+int iwch_modify_qp(struct iwch_dev *rhp,
+				struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal);
+
+enum iwch_qp_state {
+	IWCH_QP_STATE_IDLE,
+	IWCH_QP_STATE_RTS,
+	IWCH_QP_STATE_ERROR,
+	IWCH_QP_STATE_TERMINATE,
+	IWCH_QP_STATE_CLOSING,
+	IWCH_QP_STATE_TOT
+};
+
+static inline int iwch_convert_state(enum ib_qp_state ib_state)
+{
+	switch (ib_state) {
+	case IB_QPS_RESET:
+	case IB_QPS_INIT:
+		return IWCH_QP_STATE_IDLE;
+	case IB_QPS_RTS:
+		return IWCH_QP_STATE_RTS;
+	case IB_QPS_SQD:
+		return IWCH_QP_STATE_CLOSING;
+	case IB_QPS_SQE:
+		return IWCH_QP_STATE_TERMINATE;
+	case IB_QPS_ERR:
+		return IWCH_QP_STATE_ERROR;
+	default:
+		return -1;
+	}
+}
+
+enum iwch_mem_perms {
+	IWCH_MEM_ACCESS_LOCAL_READ = 1 << 0,
+	IWCH_MEM_ACCESS_LOCAL_WRITE = 1 << 1,
+	IWCH_MEM_ACCESS_REMOTE_READ = 1 << 2,
+	IWCH_MEM_ACCESS_REMOTE_WRITE = 1 << 3,
+	IWCH_MEM_ACCESS_ATOMICS = 1 << 4,
+	IWCH_MEM_ACCESS_BINDING = 1 << 5,
+	IWCH_MEM_ACCESS_LOCAL =
+	    (IWCH_MEM_ACCESS_LOCAL_READ | IWCH_MEM_ACCESS_LOCAL_WRITE),
+	IWCH_MEM_ACCESS_REMOTE =
+	    (IWCH_MEM_ACCESS_REMOTE_WRITE | IWCH_MEM_ACCESS_REMOTE_READ)
+	    /* cannot go beyond 1 << 31 */
+} __attribute__ ((packed));
+
+static inline u32 iwch_convert_access(int acc)
+{
+	return (acc & IB_ACCESS_REMOTE_WRITE ? IWCH_MEM_ACCESS_REMOTE_WRITE : 0)
+	    | (acc & IB_ACCESS_REMOTE_READ ? IWCH_MEM_ACCESS_REMOTE_READ : 0) |
+	    (acc & IB_ACCESS_LOCAL_WRITE ? IWCH_MEM_ACCESS_LOCAL_WRITE : 0) |
+	    (acc & IB_ACCESS_MW_BIND ? IWCH_MEM_ACCESS_BINDING : 0) |
+	    IWCH_MEM_ACCESS_LOCAL_READ;
+}
+
+enum iwch_mmid_state {
+	IWCH_STAG_STATE_VALID,
+	IWCH_STAG_STATE_INVALID
+};
+
+enum iwch_qp_query_flags {
+	IWCH_QP_QUERY_CONTEXT_NONE = 0x0,	/* No ctx; Only attrs */
+	IWCH_QP_QUERY_CONTEXT_GET = 0x1,	/* Get ctx + attrs */
+	IWCH_QP_QUERY_CONTEXT_SUSPEND = 0x2,	/* Not Supported */
+
+	/*
+	 * Quiesce QP context; Consumer
+	 * will NOT replay outstanding WR
+	 */
+	IWCH_QP_QUERY_CONTEXT_QUIESCE = 0x4,
+	IWCH_QP_QUERY_CONTEXT_REMOVE = 0x8,
+	IWCH_QP_QUERY_TEST_USERWRITE = 0x32	/* Test special */
+};
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr);
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr);
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind);
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg);
+int iwch_register_device(struct iwch_dev *dev);
+void iwch_unregister_device(struct iwch_dev *dev);
+int iwch_quiesce_qps(struct iwch_cq *chp);
+int iwch_resume_qps(struct iwch_cq *chp);
+void stop_read_rep_timer(struct iwch_qp *qhp);
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list);
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages);
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list);
+
+
+#define IWCH_NODE_DESC "cxgb3 Chelsio Communications"
+
+#endif
diff --git a/drivers/infiniband/hw/cxgb3/iwch_user.h b/drivers/infiniband/hw/cxgb3/iwch_user.h
new file mode 100644
index 0000000..4e4b9c9
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_user.h
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __IWCH_USER_H__
+#define __IWCH_USER_H__
+
+#define IWCH_UVERBS_ABI_VERSION	1
+
+/*
+ * Make sure that all structs defined in this file remain laid out so
+ * that they pack the same way on 32-bit and 64-bit architectures (to
+ * avoid incompatibility between 32-bit userspace and 64-bit kernels).
+ * In particular do not use pointer types -- pass pointers in __u64
+ * instead.
+ */
+
+struct iwch_create_cq_resp {
+	__u64 physaddr;
+	__u32 cqid;
+	__u32 size_log2;
+};
+
+struct iwch_create_qp_resp {
+	__u64 physaddr;
+	__u64 doorbell;
+	__u32 qpid;
+	__u32 size_log2;
+	__u32 sq_size_log2;
+	__u32 rq_size_log2;
+};
+
+struct iwch_reg_user_mr_resp {
+	__u32 pbl_addr;
+};
+
+struct iwch_req_notify_cq {
+	__u32 rptr;
+};
+#endif

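Reviewer note on the layout comment in iwch_user.h above: the 32-bit/64-bit
packing rule can be checked at compile time. The following is only a
userspace sketch, not part of the patch. The abi_u64/abi_u32 typedefs are
stand-ins assumed here for the kernel's __u64/__u32, and the expected sizes
and offsets are derived from the struct definitions in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's __u64/__u32 (assumption for this sketch). */
typedef uint64_t abi_u64;
typedef uint32_t abi_u32;

struct iwch_create_cq_resp {
	abi_u64 physaddr;
	abi_u32 cqid;
	abi_u32 size_log2;
};

struct iwch_create_qp_resp {
	abi_u64 physaddr;
	abi_u64 doorbell;
	abi_u32 qpid;
	abi_u32 size_log2;
	abi_u32 sq_size_log2;
	abi_u32 rq_size_log2;
};

struct iwch_reg_user_mr_resp {
	abi_u32 pbl_addr;
};

/*
 * No pointers and no trailing padding: every 64-bit member comes first
 * and the 32-bit members come in pairs, so the layout is the same
 * whether the compiler aligns 64-bit integers to 4 or to 8 bytes.
 */
static_assert(sizeof(struct iwch_create_cq_resp) == 16, "cq resp size");
static_assert(offsetof(struct iwch_create_cq_resp, cqid) == 8, "cqid offset");
static_assert(sizeof(struct iwch_create_qp_resp) == 32, "qp resp size");
static_assert(offsetof(struct iwch_create_qp_resp, qpid) == 16, "qpid offset");
static_assert(sizeof(struct iwch_reg_user_mr_resp) == 4, "mr resp size");
```

Building this once with -m32 and once with -m64 should succeed in both
cases if the structs follow the rule stated in the header comment.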

From swise at opengridcomputing.com  Wed Dec 20 11:19:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:19:55 -0600
Subject: [openib-general] [PATCH  v5 04/13] iw_cxgb3 Connection Manager
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220191955.19316.52717.stgit@dell3.ogc.int>


This code implements the iWARP CM provider methods for the Chelsio driver.
The Chelsio ULLD is used to set up and tear down TCP connections, and the
T3 RDMA Core is used to move the connections in and out of RDMA mode.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
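Reviewer note: the MPA start-message handling in process_mpa_reply() and
process_mpa_request() accumulates TCP stream data until the full header
plus private data has arrived, validating as it goes. A userspace sketch
of that accumulate-and-validate pattern follows. It is not the driver
code: mpa_feed(), struct mpa_acc and enum mpa_rx are hypothetical names,
the byte offsets assume the RFC 5044 header layout (16-byte key, flags,
revision, big-endian private data length), and the 256-byte private data
limit is an assumption about MPA_MAX_PRIVATE_DATA.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MPA_KEY_LEN		16
#define MPA_HDR_LEN		20	/* key + flags + revision + be16 length */
#define MPA_MAX_PRIVATE_DATA	256	/* assumed limit for this sketch */
#define MPA_KEY_REP		"MPA ID Rep Frame"

enum mpa_rx { MPA_RX_MORE, MPA_RX_DONE, MPA_RX_BAD };

/* Accumulation buffer, playing the role of ep->mpa_pkt / ep->mpa_pkt_len. */
struct mpa_acc {
	uint8_t buf[MPA_HDR_LEN + MPA_MAX_PRIVATE_DATA];
	size_t len;
};

/* Feed one chunk of stream data; report whether the reply is complete. */
static enum mpa_rx mpa_feed(struct mpa_acc *a, const uint8_t *data,
			    size_t n, uint8_t rev)
{
	size_t plen;

	/* More data than the buffer supports: fail the connection. */
	if (a->len + n > sizeof(a->buf))
		return MPA_RX_BAD;
	memcpy(a->buf + a->len, data, n);
	a->len += n;

	/* Header not complete yet: wait for more data. */
	if (a->len < MPA_HDR_LEN)
		return MPA_RX_MORE;

	/* Validate revision (offset 17) and key (offset 0). */
	if (a->buf[17] != rev)
		return MPA_RX_BAD;
	if (memcmp(a->buf, MPA_KEY_REP, MPA_KEY_LEN))
		return MPA_RX_BAD;

	/* Private data length is big-endian at offsets 18..19. */
	plen = ((size_t)a->buf[18] << 8) | a->buf[19];
	if (plen > MPA_MAX_PRIVATE_DATA)
		return MPA_RX_BAD;

	/* More bytes than the header accounts for: protocol error. */
	if (a->len > MPA_HDR_LEN + plen)
		return MPA_RX_BAD;

	/* Private data still incomplete: wait for more data. */
	if (a->len < MPA_HDR_LEN + plen)
		return MPA_RX_MORE;
	return MPA_RX_DONE;
}
```

A caller would zero a struct mpa_acc once per connection and call
mpa_feed() for every received chunk, acting only on MPA_RX_DONE or
MPA_RX_BAD. The checks run in the same order as in process_mpa_reply().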

 drivers/infiniband/hw/cxgb3/iwch_cm.c | 2077 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/iwch_cm.h |  223 ++++
 drivers/infiniband/hw/cxgb3/tcb.h     |  603 ++++++++++
 3 files changed, 2903 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
new file mode 100644
index 0000000..69fcb59
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -0,0 +1,2077 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/module.h>
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/skbuff.h>
+#include <linux/timer.h>
+#include <linux/notifier.h>
+
+#include <net/neighbour.h>
+#include <net/netevent.h>
+#include <net/route.h>
+
+#include "tcb.h"
+#include "cxgb3_offload.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+#include "iwch_cm.h"
+
+char *states[] = {
+	"idle",
+	"listen",
+	"connecting",
+	"mpa_wait_req",
+	"mpa_req_sent",
+	"mpa_req_rcvd",
+	"mpa_rep_sent",
+	"fpdu_mode",
+	"aborting",
+	"closing",
+	"moribund",
+	"dead",
+	NULL,
+};
+
+static int ep_timeout_secs = 10;
+module_param(ep_timeout_secs, int, 0444);
+MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
+				   "in seconds (default=10)");
+
+static int mpa_rev = 1;
+module_param(mpa_rev, int, 0444);
+MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
+		 "1 is spec compliant. (default=1)");
+
+static int markers_enabled = 0;
+module_param(markers_enabled, int, 0444);
+MODULE_PARM_DESC(markers_enabled, "Enable MPA MARKERS (default(0)=disabled)");
+
+static int crc_enabled = 1;
+module_param(crc_enabled, int, 0444);
+MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
+
+static int rcv_win = 512 * 1024;
+module_param(rcv_win, int, 0444);
+MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=512KB)");
+
+static int snd_win = 512 * 1024;
+module_param(snd_win, int, 0444);
+MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=512KB)");
+
+static unsigned int nocong = 1;
+module_param(nocong, uint, 0444);
+MODULE_PARM_DESC(nocong, "Turn off congestion control (default=1)");
+
+static void process_work(struct work_struct *work);
+static struct workqueue_struct *workq;
+DECLARE_WORK(skb_work, process_work);
+
+static struct sk_buff_head rxq;
+static cxgb3_cpl_handler_func work_handlers[NUM_CPL_CMDS];
+
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp);
+static void ep_timeout(unsigned long arg);
+static void connect_reply_upcall(struct iwch_ep *ep, int status);
+
+static void start_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	if (timer_pending(&ep->timer)) {
+		PDBG("%s stopped / restarted timer ep %p\n", __FUNCTION__, ep);
+		del_timer_sync(&ep->timer);
+	} else
+		get_ep(&ep->com);
+	ep->timer.expires = jiffies + ep_timeout_secs * HZ;
+	ep->timer.data = (unsigned long)ep;
+	ep->timer.function = ep_timeout;
+	add_timer(&ep->timer);
+}
+
+static void stop_ep_timer(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	del_timer_sync(&ep->timer);
+	put_ep(&ep->com);
+}
+
+static void release_tid(struct t3cdev *tdev, u32 hwtid, struct sk_buff *skb)
+{
+	struct cpl_tid_release *req;
+
+	skb = get_skb(skb, sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return;
+	req = (struct cpl_tid_release *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, hwtid));
+	skb->priority = CPL_PRIORITY_SETUP;
+	tdev->send(tdev, skb);
+	return;
+}
+
+int iwch_quiesce_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+int iwch_resume_tid(struct iwch_ep *ep)
+{
+	struct cpl_set_tcb_field *req;
+	struct sk_buff *skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+
+	if (!skb)
+		return -ENOMEM;
+	req = (struct cpl_set_tcb_field *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, ep->hwtid));
+	req->reply = 0;
+	req->cpu_idx = 0;
+	req->word = htons(W_TCB_RX_QUIESCE);
+	req->mask = cpu_to_be64(1ULL << S_TCB_RX_QUIESCE);
+	req->val = 0;
+
+	skb->priority = CPL_PRIORITY_DATA;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static void set_emss(struct iwch_ep *ep, u16 opt)
+{
+	PDBG("%s ep %p opt %u\n", __FUNCTION__, ep, opt);
+	ep->emss = T3C_DATA(ep->com.tdev)->mtus[G_TCPOPT_MSS(opt)] - 40;
+	if (G_TCPOPT_TSTAMP(opt))
+		ep->emss -= 12;
+	if (ep->emss < 128)
+		ep->emss = 128;
+	PDBG("emss=%d\n", ep->emss);
+}
+
+static enum iwch_ep_state state_read(struct iwch_ep_common *epc)
+{
+	unsigned long flags;
+	enum iwch_ep_state state;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	state = epc->state;
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return state;
+}
+
+static inline void __state_set(struct iwch_ep_common *epc,
+			       enum iwch_ep_state new)
+{
+	epc->state = new;
+}
+
+static void state_set(struct iwch_ep_common *epc, enum iwch_ep_state new)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&epc->lock, flags);
+	PDBG("%s - %s -> %s\n", __FUNCTION__, states[epc->state], states[new]);
+	__state_set(epc, new);
+	spin_unlock_irqrestore(&epc->lock, flags);
+	return;
+}
+
+static void *alloc_ep(int size, gfp_t gfp)
+{
+	struct iwch_ep_common *epc;
+
+	epc = kmalloc(size, gfp);
+	if (epc) {
+		memset(epc, 0, size);
+		kref_init(&epc->kref);
+		spin_lock_init(&epc->lock);
+		init_waitqueue_head(&epc->waitq);
+	}
+	PDBG("%s alloc ep %p\n", __FUNCTION__, epc);
+	return (void *) epc;
+}
+
+void __free_ep(struct kref *kref)
+{
+	struct iwch_ep_common *epc;
+	epc = container_of(kref, struct iwch_ep_common, kref);
+	PDBG("%s ep %p state %s\n", __FUNCTION__, epc, states[state_read(epc)]);
+	kfree(epc);
+}
+
+static void release_ep_resources(struct iwch_ep *ep)
+{
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	cxgb3_remove_tid(ep->com.tdev, (void *)ep, ep->hwtid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, ep->hwtid, NULL);
+	put_ep(&ep->com);
+}
+
+static void process_work(struct work_struct *work)
+{
+	struct sk_buff *skb = NULL;
+	void *ep;
+	struct t3cdev *tdev;
+	int ret;
+
+	while ((skb = skb_dequeue(&rxq))) {
+		ep = *((void **) (skb->cb));
+		tdev = *((struct t3cdev **) (skb->cb + sizeof(void *)));
+		ret = work_handlers[G_OPCODE(ntohl((__force __be32)skb->csum))](tdev, skb, ep);
+		if (ret & CPL_RET_BUF_DONE)
+			kfree_skb(skb);
+
+		/*
+		 * ep was referenced in sched(); drop that reference here.
+		 */
+		put_ep((struct iwch_ep_common *)ep);
+	}
+}
+
+static int status2errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_NONE:
+		return 0;
+	case CPL_ERR_CONN_RESET:
+		return -ECONNRESET;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Try and reuse skbs already allocated...
+ */
+static struct sk_buff *get_skb(struct sk_buff *skb, int len, gfp_t gfp)
+{
+	if (skb) {
+		BUG_ON(skb_cloned(skb));
+		skb_trim(skb, 0);
+		skb_get(skb);
+	} else {
+		skb = alloc_skb(len, gfp);
+	}
+	return skb;
+}
+
+static struct rtable *find_route(struct t3cdev *dev, __be32 local_ip,
+				 __be32 peer_ip, __be16 local_port,
+				 __be16 peer_port, u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = 0,
+		.nl_u = {
+			 .ip4_u = {
+				   .daddr = peer_ip,
+				   .saddr = local_ip,
+				   .tos = tos}
+			 },
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			  .ports = {
+				    .sport = local_port,
+				    .dport = peer_port}
+			  }
+	};
+
+	if (ip_route_output_flow(&rt, &fl, NULL, 0))
+		return NULL;
+	return rt;
+}
+
+static unsigned int find_best_mtu(const struct t3c_data *d, unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu)
+		++i;
+	return i;
+}
+
+static void arp_failure_discard(struct t3cdev *dev, struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for an active open.
+ */
+static void act_open_req_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	printk(KERN_ERR MOD "ARP failure during connect\n");
+	kfree_skb(skb);
+}
+
+/*
+ * Handle an ARP failure for a CPL_ABORT_REQ.  Change it into a no RST variant
+ * and send it along.
+ */
+static void abort_arp_failure(struct t3cdev *dev, struct sk_buff *skb)
+{
+	struct cpl_abort_req *req = cplhdr(skb);
+
+	PDBG("%s t3cdev %p\n", __FUNCTION__, dev);
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb3_ofld_send(dev, skb);
+}
+
+static int send_halfclose(struct iwch_ep *ep, gfp_t gfp)
+{
+	struct cpl_close_con_req *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	req = (struct cpl_close_con_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, ep->hwtid));
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_abort(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
+{
+	struct cpl_abort_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(skb, sizeof(*req), gfp);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, abort_arp_failure);
+	req = (struct cpl_abort_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
+	req->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, ep->hwtid));
+	req->cmd = CPL_ABORT_SEND_RST;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_connect(struct iwch_ep *ep)
+{
+	struct cpl_act_open_req *req;
+	struct sk_buff *skb;
+	u32 opt0h, opt0l, opt2;
+	unsigned int mtu_idx;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb.\n",
+		       __FUNCTION__);
+		return -ENOMEM;
+	}
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+	skb->priority = CPL_PRIORITY_SETUP;
+	set_arp_failure_handler(skb, act_open_req_arp_failure);
+
+	req = (struct cpl_act_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, ep->atid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->peer_port = ep->com.remote_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_ip = ep->com.remote_addr.sin_addr.s_addr;
+	req->opt0h = htonl(opt0h);
+	req->opt0l = htonl(opt0l);
+	req->params = 0;
+	req->opt2 = htonl(opt2);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static void send_mpa_req(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+
+	PDBG("%s ep %p pd_len %d\n", __FUNCTION__, ep, ep->plen);
+
+	BUG_ON(skb_cloned(skb));
+
+	mpalen = sizeof(*mpa) + ep->plen;
+	if (skb->data + mpalen + sizeof(*req) > skb->end) {
+		kfree_skb(skb);
+		skb = alloc_skb(mpalen + sizeof(*req), GFP_KERNEL);
+		if (!skb) {
+			connect_reply_upcall(ep, -ENOMEM);
+			return;
+		}
+	}
+	skb_trim(skb, 0);
+	skb_reserve(skb, sizeof(*req));
+	skb_put(skb, mpalen);
+	skb->priority = CPL_PRIORITY_DATA;
+	mpa = (struct mpa_message *) skb->data;
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REQ, sizeof(mpa->key));
+	mpa->flags = (crc_enabled ? MPA_CRC : 0) |
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->private_data_size = htons(ep->plen);
+	mpa->revision = mpa_rev;
+
+	if (ep->plen)
+		memcpy(mpa->private_data, ep->mpa_pkt + sizeof(*mpa), ep->plen);
+
+	/*
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	start_ep_timer(ep);
+	state_set(&ep->com, MPA_REQ_SENT);
+	return;
+}
+
+static int send_mpa_reject(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = MPA_REJECT;
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/*
+	 * Reference the mpa skb again.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	skb->priority = CPL_PRIORITY_DATA;
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(mpalen);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_IMM_ACK|F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	BUG_ON(ep->mpa_skb);
+	ep->mpa_skb = skb;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int send_mpa_reply(struct iwch_ep *ep, const void *pdata, u8 plen)
+{
+	int mpalen;
+	struct tx_data_wr *req;
+	struct mpa_message *mpa;
+	int len;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p plen %d\n", __FUNCTION__, ep, plen);
+
+	mpalen = sizeof(*mpa) + plen;
+
+	skb = get_skb(NULL, mpalen + sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - cannot alloc skb!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	skb->priority = CPL_PRIORITY_DATA;
+	skb_reserve(skb, sizeof(*req));
+	mpa = (struct mpa_message *) skb_put(skb, mpalen);
+	memset(mpa, 0, sizeof(*mpa));
+	memcpy(mpa->key, MPA_KEY_REP, sizeof(mpa->key));
+	mpa->flags = (ep->mpa_attr.crc_enabled ? MPA_CRC : 0) |
+		     (markers_enabled ? MPA_MARKERS : 0);
+	mpa->revision = mpa_rev;
+	mpa->private_data_size = htons(plen);
+	if (plen)
+		memcpy(mpa->private_data, pdata, plen);
+
+	/*
+	 * Reference the mpa skb.  This ensures the data area
+	 * will remain in memory until the hw acks the tx.
+	 * Function tx_ack() will deref it.
+	 */
+	skb_get(skb);
+	set_arp_failure_handler(skb, arp_failure_discard);
+	skb->h.raw = skb->data;
+	len = skb->len;
+	req = (struct tx_data_wr *) skb_push(skb, sizeof(*req));
+	req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA));
+	req->wr_lo = htonl(V_WR_TID(ep->hwtid));
+	req->len = htonl(len);
+	req->param = htonl(V_TX_PORT(ep->l2t->smt_idx) |
+			   V_TX_SNDBUF(snd_win>>15));
+	req->flags = htonl(F_TX_MORE | F_TX_IMM_ACK | F_TX_INIT);
+	req->sndseq = htonl(ep->snd_seq);
+	ep->mpa_skb = skb;
+	state_set(&ep->com, MPA_REP_SENT);
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+	return 0;
+}
+
+static int act_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_establish *req = cplhdr(skb);
+	unsigned int tid = GET_TID(req);
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, tid);
+
+	dst_confirm(ep->dst);
+
+	/* setup the hwtid for this connection */
+	ep->hwtid = tid;
+	cxgb3_insert_tid(ep->com.tdev, &t3c_client, ep, tid);
+
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	/* dealloc the atid */
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+
+	/* start MPA negotiation */
+	send_mpa_req(ep, skb);
+
+	return 0;
+}
+
+static void abort_connection(struct iwch_ep *ep, struct sk_buff *skb, gfp_t gfp)
+{
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	state_set(&ep->com, ABORTING);
+	send_abort(ep, skb, gfp);
+}
+
+static void close_complete_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	if (ep->com.cm_id) {
+		PDBG("close complete delivered ep %p cm_id %p tid %d\n",
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void peer_close_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_DISCONNECT;
+	if (ep->com.cm_id) {
+		PDBG("peer close delivered ep %p cm_id %p tid %d\n",
+		     ep, ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static void peer_abort_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CLOSE;
+	event.status = -ECONNRESET;
+	if (ep->com.cm_id) {
+		PDBG("abort delivered ep %p cm_id %p tid %d\n", ep,
+		     ep->com.cm_id, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_reply_upcall(struct iwch_ep *ep, int status)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p status %d\n", __FUNCTION__, ep, status);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REPLY;
+	event.status = status;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+
+	if ((status == 0) || (status == -ECONNREFUSED)) {
+		event.private_data_len = ep->plen;
+		event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	}
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d status %d\n", __FUNCTION__, ep,
+		     ep->hwtid, status);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+	if (status < 0 && ep->com.cm_id) {
+		ep->com.cm_id->rem_ref(ep->com.cm_id);
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+	}
+}
+
+static void connect_request_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_CONNECT_REQUEST;
+	event.local_addr = ep->com.local_addr;
+	event.remote_addr = ep->com.remote_addr;
+	event.private_data_len = ep->plen;
+	event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
+	event.provider_data = ep;
+	if (state_read(&ep->parent_ep->com) != DEAD)
+		ep->parent_ep->com.cm_id->event_handler(
+						ep->parent_ep->com.cm_id,
+						&event);
+	put_ep(&ep->parent_ep->com);
+	ep->parent_ep = NULL;
+}
+
+static void established_upcall(struct iwch_ep *ep)
+{
+	struct iw_cm_event event;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	memset(&event, 0, sizeof(event));
+	event.event = IW_CM_EVENT_ESTABLISHED;
+	if (ep->com.cm_id) {
+		PDBG("%s ep %p tid %d\n", __FUNCTION__, ep, ep->hwtid);
+		ep->com.cm_id->event_handler(ep->com.cm_id, &event);
+	}
+}
+
+static int update_rx_credits(struct iwch_ep *ep, u32 credits)
+{
+	struct cpl_rx_data_ack *req;
+	struct sk_buff *skb;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "update_rx_credits - cannot alloc skb!\n");
+		return 0;
+	}
+
+	req = (struct cpl_rx_data_ack *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, ep->hwtid));
+	req->credit_dack = htonl(V_RX_CREDITS(credits) | V_RX_FORCE_ACK(1));
+	skb->priority = CPL_PRIORITY_ACK;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return credits;
+}
+
+static void process_mpa_reply(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	int err;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/*
+	 * Stop mpa timer.  If it expired, then the state has
+	 * changed and we bail since ep_timeout already aborted
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) != MPA_REQ_SENT)
+		return;
+
+	/*
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		err = -EINVAL;
+		goto err;
+	}
+
+	/*
+	 * copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/*
+	 * if we don't even have the mpa message, then bail.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/* Validate MPA header. */
+	if (mpa->revision != mpa_rev) {
+		err = -EPROTO;
+		goto err;
+	}
+	if (memcmp(mpa->key, MPA_KEY_REP, sizeof(mpa->key))) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/*
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	/*
+	 * Fail if plen does not account for the packet size.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		err = -EPROTO;
+		goto err;
+	}
+
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	if (mpa->flags & MPA_REJECT) {
+		err = -ECONNREFUSED;
+		goto err;
+	}
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start reply message including private data. And
+	 * the MPA header is valid.
+	 */
+	state_set(&ep->com, FPDU_MODE);
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ird;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+	    IWCH_QP_ATTR_LLP_STREAM_HANDLE | IWCH_QP_ATTR_MPA_ATTR |
+	    IWCH_QP_ATTR_MAX_IRD | IWCH_QP_ATTR_MAX_ORD;
+
+	/* bind QP and TID with INIT_WR */
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+	if (!err)
+		goto out;
+err:
+	abort_connection(ep, skb, GFP_KERNEL);
+out:
+	connect_reply_upcall(ep, err);
+	return;
+}
+
+static void process_mpa_request(struct iwch_ep *ep, struct sk_buff *skb)
+{
+	struct mpa_message *mpa;
+	u16 plen;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	/*
+	 * Stop mpa timer.  If it expired, then the state has
+	 * changed and we bail since ep_timeout already aborted
+	 * the connection.
+	 */
+	stop_ep_timer(ep);
+	if (state_read(&ep->com) != MPA_REQ_WAIT)
+		return;
+
+	/*
+	 * If we get more than the supported amount of private data
+	 * then we must fail this connection.
+	 */
+	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
+		abort_connection(ep, skb, GFP_KERNEL);
+		return;
+	}
+
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+
+	/*
+	 * Copy the new data into our accumulation buffer.
+	 */
+	memcpy(&(ep->mpa_pkt[ep->mpa_pkt_len]), skb->data, skb->len);
+	ep->mpa_pkt_len += skb->len;
+
+	/*
+	 * If we don't even have the mpa message, then bail.
+	 * We'll continue processing when more data arrives.
+	 */
+	if (ep->mpa_pkt_len < sizeof(*mpa))
+		return;
+	PDBG("%s enter (%s line %u)\n", __FUNCTION__, __FILE__, __LINE__);
+	mpa = (struct mpa_message *) ep->mpa_pkt;
+
+	/*
+	 * Validate MPA Header.
+	 */
+	if (mpa->revision != mpa_rev) {
+		abort_connection(ep, skb, GFP_KERNEL);
+		return;
+	}
+
+	if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
+		abort_connection(ep, skb, GFP_KERNEL);
+		return;
+	}
+
+	plen = ntohs(mpa->private_data_size);
+
+	/*
+	 * Fail if there's too much private data.
+	 */
+	if (plen > MPA_MAX_PRIVATE_DATA) {
+		abort_connection(ep, skb, GFP_KERNEL);
+		return;
+	}
+
+	/*
+	 * Fail if plen does not account for the packet size.
+	 */
+	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
+		abort_connection(ep, skb, GFP_KERNEL);
+		return;
+	}
+	ep->plen = (u8) plen;
+
+	/*
+	 * If we don't have all the pdata yet, then bail.
+	 */
+	if (ep->mpa_pkt_len < (sizeof(*mpa) + plen))
+		return;
+
+	/*
+	 * If we get here we have accumulated the entire mpa
+	 * start request message including private data.
+	 */
+	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
+	ep->mpa_attr.recv_marker_enabled = markers_enabled;
+	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
+	ep->mpa_attr.version = mpa_rev;
+	PDBG("%s - crc_enabled=%d, recv_marker_enabled=%d, "
+	     "xmit_marker_enabled=%d, version=%d\n", __FUNCTION__,
+	     ep->mpa_attr.crc_enabled, ep->mpa_attr.recv_marker_enabled,
+	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version);
+
+	state_set(&ep->com, MPA_REQ_RCVD);
+
+	/* drive upcall */
+	connect_request_upcall(ep);
+	return;
+}
+
+static int rx_data(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_rx_data *hdr = cplhdr(skb);
+	unsigned int dlen = ntohs(hdr->len);
+
+	PDBG("%s ep %p dlen %u\n", __FUNCTION__, ep, dlen);
+
+	skb_pull(skb, sizeof(*hdr));
+	skb_trim(skb, dlen);
+
+	switch (state_read(&ep->com)) {
+	case MPA_REQ_SENT:
+		process_mpa_reply(ep, skb);
+		break;
+	case MPA_REQ_WAIT:
+		process_mpa_request(ep, skb);
+		break;
+	case MPA_REP_SENT:
+		break;
+	default:
+		printk(KERN_ERR MOD "%s Unexpected streaming data."
+		       " ep %p state %d tid %d\n",
+		       __FUNCTION__, ep, state_read(&ep->com), ep->hwtid);
+
+		/*
+		 * The ep will time out and inform the ULP of the failure.
+		 * See ep_timeout().
+		 */
+		break;
+	}
+
+	/* update RX credits */
+	update_rx_credits(ep, dlen);
+
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Upcall from the adapter indicating data has been transmitted.
+ * For us it's just the single MPA request or reply.  We can now free
+ * the skb holding the mpa message.
+ */
+static int tx_ack(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_wr_ack *hdr = cplhdr(skb);
+	unsigned int credits = ntohs(hdr->credits);
+	enum iwch_qp_attr_mask mask;
+
+	PDBG("%s ep %p credits %u\n", __FUNCTION__, ep, credits);
+
+	if (credits == 0)
+		return CPL_RET_BUF_DONE;
+	BUG_ON(credits != 1);
+	BUG_ON(ep->mpa_skb == NULL);
+	kfree_skb(ep->mpa_skb);
+	ep->mpa_skb = NULL;
+	dst_confirm(ep->dst);
+	if (state_read(&ep->com) == MPA_REP_SENT) {
+		struct iwch_qp_attributes attrs;
+
+		/* bind QP to EP and move to RTS */
+		attrs.mpa_attr = ep->mpa_attr;
+		attrs.max_ird = ep->ord;
+		attrs.max_ord = ep->ord;
+		attrs.llp_stream_handle = ep;
+		attrs.next_state = IWCH_QP_STATE_RTS;
+
+		/* bind QP and TID with INIT_WR */
+		mask = IWCH_QP_ATTR_NEXT_STATE |
+				     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+				     IWCH_QP_ATTR_MPA_ATTR |
+				     IWCH_QP_ATTR_MAX_IRD |
+				     IWCH_QP_ATTR_MAX_ORD;
+
+		ep->com.rpl_err = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, mask, &attrs, 1);
+
+		if (!ep->com.rpl_err) {
+			state_set(&ep->com, FPDU_MODE);
+			established_upcall(ep);
+		}
+
+		ep->com.rpl_done = 1;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static int abort_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	close_complete_upcall(ep);
+	state_set(&ep->com, DEAD);
+	release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+static int act_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_act_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %u errno %d\n", __FUNCTION__, ep, rpl->status,
+	     status2errno(rpl->status));
+	connect_reply_upcall(ep, status2errno(rpl->status));
+	state_set(&ep->com, DEAD);
+	if (ep->com.tdev->type == T3B)
+		release_tid(ep->com.tdev, GET_TID(rpl), NULL);
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+	dst_release(ep->dst);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	put_ep(&ep->com);
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_start(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_pass_open_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+
+	req = (struct cpl_pass_open_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_PASS_OPEN_REQ, ep->stid));
+	req->local_port = ep->com.local_addr.sin_port;
+	req->local_ip = ep->com.local_addr.sin_addr.s_addr;
+	req->peer_port = 0;
+	req->peer_ip = 0;
+	req->peer_netmask = 0;
+	req->opt0h = htonl(F_DELACK | F_TCAM_BYPASS);
+	req->opt0l = htonl(V_RCV_BUFSIZ(rcv_win>>10));
+	req->opt1 = htonl(V_CONN_POLICY(CPL_CONN_POLICY_ASK));
+
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int pass_open_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_pass_open_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p status %d error %d\n", __FUNCTION__, ep,
+	     rpl->status, status2errno(rpl->status));
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int listen_stop(struct iwch_listen_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_close_listserv_req *req;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (!skb) {
+		printk(KERN_ERR MOD "%s - failed to alloc skb\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	req = (struct cpl_close_listserv_req *) skb_put(skb, sizeof(*req));
+	req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_LISTSRV_REQ, ep->stid));
+	skb->priority = 1;
+	ep->com.tdev->send(ep->com.tdev, skb);
+	return 0;
+}
+
+static int close_listsrv_rpl(struct t3cdev *tdev, struct sk_buff *skb,
+			     void *ctx)
+{
+	struct iwch_listen_ep *ep = ctx;
+	struct cpl_close_listserv_rpl *rpl = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.rpl_err = status2errno(rpl->status);
+	ep->com.rpl_done = 1;
+	wake_up(&ep->com.waitq);
+	return CPL_RET_BUF_DONE;
+}
+
+static void accept_cr(struct iwch_ep *ep, __be32 peer_ip, struct sk_buff *skb)
+{
+	struct cpl_pass_accept_rpl *rpl;
+	unsigned int mtu_idx;
+	u32 opt0h, opt0l, opt2;
+	int wscale;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(*rpl));
+	skb_get(skb);
+	mtu_idx = find_best_mtu(T3C_DATA(ep->com.tdev), dst_mtu(ep->dst));
+	wscale = compute_wscale(rcv_win);
+	opt0h = V_NAGLE(0) |
+	    V_NO_CONG(nocong) |
+	    V_KEEP_ALIVE(1) |
+	    F_TCAM_BYPASS |
+	    V_WND_SCALE(wscale) |
+	    V_MSS_IDX(mtu_idx) |
+	    V_L2T_IDX(ep->l2t->idx) | V_TX_CHANNEL(ep->l2t->smt_idx);
+	opt0l = V_TOS((ep->tos >> 2) & M_TOS) | V_RCV_BUFSIZ(rcv_win>>10);
+	opt2 = V_FLAVORS_VALID(0) | V_CONG_CONTROL_FLAVOR(0);
+
+	rpl = cplhdr(skb);
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, ep->hwtid));
+	rpl->peer_ip = peer_ip;
+	rpl->opt0h = htonl(opt0h);
+	rpl->opt0l_status = htonl(opt0l | CPL_PASS_OPEN_ACCEPT);
+	rpl->opt2 = htonl(opt2);
+	rpl->rsvd = rpl->opt2;	/* workaround for HW bug */
+	skb->priority = CPL_PRIORITY_SETUP;
+	l2t_send(ep->com.tdev, skb, ep->l2t);
+
+	return;
+}
+
+static void reject_cr(struct t3cdev *tdev, u32 hwtid, __be32 peer_ip,
+		      struct sk_buff *skb)
+{
+	PDBG("%s t3cdev %p tid %u peer_ip %x\n", __FUNCTION__, tdev, hwtid,
+	     peer_ip);
+	BUG_ON(skb_cloned(skb));
+	skb_trim(skb, sizeof(struct cpl_tid_release));
+	skb_get(skb);
+
+	if (tdev->type == T3B)
+		release_tid(tdev, hwtid, skb);
+	else {
+		struct cpl_pass_accept_rpl *rpl;
+
+		rpl = cplhdr(skb);
+		skb->priority = CPL_PRIORITY_SETUP;
+		rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
+		OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL,
+						      hwtid));
+		rpl->peer_ip = peer_ip;
+		rpl->opt0h = htonl(F_TCAM_BYPASS);
+		rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT);
+		rpl->opt2 = 0;
+		rpl->rsvd = rpl->opt2;
+		tdev->send(tdev, skb);
+	}
+}
+
+static int pass_accept_req(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *child_ep, *parent_ep = ctx;
+	struct cpl_pass_accept_req *req = cplhdr(skb);
+	unsigned int hwtid = GET_TID(req);
+	struct dst_entry *dst;
+	struct l2t_entry *l2t;
+	struct rtable *rt;
+	struct iff_mac tim;
+
+	PDBG("%s parent ep %p tid %u\n", __FUNCTION__, parent_ep, hwtid);
+
+	if (state_read(&parent_ep->com) != LISTEN) {
+		printk(KERN_ERR MOD "%s - listening ep not in LISTEN\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+
+	/*
+	 * Find the netdev for this connection request.
+	 */
+	tim.mac_addr = req->dst_mac;
+	tim.vlan_tag = ntohs(req->vlan_tag);
+	if (tdev->ctl(tdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) {
+		printk(KERN_ERR MOD
+			"%s bad dst mac %02x %02x %02x %02x %02x %02x\n",
+			__FUNCTION__,
+			req->dst_mac[0],
+			req->dst_mac[1],
+			req->dst_mac[2],
+			req->dst_mac[3],
+			req->dst_mac[4],
+			req->dst_mac[5]);
+		goto reject;
+	}
+
+	/* Find output route */
+	rt = find_route(tdev,
+			req->local_ip,
+			req->peer_ip,
+			req->local_port,
+			req->peer_port, G_PASS_OPEN_TOS(ntohl(req->tos_tid)));
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - failed to find dst entry!\n",
+		       __FUNCTION__);
+		goto reject;
+	}
+	dst = &rt->u.dst;
+	l2t = t3_l2t_get(tdev, dst->neighbour, dst->neighbour->dev);
+	if (!l2t) {
+		printk(KERN_ERR MOD "%s - failed to allocate l2t entry!\n",
+		       __FUNCTION__);
+		dst_release(dst);
+		goto reject;
+	}
+	child_ep = alloc_ep(sizeof(*child_ep), GFP_KERNEL);
+	if (!child_ep) {
+		printk(KERN_ERR MOD "%s - failed to allocate ep entry!\n",
+		       __FUNCTION__);
+		l2t_release(L2DATA(tdev), l2t);
+		dst_release(dst);
+		goto reject;
+	}
+	state_set(&child_ep->com, CONNECTING);
+	child_ep->com.tdev = tdev;
+	child_ep->com.cm_id = NULL;
+	child_ep->com.local_addr.sin_family = PF_INET;
+	child_ep->com.local_addr.sin_port = req->local_port;
+	child_ep->com.local_addr.sin_addr.s_addr = req->local_ip;
+	child_ep->com.remote_addr.sin_family = PF_INET;
+	child_ep->com.remote_addr.sin_port = req->peer_port;
+	child_ep->com.remote_addr.sin_addr.s_addr = req->peer_ip;
+	get_ep(&parent_ep->com);
+	child_ep->parent_ep = parent_ep;
+	child_ep->tos = G_PASS_OPEN_TOS(ntohl(req->tos_tid));
+	child_ep->l2t = l2t;
+	child_ep->dst = dst;
+	child_ep->hwtid = hwtid;
+	init_timer(&child_ep->timer);
+	cxgb3_insert_tid(tdev, &t3c_client, child_ep, hwtid);
+	accept_cr(child_ep, req->peer_ip, skb);
+	goto out;
+reject:
+	reject_cr(tdev, hwtid, req->peer_ip, skb);
+out:
+	return CPL_RET_BUF_DONE;
+}
+
+static int pass_establish(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct cpl_pass_establish *req = cplhdr(skb);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->snd_seq = ntohl(req->snd_isn);
+
+	set_emss(ep, ntohs(req->tcp_opt));
+
+	dst_confirm(ep->dst);
+	state_set(&ep->com, MPA_REQ_WAIT);
+	start_ep_timer(ep);
+
+	return CPL_RET_BUF_DONE;
+}
+
+static int peer_close(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+	unsigned long flags;
+	int disconnect = 1;
+	int release = 0;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	dst_confirm(ep->dst);
+
+	spin_lock_irqsave(&ep->com.lock, flags);
+	switch (ep->com.state) {
+	case MPA_REQ_WAIT:
+		__state_set(&ep->com, CLOSING);
+		break;
+	case MPA_REQ_SENT:
+		__state_set(&ep->com, CLOSING);
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REQ_RCVD:
+
+		/*
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		__state_set(&ep->com, CLOSING);
+		get_ep(&ep->com);
+		break;
+	case MPA_REP_SENT:
+		__state_set(&ep->com, CLOSING);
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case FPDU_MODE:
+		__state_set(&ep->com, CLOSING);
+		attrs.next_state = IWCH_QP_STATE_CLOSING;
+		iwch_modify_qp(ep->com.qp->rhp, ep->com.qp,
+			       IWCH_QP_ATTR_NEXT_STATE, &attrs, 1);
+		peer_close_upcall(ep);
+		break;
+	case ABORTING:
+		disconnect = 0;
+		break;
+	case CLOSING:
+		start_ep_timer(ep);
+		__state_set(&ep->com, MORIBUND);
+		disconnect = 0;
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp, ep->com.qp,
+				       IWCH_QP_ATTR_NEXT_STATE, &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		__state_set(&ep->com, DEAD);
+		release = 1;
+		disconnect = 0;
+		break;
+	case DEAD:
+		disconnect = 0;
+		break;
+	default:
+		BUG_ON(1);
+	}
+	spin_unlock_irqrestore(&ep->com.lock, flags);
+	if (disconnect)
+		iwch_ep_disconnect(ep, 0, GFP_KERNEL);
+	if (release)
+		release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+	       status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int peer_abort(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_abort_req_rss *req = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+	struct cpl_abort_rpl *rpl;
+	struct sk_buff *rpl_skb;
+	struct iwch_qp_attributes attrs;
+	int ret;
+	int state;
+
+	if (is_neg_adv_abort(req->status)) {
+		PDBG("%s neg_adv_abort ep %p tid %d\n", __FUNCTION__, ep,
+		     ep->hwtid);
+		t3_l2t_send_event(ep->com.tdev, ep->l2t);
+		return CPL_RET_BUF_DONE;
+	}
+
+	state = state_read(&ep->com);
+	PDBG("%s ep %p state %u\n", __FUNCTION__, ep, state);
+	switch (state) {
+	case CONNECTING:
+		break;
+	case MPA_REQ_WAIT:
+		break;
+	case MPA_REQ_SENT:
+		connect_reply_upcall(ep, -ECONNRESET);
+		break;
+	case MPA_REP_SENT:
+		ep->com.rpl_done = 1;
+		ep->com.rpl_err = -ECONNRESET;
+		PDBG("waking up ep %p\n", ep);
+		wake_up(&ep->com.waitq);
+		break;
+	case MPA_REQ_RCVD:
+
+		/*
+		 * We're gonna mark this puppy DEAD, but keep
+		 * the reference on it until the ULP accepts or
+		 * rejects the CR.
+		 */
+		get_ep(&ep->com);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);	/* fall through */
+	case FPDU_MODE:
+	case CLOSING:
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			ret = iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+			if (ret)
+				printk(KERN_ERR MOD
+				       "%s - qp <- error failed!\n",
+				       __FUNCTION__);
+		}
+		peer_abort_upcall(ep);
+		break;
+	case ABORTING:
+		break;
+	case DEAD:
+		PDBG("%s PEER_ABORT IN DEAD STATE!!!!\n", __FUNCTION__);
+		return CPL_RET_BUF_DONE;
+	default:
+		BUG_ON(1);
+		break;
+	}
+	dst_confirm(ep->dst);
+
+	rpl_skb = get_skb(skb, sizeof(*rpl), GFP_KERNEL);
+	if (!rpl_skb) {
+		printk(KERN_ERR MOD "%s - cannot allocate skb!\n",
+		       __FUNCTION__);
+		dst_release(ep->dst);
+		l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+		put_ep(&ep->com);
+		return CPL_RET_BUF_DONE;
+	}
+	rpl_skb->priority = CPL_PRIORITY_DATA;
+	rpl = (struct cpl_abort_rpl *) skb_put(rpl_skb, sizeof(*rpl));
+	rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL));
+	rpl->wr.wr_lo = htonl(V_WR_TID(ep->hwtid));
+	OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, ep->hwtid));
+	rpl->cmd = CPL_ABORT_NO_RST;
+	ep->com.tdev->send(ep->com.tdev, rpl_skb);
+	if (state != ABORTING) {
+		state_set(&ep->com, DEAD);
+		release_ep_resources(ep);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static int close_con_rpl(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+	struct iwch_qp_attributes attrs;
+	unsigned long flags;
+	int release = 0;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	BUG_ON(!ep);
+
+	/* The cm_id may be null if we failed to connect */
+	spin_lock_irqsave(&ep->com.lock, flags);
+	switch (ep->com.state) {
+	case CLOSING:
+		start_ep_timer(ep);
+		__state_set(&ep->com, MORIBUND);
+		break;
+	case MORIBUND:
+		stop_ep_timer(ep);
+		if ((ep->com.cm_id) && (ep->com.qp)) {
+			attrs.next_state = IWCH_QP_STATE_IDLE;
+			iwch_modify_qp(ep->com.qp->rhp,
+					     ep->com.qp,
+					     IWCH_QP_ATTR_NEXT_STATE,
+					     &attrs, 1);
+		}
+		close_complete_upcall(ep);
+		__state_set(&ep->com, DEAD);
+		release = 1;
+		break;
+	case DEAD:
+	default:
+		BUG_ON(1);
+		break;
+	}
+	spin_unlock_irqrestore(&ep->com.lock, flags);
+	if (release)
+		release_ep_resources(ep);
+	return CPL_RET_BUF_DONE;
+}
+
+/*
+ * T3A does 3 things when a TERM is received:
+ * 1) send up a CPL_RDMA_TERMINATE message with the TERM packet
+ * 2) generate an async event on the QP with the TERMINATE opcode
+ * 3) post a TERMINATE opcode cqe into the associated CQ.
+ *
+ * For (1), we save the message in the qp for later retrieval by the consumer.
+ * For (2), we move the QP into TERMINATE, post a QP event and disconnect.
+ * For (3), we toss the CQE in cxio_poll_cq().
+ *
+ * terminate() handles case (1)...
+ */
+static int terminate(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	skb_pull(skb, sizeof(struct cpl_rdma_terminate));
+	PDBG("%s saving %d bytes of term msg\n", __FUNCTION__, skb->len);
+	memcpy(ep->com.qp->attr.terminate_buffer, skb->data, skb->len);
+	ep->com.qp->attr.terminate_msg_len = skb->len;
+	ep->com.qp->attr.is_terminate_local = 0;
+	return CPL_RET_BUF_DONE;
+}
+
+static int ec_status(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct cpl_rdma_ec_status *rep = cplhdr(skb);
+	struct iwch_ep *ep = ctx;
+
+	PDBG("%s ep %p tid %u status %d\n", __FUNCTION__, ep, ep->hwtid,
+	     rep->status);
+	if (rep->status) {
+		struct iwch_qp_attributes attrs;
+
+		printk(KERN_ERR MOD "%s BAD CLOSE - Aborting tid %u\n",
+		       __FUNCTION__, ep->hwtid);
+		attrs.next_state = IWCH_QP_STATE_ERROR;
+		iwch_modify_qp(ep->com.qp->rhp,
+			       ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+			       &attrs, 1);
+		abort_connection(ep, NULL, GFP_KERNEL);
+	}
+	return CPL_RET_BUF_DONE;
+}
+
+static void ep_timeout(unsigned long arg)
+{
+	struct iwch_ep *ep = (struct iwch_ep *)arg;
+	struct iwch_qp_attributes attrs;
+	unsigned long flags;
+
+	spin_lock_irqsave(&ep->com.lock, flags);
+	PDBG("%s ep %p tid %u state %d\n", __FUNCTION__, ep, ep->hwtid,
+	     ep->com.state);
+	switch (ep->com.state) {
+	case MPA_REQ_SENT:
+		connect_reply_upcall(ep, -ETIMEDOUT);
+		break;
+	case MPA_REQ_WAIT:
+		break;
+	case MORIBUND:
+		if (ep->com.cm_id && ep->com.qp) {
+			attrs.next_state = IWCH_QP_STATE_ERROR;
+			iwch_modify_qp(ep->com.qp->rhp,
+				     ep->com.qp, IWCH_QP_ATTR_NEXT_STATE,
+				     &attrs, 1);
+		}
+		break;
+	default:
+		BUG();
+	}
+	__state_set(&ep->com, CLOSING);
+	spin_unlock_irqrestore(&ep->com.lock, flags);
+	abort_connection(ep, NULL, GFP_ATOMIC);
+	put_ep(&ep->com);
+}
+
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
+{
+	int err;
+	struct iwch_ep *ep = to_ep(cm_id);
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	state_set(&ep->com, CLOSING);
+	if (mpa_rev == 0)
+		abort_connection(ep, NULL, GFP_KERNEL);
+	else {
+		err = send_mpa_reject(ep, pdata, pdata_len);
+		err = send_halfclose(ep, GFP_KERNEL);
+	}
+	return 0;
+}
+
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err;
+	struct iwch_qp_attributes attrs;
+	enum iwch_qp_attr_mask mask;
+	struct iwch_ep *ep = to_ep(cm_id);
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_qp *qp = get_qhp(h, conn_param->qpn);
+
+	PDBG("%s ep %p tid %u\n", __FUNCTION__, ep, ep->hwtid);
+	if (state_read(&ep->com) == DEAD) {
+		put_ep(&ep->com);
+		return -ECONNRESET;
+	}
+
+	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(!qp);
+
+	if ((conn_param->ord > qp->rhp->attr.max_rdma_read_qp_depth) ||
+	    (conn_param->ird > qp->rhp->attr.max_rdma_reads_per_qp)) {
+		abort_connection(ep, NULL, GFP_KERNEL);
+		return -EINVAL;
+	}
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = qp;
+
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	PDBG("%s %d ird %d ord %d\n", __FUNCTION__, __LINE__, ep->ird, ep->ord);
+	get_ep(&ep->com);
+	err = send_mpa_reply(ep, conn_param->private_data,
+			     conn_param->private_data_len);
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL, GFP_KERNEL);
+		put_ep(&ep->com);
+		return err;
+	}
+
+	/* bind QP to EP and move to RTS */
+	attrs.mpa_attr = ep->mpa_attr;
+	attrs.max_ird = ep->ord;
+	attrs.max_ord = ep->ord;
+	attrs.llp_stream_handle = ep;
+	attrs.next_state = IWCH_QP_STATE_RTS;
+
+	/* bind QP and TID with INIT_WR */
+	mask = IWCH_QP_ATTR_NEXT_STATE |
+			     IWCH_QP_ATTR_LLP_STREAM_HANDLE |
+			     IWCH_QP_ATTR_MPA_ATTR |
+			     IWCH_QP_ATTR_MAX_IRD |
+			     IWCH_QP_ATTR_MAX_ORD;
+
+	err = iwch_modify_qp(ep->com.qp->rhp,
+			     ep->com.qp, mask, &attrs, 1);
+
+	if (err) {
+		ep->com.cm_id = NULL;
+		ep->com.qp = NULL;
+		cm_id->rem_ref(cm_id);
+		abort_connection(ep, NULL, GFP_KERNEL);
+	} else {
+		state_set(&ep->com, FPDU_MODE);
+		established_upcall(ep);
+	}
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_ep *ep;
+	struct rtable *rt;
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto out;
+	}
+	init_timer(&ep->timer);
+	ep->plen = conn_param->private_data_len;
+	if (ep->plen)
+		memcpy(ep->mpa_pkt + sizeof(struct mpa_message),
+		       conn_param->private_data, ep->plen);
+	ep->ird = conn_param->ird;
+	ep->ord = conn_param->ord;
+	ep->com.tdev = h->rdev.t3cdev_p;
+
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->com.qp = get_qhp(h, conn_param->qpn);
+	BUG_ON(!ep->com.qp);
+	PDBG("%s qpn 0x%x qp %p cm_id %p\n", __FUNCTION__, conn_param->qpn,
+	     ep->com.qp, cm_id);
+
+	/*
+	 * Allocate an active TID to initiate a TCP connection.
+	 */
+	ep->atid = cxgb3_alloc_atid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->atid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc atid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	/* find a route */
+	rt = find_route(h->rdev.t3cdev_p,
+			cm_id->local_addr.sin_addr.s_addr,
+			cm_id->remote_addr.sin_addr.s_addr,
+			cm_id->local_addr.sin_port,
+			cm_id->remote_addr.sin_port, IPTOS_LOWDELAY);
+	if (!rt) {
+		printk(KERN_ERR MOD "%s - cannot find route.\n", __FUNCTION__);
+		err = -EHOSTUNREACH;
+		goto fail3;
+	}
+	ep->dst = &rt->u.dst;
+
+	/* get a l2t entry */
+	ep->l2t = t3_l2t_get(ep->com.tdev, ep->dst->neighbour,
+			     ep->dst->neighbour->dev);
+	if (!ep->l2t) {
+		printk(KERN_ERR MOD "%s - cannot alloc l2e.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail4;
+	}
+
+	state_set(&ep->com, CONNECTING);
+	ep->tos = IPTOS_LOWDELAY;
+	ep->com.local_addr = cm_id->local_addr;
+	ep->com.remote_addr = cm_id->remote_addr;
+
+	/* send connect request to rnic */
+	err = send_connect(ep);
+	if (!err)
+		goto out;
+
+	l2t_release(L2DATA(h->rdev.t3cdev_p), ep->l2t);
+fail4:
+	dst_release(ep->dst);
+fail3:
+	cxgb3_free_atid(ep->com.tdev, ep->atid);
+fail2:
+	put_ep(&ep->com);
+out:
+	return err;
+}
+
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog)
+{
+	int err = 0;
+	struct iwch_dev *h = to_iwch_dev(cm_id->device);
+	struct iwch_listen_ep *ep;
+
+
+	might_sleep();
+
+	ep = alloc_ep(sizeof(*ep), GFP_KERNEL);
+	if (!ep) {
+		printk(KERN_ERR MOD "%s - cannot alloc ep.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail1;
+	}
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+	ep->com.tdev = h->rdev.t3cdev_p;
+	cm_id->add_ref(cm_id);
+	ep->com.cm_id = cm_id;
+	ep->backlog = backlog;
+	ep->com.local_addr = cm_id->local_addr;
+
+	/*
+	 * Allocate a server TID.
+	 */
+	ep->stid = cxgb3_alloc_stid(h->rdev.t3cdev_p, &t3c_client, ep);
+	if (ep->stid == -1) {
+		printk(KERN_ERR MOD "%s - cannot alloc stid.\n", __FUNCTION__);
+		err = -ENOMEM;
+		goto fail2;
+	}
+
+	state_set(&ep->com, LISTEN);
+	err = listen_start(ep);
+	if (err)
+		goto fail3;
+
+	/* wait for pass_open_rpl */
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	err = ep->com.rpl_err;
+	if (!err) {
+		cm_id->provider_data = ep;
+		goto out;
+	}
+fail3:
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+fail2:
+	put_ep(&ep->com);
+fail1:
+out:
+	return err;
+}
+
+int iwch_destroy_listen(struct iw_cm_id *cm_id)
+{
+	int err;
+	struct iwch_listen_ep *ep = to_listen_ep(cm_id);
+
+	PDBG("%s ep %p\n", __FUNCTION__, ep);
+
+	might_sleep();
+	state_set(&ep->com, DEAD);
+	ep->com.rpl_done = 0;
+	ep->com.rpl_err = 0;
+	err = listen_stop(ep);
+	wait_event(ep->com.waitq, ep->com.rpl_done);
+	cxgb3_free_stid(ep->com.tdev, ep->stid);
+	err = ep->com.rpl_err;
+	cm_id->rem_ref(cm_id);
+	put_ep(&ep->com);
+	return err;
+}
+
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp)
+{
+	int ret = 0;
+	unsigned long flags;
+	int close = 0;
+
+	spin_lock_irqsave(&ep->com.lock, flags);
+
+	PDBG("%s ep %p state %s, abrupt %d\n", __FUNCTION__, ep,
+	     states[ep->com.state], abrupt);
+
+	if (ep->com.state == DEAD) {
+		PDBG("%s already dead ep %p\n", __FUNCTION__, ep);
+		goto out;
+	}
+
+	if (abrupt) {
+		if (ep->com.state != ABORTING) {
+			ep->com.state = ABORTING;
+			close = 1;
+		}
+		goto out;
+	}
+
+	switch (ep->com.state) {
+	case MPA_REQ_WAIT:
+	case MPA_REQ_SENT:
+	case MPA_REQ_RCVD:
+	case MPA_REP_SENT:
+	case FPDU_MODE:
+		ep->com.state = CLOSING;
+		close = 1;
+		break;
+	case CLOSING:
+		start_ep_timer(ep);
+		ep->com.state = MORIBUND;
+		close = 1;
+		break;
+	case MORIBUND:
+		break;
+	default:
+		BUG();
+		break;
+	}
+out:
+	spin_unlock_irqrestore(&ep->com.lock, flags);
+	if (close) {
+		if (abrupt)
+			ret = send_abort(ep, NULL, gfp);
+		else
+			ret = send_halfclose(ep, gfp);
+	}
+	return ret;
+}
+
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new,
+		     struct l2t_entry *l2t)
+{
+	struct iwch_ep *ep = ctx;
+
+	if (ep->dst != old)
+		return 0;
+
+	PDBG("%s ep %p redirect to dst %p l2t %p\n", __FUNCTION__, ep, new,
+	     l2t);
+	dst_hold(new);
+	l2t_release(L2DATA(ep->com.tdev), ep->l2t);
+	ep->l2t = l2t;
+	dst_release(old);
+	ep->dst = new;
+	return 1;
+}
+
+/*
+ * All the CM events are handled on a work queue to have a safe context.
+ */
+static int sched(struct t3cdev *tdev, struct sk_buff *skb, void *ctx)
+{
+	struct iwch_ep_common *epc = ctx;
+
+	get_ep(epc);
+
+	/*
+	 * Save ctx and tdev in the skb->cb area.
+	 */
+	*((void **) skb->cb) = ctx;
+	*((struct t3cdev **) (skb->cb + sizeof(void *))) = tdev;
+
+	/*
+	 * Queue the skb and schedule the worker thread.
+	 */
+	skb_queue_tail(&rxq, skb);
+	queue_work(workq, &skb_work);
+	return 0;
+}
+
+int __init iwch_cm_init(void)
+{
+	skb_queue_head_init(&rxq);
+
+	workq = create_singlethread_workqueue("iw_cxgb3");
+	if (!workq)
+		return -ENOMEM;
+
+	/*
+	 * All upcalls from the T3 Core go to sched() to
+	 * schedule the processing on a work queue.
+	 */
+	t3c_handlers[CPL_ACT_ESTABLISH] = sched;
+	t3c_handlers[CPL_ACT_OPEN_RPL] = sched;
+	t3c_handlers[CPL_RX_DATA] = sched;
+	t3c_handlers[CPL_TX_DMA_ACK] = sched;
+	t3c_handlers[CPL_ABORT_RPL_RSS] = sched;
+	t3c_handlers[CPL_ABORT_RPL] = sched;
+	t3c_handlers[CPL_PASS_OPEN_RPL] = sched;
+	t3c_handlers[CPL_CLOSE_LISTSRV_RPL] = sched;
+	t3c_handlers[CPL_PASS_ACCEPT_REQ] = sched;
+	t3c_handlers[CPL_PASS_ESTABLISH] = sched;
+	t3c_handlers[CPL_PEER_CLOSE] = sched;
+	t3c_handlers[CPL_CLOSE_CON_RPL] = sched;
+	t3c_handlers[CPL_ABORT_REQ_RSS] = sched;
+	t3c_handlers[CPL_RDMA_TERMINATE] = sched;
+	t3c_handlers[CPL_RDMA_EC_STATUS] = sched;
+
+	/*
+	 * These are the real handlers that are called from a
+	 * work queue.
+	 */
+	work_handlers[CPL_ACT_ESTABLISH] = act_establish;
+	work_handlers[CPL_ACT_OPEN_RPL] = act_open_rpl;
+	work_handlers[CPL_RX_DATA] = rx_data;
+	work_handlers[CPL_TX_DMA_ACK] = tx_ack;
+	work_handlers[CPL_ABORT_RPL_RSS] = abort_rpl;
+	work_handlers[CPL_ABORT_RPL] = abort_rpl;
+	work_handlers[CPL_PASS_OPEN_RPL] = pass_open_rpl;
+	work_handlers[CPL_CLOSE_LISTSRV_RPL] = close_listsrv_rpl;
+	work_handlers[CPL_PASS_ACCEPT_REQ] = pass_accept_req;
+	work_handlers[CPL_PASS_ESTABLISH] = pass_establish;
+	work_handlers[CPL_PEER_CLOSE] = peer_close;
+	work_handlers[CPL_ABORT_REQ_RSS] = peer_abort;
+	work_handlers[CPL_CLOSE_CON_RPL] = close_con_rpl;
+	work_handlers[CPL_RDMA_TERMINATE] = terminate;
+	work_handlers[CPL_RDMA_EC_STATUS] = ec_status;
+	return 0;
+}
+
+void __exit iwch_cm_term(void)
+{
+	flush_workqueue(workq);
+	destroy_workqueue(workq);
+}
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.h b/drivers/infiniband/hw/cxgb3/iwch_cm.h
new file mode 100644
index 0000000..893f9d0
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.h
@@ -0,0 +1,223 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef _IWCH_CM_H_
+#define _IWCH_CM_H_
+
+#include <linux/inet.h>
+#include <linux/wait.h>
+#include <linux/spinlock.h>
+#include <linux/kref.h>
+
+#include <rdma/ib_verbs.h>
+#include <rdma/iw_cm.h>
+
+#include "cxgb3_offload.h"
+#include "iwch_provider.h"
+
+#define MPA_KEY_REQ "MPA ID Req Frame"
+#define MPA_KEY_REP "MPA ID Rep Frame"
+
+#define MPA_MAX_PRIVATE_DATA	256
+#define MPA_REV			0	/* XXX - amso1100 uses rev 0! */
+#define MPA_REJECT		0x20
+#define MPA_CRC			0x40
+#define MPA_MARKERS		0x80
+#define MPA_FLAGS_MASK		0xE0
+
+#define put_ep(ep) do { \
+	PDBG("put_ep (via %s:%u) ep %p refcnt %d\n", __FUNCTION__, __LINE__,  \
+	     (ep), atomic_read(&((ep)->kref.refcount))); \
+	kref_put(&((ep)->kref), __free_ep); \
+} while (0)
+
+#define get_ep(ep) do { \
+	PDBG("get_ep (via %s:%u) ep %p, refcnt %d\n", __FUNCTION__, __LINE__, \
+	     (ep), atomic_read(&((ep)->kref.refcount))); \
+	kref_get(&((ep)->kref));  \
+} while (0)
+
+struct mpa_message {
+	u8 key[16];
+	u8 flags;
+	u8 revision;
+	__be16 private_data_size;
+	u8 private_data[0];
+};
+
+struct terminate_message {
+	u8 layer_etype;
+	u8 ecode;
+	__be16 hdrct_rsvd;
+	u8 len_hdrs[0];
+};
+
+#define TERM_MAX_LENGTH (sizeof(struct terminate_message) + 2 + 18 + 28)
+
+enum iwch_layers_types {
+	LAYER_RDMAP		= 0x00,
+	LAYER_DDP		= 0x10,
+	LAYER_MPA		= 0x20,
+	RDMAP_LOCAL_CATA	= 0x00,
+	RDMAP_REMOTE_PROT	= 0x01,
+	RDMAP_REMOTE_OP		= 0x02,
+	DDP_LOCAL_CATA		= 0x00,
+	DDP_TAGGED_ERR		= 0x01,
+	DDP_UNTAGGED_ERR	= 0x02,
+	DDP_LLP			= 0x03
+};
+
+enum iwch_rdma_ecodes {
+	RDMAP_INV_STAG		= 0x00,
+	RDMAP_BASE_BOUNDS	= 0x01,
+	RDMAP_ACC_VIOL		= 0x02,
+	RDMAP_STAG_NOT_ASSOC	= 0x03,
+	RDMAP_TO_WRAP		= 0x04,
+	RDMAP_INV_VERS		= 0x05,
+	RDMAP_INV_OPCODE	= 0x06,
+	RDMAP_STREAM_CATA	= 0x07,
+	RDMAP_GLOBAL_CATA	= 0x08,
+	RDMAP_CANT_INV_STAG	= 0x09,
+	RDMAP_UNSPECIFIED	= 0xff
+};
+
+enum iwch_ddp_ecodes {
+	DDPT_INV_STAG		= 0x00,
+	DDPT_BASE_BOUNDS	= 0x01,
+	DDPT_STAG_NOT_ASSOC	= 0x02,
+	DDPT_TO_WRAP		= 0x03,
+	DDPT_INV_VERS		= 0x04,
+	DDPU_INV_QN		= 0x01,
+	DDPU_INV_MSN_NOBUF	= 0x02,
+	DDPU_INV_MSN_RANGE	= 0x03,
+	DDPU_INV_MO		= 0x04,
+	DDPU_MSG_TOOBIG		= 0x05,
+	DDPU_INV_VERS		= 0x06
+};
+
+enum iwch_mpa_ecodes {
+	MPA_CRC_ERR		= 0x02,
+	MPA_MARKER_ERR		= 0x03
+};
+
+enum iwch_ep_state {
+	IDLE = 0,
+	LISTEN,
+	CONNECTING,
+	MPA_REQ_WAIT,
+	MPA_REQ_SENT,
+	MPA_REQ_RCVD,
+	MPA_REP_SENT,
+	FPDU_MODE,
+	ABORTING,
+	CLOSING,
+	MORIBUND,
+	DEAD,
+};
+
+struct iwch_ep_common {
+	struct iw_cm_id *cm_id;
+	struct iwch_qp *qp;
+	struct t3cdev *tdev;
+	enum iwch_ep_state state;
+	struct kref kref;
+	spinlock_t lock;
+	struct sockaddr_in local_addr;
+	struct sockaddr_in remote_addr;
+	wait_queue_head_t waitq;
+	int rpl_done;
+	int rpl_err;
+};
+
+struct iwch_listen_ep {
+	struct iwch_ep_common com;
+	unsigned int stid;
+	int backlog;
+};
+
+struct iwch_ep {
+	struct iwch_ep_common com;
+	struct iwch_ep *parent_ep;
+	struct timer_list timer;
+	unsigned int atid;
+	u32 hwtid;
+	u32 snd_seq;
+	struct l2t_entry *l2t;
+	struct dst_entry *dst;
+	struct sk_buff *mpa_skb;
+	struct iwch_mpa_attributes mpa_attr;
+	unsigned int mpa_pkt_len;
+	u8 mpa_pkt[sizeof(struct mpa_message) + MPA_MAX_PRIVATE_DATA];
+	u8 tos;
+	u16 emss;
+	u16 plen;
+	u32 ird;
+	u32 ord;
+};
+
+static inline struct iwch_ep *to_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_ep *)cm_id->provider_data;
+}
+
+static inline struct iwch_listen_ep *to_listen_ep(struct iw_cm_id *cm_id)
+{
+	return (struct iwch_listen_ep *)cm_id->provider_data;
+}
+
+static inline int compute_wscale(int win)
+{
+	int wscale = 0;
+
+	while (wscale < 14 && (65535<<wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+/* CM prototypes */
+
+int iwch_connect(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_create_listen(struct iw_cm_id *cm_id, int backlog);
+int iwch_destroy_listen(struct iw_cm_id *cm_id);
+int iwch_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
+int iwch_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param);
+int iwch_ep_disconnect(struct iwch_ep *ep, int abrupt, gfp_t gfp);
+int iwch_quiesce_tid(struct iwch_ep *ep);
+int iwch_resume_tid(struct iwch_ep *ep);
+void __free_ep(struct kref *kref);
+void iwch_rearp(struct iwch_ep *ep);
+int iwch_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new, struct l2t_entry *l2t);
+
+int __init iwch_cm_init(void);
+void __exit iwch_cm_term(void);
+
+#endif				/* _IWCH_CM_H_ */
diff --git a/drivers/infiniband/hw/cxgb3/tcb.h b/drivers/infiniband/hw/cxgb3/tcb.h
new file mode 100644
index 0000000..f287a7c
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/tcb.h
@@ -0,0 +1,603 @@
+/* This file is automatically generated --- do not edit */
+
+#ifndef _TCB_DEFS_H
+#define _TCB_DEFS_H
+
+#define W_TCB_T_STATE    0
+#define S_TCB_T_STATE    0
+#define M_TCB_T_STATE    0xfULL
+#define V_TCB_T_STATE(x) ((x) << S_TCB_T_STATE)
+
+#define W_TCB_TIMER    0
+#define S_TCB_TIMER    4
+#define M_TCB_TIMER    0x1ULL
+#define V_TCB_TIMER(x) ((x) << S_TCB_TIMER)
+
+#define W_TCB_DACK_TIMER    0
+#define S_TCB_DACK_TIMER    5
+#define M_TCB_DACK_TIMER    0x1ULL
+#define V_TCB_DACK_TIMER(x) ((x) << S_TCB_DACK_TIMER)
+
+#define W_TCB_DEL_FLAG    0
+#define S_TCB_DEL_FLAG    6
+#define M_TCB_DEL_FLAG    0x1ULL
+#define V_TCB_DEL_FLAG(x) ((x) << S_TCB_DEL_FLAG)
+
+#define W_TCB_L2T_IX    0
+#define S_TCB_L2T_IX    7
+#define M_TCB_L2T_IX    0x7ffULL
+#define V_TCB_L2T_IX(x) ((x) << S_TCB_L2T_IX)
+
+#define W_TCB_SMAC_SEL    0
+#define S_TCB_SMAC_SEL    18
+#define M_TCB_SMAC_SEL    0x3ULL
+#define V_TCB_SMAC_SEL(x) ((x) << S_TCB_SMAC_SEL)
+
+#define W_TCB_TOS    0
+#define S_TCB_TOS    20
+#define M_TCB_TOS    0x3fULL
+#define V_TCB_TOS(x) ((x) << S_TCB_TOS)
+
+#define W_TCB_MAX_RT    0
+#define S_TCB_MAX_RT    26
+#define M_TCB_MAX_RT    0xfULL
+#define V_TCB_MAX_RT(x) ((x) << S_TCB_MAX_RT)
+
+#define W_TCB_T_RXTSHIFT    0
+#define S_TCB_T_RXTSHIFT    30
+#define M_TCB_T_RXTSHIFT    0xfULL
+#define V_TCB_T_RXTSHIFT(x) ((x) << S_TCB_T_RXTSHIFT)
+
+#define W_TCB_T_DUPACKS    1
+#define S_TCB_T_DUPACKS    2
+#define M_TCB_T_DUPACKS    0xfULL
+#define V_TCB_T_DUPACKS(x) ((x) << S_TCB_T_DUPACKS)
+
+#define W_TCB_T_MAXSEG    1
+#define S_TCB_T_MAXSEG    6
+#define M_TCB_T_MAXSEG    0xfULL
+#define V_TCB_T_MAXSEG(x) ((x) << S_TCB_T_MAXSEG)
+
+#define W_TCB_T_FLAGS1    1
+#define S_TCB_T_FLAGS1    10
+#define M_TCB_T_FLAGS1    0xffffffffULL
+#define V_TCB_T_FLAGS1(x) ((x) << S_TCB_T_FLAGS1)
+
+#define W_TCB_T_MIGRATION    1
+#define S_TCB_T_MIGRATION    20
+#define M_TCB_T_MIGRATION    0x1ULL
+#define V_TCB_T_MIGRATION(x) ((x) << S_TCB_T_MIGRATION)
+
+#define W_TCB_T_FLAGS2    2
+#define S_TCB_T_FLAGS2    10
+#define M_TCB_T_FLAGS2    0x7fULL
+#define V_TCB_T_FLAGS2(x) ((x) << S_TCB_T_FLAGS2)
+
+#define W_TCB_SND_SCALE    2
+#define S_TCB_SND_SCALE    17
+#define M_TCB_SND_SCALE    0xfULL
+#define V_TCB_SND_SCALE(x) ((x) << S_TCB_SND_SCALE)
+
+#define W_TCB_RCV_SCALE    2
+#define S_TCB_RCV_SCALE    21
+#define M_TCB_RCV_SCALE    0xfULL
+#define V_TCB_RCV_SCALE(x) ((x) << S_TCB_RCV_SCALE)
+
+#define W_TCB_SND_UNA_RAW    2
+#define S_TCB_SND_UNA_RAW    25
+#define M_TCB_SND_UNA_RAW    0x7ffffffULL
+#define V_TCB_SND_UNA_RAW(x) ((x) << S_TCB_SND_UNA_RAW)
+
+#define W_TCB_SND_NXT_RAW    3
+#define S_TCB_SND_NXT_RAW    20
+#define M_TCB_SND_NXT_RAW    0x7ffffffULL
+#define V_TCB_SND_NXT_RAW(x) ((x) << S_TCB_SND_NXT_RAW)
+
+#define W_TCB_RCV_NXT    4
+#define S_TCB_RCV_NXT    15
+#define M_TCB_RCV_NXT    0xffffffffULL
+#define V_TCB_RCV_NXT(x) ((x) << S_TCB_RCV_NXT)
+
+#define W_TCB_RCV_ADV    5
+#define S_TCB_RCV_ADV    15
+#define M_TCB_RCV_ADV    0xffffULL
+#define V_TCB_RCV_ADV(x) ((x) << S_TCB_RCV_ADV)
+
+#define W_TCB_SND_MAX_RAW    5
+#define S_TCB_SND_MAX_RAW    31
+#define M_TCB_SND_MAX_RAW    0x7ffffffULL
+#define V_TCB_SND_MAX_RAW(x) ((x) << S_TCB_SND_MAX_RAW)
+
+#define W_TCB_SND_CWND    6
+#define S_TCB_SND_CWND    26
+#define M_TCB_SND_CWND    0x7ffffffULL
+#define V_TCB_SND_CWND(x) ((x) << S_TCB_SND_CWND)
+
+#define W_TCB_SND_SSTHRESH    7
+#define S_TCB_SND_SSTHRESH    21
+#define M_TCB_SND_SSTHRESH    0x7ffffffULL
+#define V_TCB_SND_SSTHRESH(x) ((x) << S_TCB_SND_SSTHRESH)
+
+#define W_TCB_T_RTT_TS_RECENT_AGE    8
+#define S_TCB_T_RTT_TS_RECENT_AGE    16
+#define M_TCB_T_RTT_TS_RECENT_AGE    0xffffffffULL
+#define V_TCB_T_RTT_TS_RECENT_AGE(x) ((x) << S_TCB_T_RTT_TS_RECENT_AGE)
+
+#define W_TCB_T_RTSEQ_RECENT    9
+#define S_TCB_T_RTSEQ_RECENT    16
+#define M_TCB_T_RTSEQ_RECENT    0xffffffffULL
+#define V_TCB_T_RTSEQ_RECENT(x) ((x) << S_TCB_T_RTSEQ_RECENT)
+
+#define W_TCB_T_SRTT    10
+#define S_TCB_T_SRTT    16
+#define M_TCB_T_SRTT    0xffffULL
+#define V_TCB_T_SRTT(x) ((x) << S_TCB_T_SRTT)
+
+#define W_TCB_T_RTTVAR    11
+#define S_TCB_T_RTTVAR    0
+#define M_TCB_T_RTTVAR    0xffffULL
+#define V_TCB_T_RTTVAR(x) ((x) << S_TCB_T_RTTVAR)
+
+#define W_TCB_TS_LAST_ACK_SENT_RAW    11
+#define S_TCB_TS_LAST_ACK_SENT_RAW    16
+#define M_TCB_TS_LAST_ACK_SENT_RAW    0x7ffffffULL
+#define V_TCB_TS_LAST_ACK_SENT_RAW(x) ((x) << S_TCB_TS_LAST_ACK_SENT_RAW)
+
+#define W_TCB_DIP    12
+#define S_TCB_DIP    11
+#define M_TCB_DIP    0xffffffffULL
+#define V_TCB_DIP(x) ((x) << S_TCB_DIP)
+
+#define W_TCB_SIP    13
+#define S_TCB_SIP    11
+#define M_TCB_SIP    0xffffffffULL
+#define V_TCB_SIP(x) ((x) << S_TCB_SIP)
+
+#define W_TCB_DP    14
+#define S_TCB_DP    11
+#define M_TCB_DP    0xffffULL
+#define V_TCB_DP(x) ((x) << S_TCB_DP)
+
+#define W_TCB_SP    14
+#define S_TCB_SP    27
+#define M_TCB_SP    0xffffULL
+#define V_TCB_SP(x) ((x) << S_TCB_SP)
+
+#define W_TCB_TIMESTAMP    15
+#define S_TCB_TIMESTAMP    11
+#define M_TCB_TIMESTAMP    0xffffffffULL
+#define V_TCB_TIMESTAMP(x) ((x) << S_TCB_TIMESTAMP)
+
+#define W_TCB_TIMESTAMP_OFFSET    16
+#define S_TCB_TIMESTAMP_OFFSET    11
+#define M_TCB_TIMESTAMP_OFFSET    0xfULL
+#define V_TCB_TIMESTAMP_OFFSET(x) ((x) << S_TCB_TIMESTAMP_OFFSET)
+
+#define W_TCB_TX_MAX    16
+#define S_TCB_TX_MAX    15
+#define M_TCB_TX_MAX    0xffffffffULL
+#define V_TCB_TX_MAX(x) ((x) << S_TCB_TX_MAX)
+
+#define W_TCB_TX_HDR_PTR_RAW    17
+#define S_TCB_TX_HDR_PTR_RAW    15
+#define M_TCB_TX_HDR_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_HDR_PTR_RAW(x) ((x) << S_TCB_TX_HDR_PTR_RAW)
+
+#define W_TCB_TX_LAST_PTR_RAW    18
+#define S_TCB_TX_LAST_PTR_RAW    0
+#define M_TCB_TX_LAST_PTR_RAW    0x1ffffULL
+#define V_TCB_TX_LAST_PTR_RAW(x) ((x) << S_TCB_TX_LAST_PTR_RAW)
+
+#define W_TCB_TX_COMPACT    18
+#define S_TCB_TX_COMPACT    17
+#define M_TCB_TX_COMPACT    0x1ULL
+#define V_TCB_TX_COMPACT(x) ((x) << S_TCB_TX_COMPACT)
+
+#define W_TCB_RX_COMPACT    18
+#define S_TCB_RX_COMPACT    18
+#define M_TCB_RX_COMPACT    0x1ULL
+#define V_TCB_RX_COMPACT(x) ((x) << S_TCB_RX_COMPACT)
+
+#define W_TCB_RCV_WND    18
+#define S_TCB_RCV_WND    19
+#define M_TCB_RCV_WND    0x7ffffffULL
+#define V_TCB_RCV_WND(x) ((x) << S_TCB_RCV_WND)
+
+#define W_TCB_RX_HDR_OFFSET    19
+#define S_TCB_RX_HDR_OFFSET    14
+#define M_TCB_RX_HDR_OFFSET    0x7ffffffULL
+#define V_TCB_RX_HDR_OFFSET(x) ((x) << S_TCB_RX_HDR_OFFSET)
+
+#define W_TCB_RX_FRAG0_START_IDX_RAW    20
+#define S_TCB_RX_FRAG0_START_IDX_RAW    9
+#define M_TCB_RX_FRAG0_START_IDX_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG0_START_IDX_RAW(x) ((x) << S_TCB_RX_FRAG0_START_IDX_RAW)
+
+#define W_TCB_RX_FRAG1_START_IDX_OFFSET    21
+#define S_TCB_RX_FRAG1_START_IDX_OFFSET    4
+#define M_TCB_RX_FRAG1_START_IDX_OFFSET    0x7ffffffULL
+#define V_TCB_RX_FRAG1_START_IDX_OFFSET(x) ((x) << S_TCB_RX_FRAG1_START_IDX_OFFSET)
+
+#define W_TCB_RX_FRAG0_LEN    21
+#define S_TCB_RX_FRAG0_LEN    31
+#define M_TCB_RX_FRAG0_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG0_LEN(x) ((x) << S_TCB_RX_FRAG0_LEN)
+
+#define W_TCB_RX_FRAG1_LEN    22
+#define S_TCB_RX_FRAG1_LEN    26
+#define M_TCB_RX_FRAG1_LEN    0x7ffffffULL
+#define V_TCB_RX_FRAG1_LEN(x) ((x) << S_TCB_RX_FRAG1_LEN)
+
+#define W_TCB_NEWRENO_RECOVER    23
+#define S_TCB_NEWRENO_RECOVER    21
+#define M_TCB_NEWRENO_RECOVER    0x7ffffffULL
+#define V_TCB_NEWRENO_RECOVER(x) ((x) << S_TCB_NEWRENO_RECOVER)
+
+#define W_TCB_PDU_HAVE_LEN    24
+#define S_TCB_PDU_HAVE_LEN    16
+#define M_TCB_PDU_HAVE_LEN    0x1ULL
+#define V_TCB_PDU_HAVE_LEN(x) ((x) << S_TCB_PDU_HAVE_LEN)
+
+#define W_TCB_PDU_LEN    24
+#define S_TCB_PDU_LEN    17
+#define M_TCB_PDU_LEN    0xffffULL
+#define V_TCB_PDU_LEN(x) ((x) << S_TCB_PDU_LEN)
+
+#define W_TCB_RX_QUIESCE    25
+#define S_TCB_RX_QUIESCE    1
+#define M_TCB_RX_QUIESCE    0x1ULL
+#define V_TCB_RX_QUIESCE(x) ((x) << S_TCB_RX_QUIESCE)
+
+#define W_TCB_RX_PTR_RAW    25
+#define S_TCB_RX_PTR_RAW    2
+#define M_TCB_RX_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_PTR_RAW(x) ((x) << S_TCB_RX_PTR_RAW)
+
+#define W_TCB_CPU_NO    25
+#define S_TCB_CPU_NO    19
+#define M_TCB_CPU_NO    0x7fULL
+#define V_TCB_CPU_NO(x) ((x) << S_TCB_CPU_NO)
+
+#define W_TCB_ULP_TYPE    25
+#define S_TCB_ULP_TYPE    26
+#define M_TCB_ULP_TYPE    0xfULL
+#define V_TCB_ULP_TYPE(x) ((x) << S_TCB_ULP_TYPE)
+
+#define W_TCB_RX_FRAG1_PTR_RAW    25
+#define S_TCB_RX_FRAG1_PTR_RAW    30
+#define M_TCB_RX_FRAG1_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    26
+#define S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    15
+#define M_TCB_RX_FRAG2_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG2_START_IDX_OFFSET_RAW)
+
+#define W_TCB_RX_FRAG2_PTR_RAW    27
+#define S_TCB_RX_FRAG2_PTR_RAW    10
+#define M_TCB_RX_FRAG2_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG2_PTR_RAW(x) ((x) << S_TCB_RX_FRAG2_PTR_RAW)
+
+#define W_TCB_RX_FRAG2_LEN_RAW    27
+#define S_TCB_RX_FRAG2_LEN_RAW    27
+#define M_TCB_RX_FRAG2_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG2_LEN_RAW(x) ((x) << S_TCB_RX_FRAG2_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_PTR_RAW    28
+#define S_TCB_RX_FRAG3_PTR_RAW    22
+#define M_TCB_RX_FRAG3_PTR_RAW    0x1ffffULL
+#define V_TCB_RX_FRAG3_PTR_RAW(x) ((x) << S_TCB_RX_FRAG3_PTR_RAW)
+
+#define W_TCB_RX_FRAG3_LEN_RAW    29
+#define S_TCB_RX_FRAG3_LEN_RAW    7
+#define M_TCB_RX_FRAG3_LEN_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_LEN_RAW(x) ((x) << S_TCB_RX_FRAG3_LEN_RAW)
+
+#define W_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    30
+#define S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    2
+#define M_TCB_RX_FRAG3_START_IDX_OFFSET_RAW    0x7ffffffULL
+#define V_TCB_RX_FRAG3_START_IDX_OFFSET_RAW(x) ((x) << S_TCB_RX_FRAG3_START_IDX_OFFSET_RAW)
+
+#define W_TCB_PDU_HDR_LEN    30
+#define S_TCB_PDU_HDR_LEN    29
+#define M_TCB_PDU_HDR_LEN    0xffULL
+#define V_TCB_PDU_HDR_LEN(x) ((x) << S_TCB_PDU_HDR_LEN)
+
+#define W_TCB_SLUSH1    31
+#define S_TCB_SLUSH1    5
+#define M_TCB_SLUSH1    0x7ffffULL
+#define V_TCB_SLUSH1(x) ((x) << S_TCB_SLUSH1)
+
+#define W_TCB_ULP_RAW    31
+#define S_TCB_ULP_RAW    24
+#define M_TCB_ULP_RAW    0xffULL
+#define V_TCB_ULP_RAW(x) ((x) << S_TCB_ULP_RAW)
+
+#define W_TCB_DDP_RDMAP_VERSION    25
+#define S_TCB_DDP_RDMAP_VERSION    30
+#define M_TCB_DDP_RDMAP_VERSION    0x1ULL
+#define V_TCB_DDP_RDMAP_VERSION(x) ((x) << S_TCB_DDP_RDMAP_VERSION)
+
+#define W_TCB_MARKER_ENABLE_RX    25
+#define S_TCB_MARKER_ENABLE_RX    31
+#define M_TCB_MARKER_ENABLE_RX    0x1ULL
+#define V_TCB_MARKER_ENABLE_RX(x) ((x) << S_TCB_MARKER_ENABLE_RX)
+
+#define W_TCB_MARKER_ENABLE_TX    26
+#define S_TCB_MARKER_ENABLE_TX    0
+#define M_TCB_MARKER_ENABLE_TX    0x1ULL
+#define V_TCB_MARKER_ENABLE_TX(x) ((x) << S_TCB_MARKER_ENABLE_TX)
+
+#define W_TCB_CRC_ENABLE    26
+#define S_TCB_CRC_ENABLE    1
+#define M_TCB_CRC_ENABLE    0x1ULL
+#define V_TCB_CRC_ENABLE(x) ((x) << S_TCB_CRC_ENABLE)
+
+#define W_TCB_IRS_ULP    26
+#define S_TCB_IRS_ULP    2
+#define M_TCB_IRS_ULP    0x1ffULL
+#define V_TCB_IRS_ULP(x) ((x) << S_TCB_IRS_ULP)
+
+#define W_TCB_ISS_ULP    26
+#define S_TCB_ISS_ULP    11
+#define M_TCB_ISS_ULP    0x1ffULL
+#define V_TCB_ISS_ULP(x) ((x) << S_TCB_ISS_ULP)
+
+#define W_TCB_TX_PDU_LEN    26
+#define S_TCB_TX_PDU_LEN    20
+#define M_TCB_TX_PDU_LEN    0x3fffULL
+#define V_TCB_TX_PDU_LEN(x) ((x) << S_TCB_TX_PDU_LEN)
+
+#define W_TCB_TX_PDU_OUT    27
+#define S_TCB_TX_PDU_OUT    2
+#define M_TCB_TX_PDU_OUT    0x1ULL
+#define V_TCB_TX_PDU_OUT(x) ((x) << S_TCB_TX_PDU_OUT)
+
+#define W_TCB_CQ_IDX_SQ    27
+#define S_TCB_CQ_IDX_SQ    3
+#define M_TCB_CQ_IDX_SQ    0xffffULL
+#define V_TCB_CQ_IDX_SQ(x) ((x) << S_TCB_CQ_IDX_SQ)
+
+#define W_TCB_CQ_IDX_RQ    27
+#define S_TCB_CQ_IDX_RQ    19
+#define M_TCB_CQ_IDX_RQ    0xffffULL
+#define V_TCB_CQ_IDX_RQ(x) ((x) << S_TCB_CQ_IDX_RQ)
+
+#define W_TCB_QP_ID    28
+#define S_TCB_QP_ID    3
+#define M_TCB_QP_ID    0xffffULL
+#define V_TCB_QP_ID(x) ((x) << S_TCB_QP_ID)
+
+#define W_TCB_PD_ID    28
+#define S_TCB_PD_ID    19
+#define M_TCB_PD_ID    0xffffULL
+#define V_TCB_PD_ID(x) ((x) << S_TCB_PD_ID)
+
+#define W_TCB_STAG    29
+#define S_TCB_STAG    3
+#define M_TCB_STAG    0xffffffffULL
+#define V_TCB_STAG(x) ((x) << S_TCB_STAG)
+
+#define W_TCB_RQ_START    30
+#define S_TCB_RQ_START    3
+#define M_TCB_RQ_START    0x3ffffffULL
+#define V_TCB_RQ_START(x) ((x) << S_TCB_RQ_START)
+
+#define W_TCB_RQ_MSN    30
+#define S_TCB_RQ_MSN    29
+#define M_TCB_RQ_MSN    0x3ffULL
+#define V_TCB_RQ_MSN(x) ((x) << S_TCB_RQ_MSN)
+
+#define W_TCB_RQ_MAX_OFFSET    31
+#define S_TCB_RQ_MAX_OFFSET    7
+#define M_TCB_RQ_MAX_OFFSET    0xfULL
+#define V_TCB_RQ_MAX_OFFSET(x) ((x) << S_TCB_RQ_MAX_OFFSET)
+
+#define W_TCB_RQ_WRITE_PTR    31
+#define S_TCB_RQ_WRITE_PTR    11
+#define M_TCB_RQ_WRITE_PTR    0x3ffULL
+#define V_TCB_RQ_WRITE_PTR(x) ((x) << S_TCB_RQ_WRITE_PTR)
+
+#define W_TCB_INB_WRITE_PERM    31
+#define S_TCB_INB_WRITE_PERM    21
+#define M_TCB_INB_WRITE_PERM    0x1ULL
+#define V_TCB_INB_WRITE_PERM(x) ((x) << S_TCB_INB_WRITE_PERM)
+
+#define W_TCB_INB_READ_PERM    31
+#define S_TCB_INB_READ_PERM    22
+#define M_TCB_INB_READ_PERM    0x1ULL
+#define V_TCB_INB_READ_PERM(x) ((x) << S_TCB_INB_READ_PERM)
+
+#define W_TCB_ORD_L_BIT_VLD    31
+#define S_TCB_ORD_L_BIT_VLD    23
+#define M_TCB_ORD_L_BIT_VLD    0x1ULL
+#define V_TCB_ORD_L_BIT_VLD(x) ((x) << S_TCB_ORD_L_BIT_VLD)
+
+#define W_TCB_RDMAP_OPCODE    31
+#define S_TCB_RDMAP_OPCODE    24
+#define M_TCB_RDMAP_OPCODE    0xfULL
+#define V_TCB_RDMAP_OPCODE(x) ((x) << S_TCB_RDMAP_OPCODE)
+
+#define W_TCB_TX_FLUSH    31
+#define S_TCB_TX_FLUSH    28
+#define M_TCB_TX_FLUSH    0x1ULL
+#define V_TCB_TX_FLUSH(x) ((x) << S_TCB_TX_FLUSH)
+
+#define W_TCB_TX_OOS_RXMT    31
+#define S_TCB_TX_OOS_RXMT    29
+#define M_TCB_TX_OOS_RXMT    0x1ULL
+#define V_TCB_TX_OOS_RXMT(x) ((x) << S_TCB_TX_OOS_RXMT)
+
+#define W_TCB_TX_OOS_TXMT    31
+#define S_TCB_TX_OOS_TXMT    30
+#define M_TCB_TX_OOS_TXMT    0x1ULL
+#define V_TCB_TX_OOS_TXMT(x) ((x) << S_TCB_TX_OOS_TXMT)
+
+#define W_TCB_SLUSH_AUX2    31
+#define S_TCB_SLUSH_AUX2    31
+#define M_TCB_SLUSH_AUX2    0x1ULL
+#define V_TCB_SLUSH_AUX2(x) ((x) << S_TCB_SLUSH_AUX2)
+
+#define W_TCB_RX_FRAG1_PTR_RAW2    25
+#define S_TCB_RX_FRAG1_PTR_RAW2    30
+#define M_TCB_RX_FRAG1_PTR_RAW2    0x1ffffULL
+#define V_TCB_RX_FRAG1_PTR_RAW2(x) ((x) << S_TCB_RX_FRAG1_PTR_RAW2)
+
+#define W_TCB_RX_DDP_FLAGS    26
+#define S_TCB_RX_DDP_FLAGS    15
+#define M_TCB_RX_DDP_FLAGS    0x3ffULL
+#define V_TCB_RX_DDP_FLAGS(x) ((x) << S_TCB_RX_DDP_FLAGS)
+
+#define W_TCB_SLUSH_AUX3    26
+#define S_TCB_SLUSH_AUX3    31
+#define M_TCB_SLUSH_AUX3    0x1ffULL
+#define V_TCB_SLUSH_AUX3(x) ((x) << S_TCB_SLUSH_AUX3)
+
+#define W_TCB_RX_DDP_BUF0_OFFSET    27
+#define S_TCB_RX_DDP_BUF0_OFFSET    8
+#define M_TCB_RX_DDP_BUF0_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF0_OFFSET)
+
+#define W_TCB_RX_DDP_BUF0_LEN    27
+#define S_TCB_RX_DDP_BUF0_LEN    30
+#define M_TCB_RX_DDP_BUF0_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF0_LEN(x) ((x) << S_TCB_RX_DDP_BUF0_LEN)
+
+#define W_TCB_RX_DDP_BUF1_OFFSET    28
+#define S_TCB_RX_DDP_BUF1_OFFSET    20
+#define M_TCB_RX_DDP_BUF1_OFFSET    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_OFFSET(x) ((x) << S_TCB_RX_DDP_BUF1_OFFSET)
+
+#define W_TCB_RX_DDP_BUF1_LEN    29
+#define S_TCB_RX_DDP_BUF1_LEN    10
+#define M_TCB_RX_DDP_BUF1_LEN    0x3fffffULL
+#define V_TCB_RX_DDP_BUF1_LEN(x) ((x) << S_TCB_RX_DDP_BUF1_LEN)
+
+#define W_TCB_RX_DDP_BUF0_TAG    30
+#define S_TCB_RX_DDP_BUF0_TAG    0
+#define M_TCB_RX_DDP_BUF0_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF0_TAG(x) ((x) << S_TCB_RX_DDP_BUF0_TAG)
+
+#define W_TCB_RX_DDP_BUF1_TAG    31
+#define S_TCB_RX_DDP_BUF1_TAG    0
+#define M_TCB_RX_DDP_BUF1_TAG    0xffffffffULL
+#define V_TCB_RX_DDP_BUF1_TAG(x) ((x) << S_TCB_RX_DDP_BUF1_TAG)
+
+#define S_TF_DACK    10
+#define V_TF_DACK(x) ((x) << S_TF_DACK)
+
+#define S_TF_NAGLE    11
+#define V_TF_NAGLE(x) ((x) << S_TF_NAGLE)
+
+#define S_TF_RECV_SCALE    12
+#define V_TF_RECV_SCALE(x) ((x) << S_TF_RECV_SCALE)
+
+#define S_TF_RECV_TSTMP    13
+#define V_TF_RECV_TSTMP(x) ((x) << S_TF_RECV_TSTMP)
+
+#define S_TF_RECV_SACK    14
+#define V_TF_RECV_SACK(x) ((x) << S_TF_RECV_SACK)
+
+#define S_TF_TURBO    15
+#define V_TF_TURBO(x) ((x) << S_TF_TURBO)
+
+#define S_TF_KEEPALIVE    16
+#define V_TF_KEEPALIVE(x) ((x) << S_TF_KEEPALIVE)
+
+#define S_TF_TCAM_BYPASS    17
+#define V_TF_TCAM_BYPASS(x) ((x) << S_TF_TCAM_BYPASS)
+
+#define S_TF_CORE_FIN    18
+#define V_TF_CORE_FIN(x) ((x) << S_TF_CORE_FIN)
+
+#define S_TF_CORE_MORE    19
+#define V_TF_CORE_MORE(x) ((x) << S_TF_CORE_MORE)
+
+#define S_TF_MIGRATING    20
+#define V_TF_MIGRATING(x) ((x) << S_TF_MIGRATING)
+
+#define S_TF_ACTIVE_OPEN    21
+#define V_TF_ACTIVE_OPEN(x) ((x) << S_TF_ACTIVE_OPEN)
+
+#define S_TF_ASK_MODE    22
+#define V_TF_ASK_MODE(x) ((x) << S_TF_ASK_MODE)
+
+#define S_TF_NON_OFFLOAD    23
+#define V_TF_NON_OFFLOAD(x) ((x) << S_TF_NON_OFFLOAD)
+
+#define S_TF_MOD_SCHD    24
+#define V_TF_MOD_SCHD(x) ((x) << S_TF_MOD_SCHD)
+
+#define S_TF_MOD_SCHD_REASON0    25
+#define V_TF_MOD_SCHD_REASON0(x) ((x) << S_TF_MOD_SCHD_REASON0)
+
+#define S_TF_MOD_SCHD_REASON1    26
+#define V_TF_MOD_SCHD_REASON1(x) ((x) << S_TF_MOD_SCHD_REASON1)
+
+#define S_TF_MOD_SCHD_RX    27
+#define V_TF_MOD_SCHD_RX(x) ((x) << S_TF_MOD_SCHD_RX)
+
+#define S_TF_CORE_PUSH    28
+#define V_TF_CORE_PUSH(x) ((x) << S_TF_CORE_PUSH)
+
+#define S_TF_RCV_COALESCE_ENABLE    29
+#define V_TF_RCV_COALESCE_ENABLE(x) ((x) << S_TF_RCV_COALESCE_ENABLE)
+
+#define S_TF_RCV_COALESCE_PUSH    30
+#define V_TF_RCV_COALESCE_PUSH(x) ((x) << S_TF_RCV_COALESCE_PUSH)
+
+#define S_TF_RCV_COALESCE_LAST_PSH    31
+#define V_TF_RCV_COALESCE_LAST_PSH(x) ((x) << S_TF_RCV_COALESCE_LAST_PSH)
+
+#define S_TF_RCV_COALESCE_HEARTBEAT    32
+#define V_TF_RCV_COALESCE_HEARTBEAT(x) ((x) << S_TF_RCV_COALESCE_HEARTBEAT)
+
+#define S_TF_HALF_CLOSE    33
+#define V_TF_HALF_CLOSE(x) ((x) << S_TF_HALF_CLOSE)
+
+#define S_TF_DACK_MSS    34
+#define V_TF_DACK_MSS(x) ((x) << S_TF_DACK_MSS)
+
+#define S_TF_CCTRL_SEL0    35
+#define V_TF_CCTRL_SEL0(x) ((x) << S_TF_CCTRL_SEL0)
+
+#define S_TF_CCTRL_SEL1    36
+#define V_TF_CCTRL_SEL1(x) ((x) << S_TF_CCTRL_SEL1)
+
+#define S_TF_TCP_NEWRENO_FAST_RECOVERY    37
+#define V_TF_TCP_NEWRENO_FAST_RECOVERY(x) ((x) << S_TF_TCP_NEWRENO_FAST_RECOVERY)
+
+#define S_TF_TX_PACE_AUTO    38
+#define V_TF_TX_PACE_AUTO(x) ((x) << S_TF_TX_PACE_AUTO)
+
+#define S_TF_PEER_FIN_HELD    39
+#define V_TF_PEER_FIN_HELD(x) ((x) << S_TF_PEER_FIN_HELD)
+
+#define S_TF_CORE_URG    40
+#define V_TF_CORE_URG(x) ((x) << S_TF_CORE_URG)
+
+#define S_TF_RDMA_ERROR    41
+#define V_TF_RDMA_ERROR(x) ((x) << S_TF_RDMA_ERROR)
+
+#define S_TF_SSWS_DISABLED    42
+#define V_TF_SSWS_DISABLED(x) ((x) << S_TF_SSWS_DISABLED)
+
+#define S_TF_DUPACK_COUNT_ODD    43
+#define V_TF_DUPACK_COUNT_ODD(x) ((x) << S_TF_DUPACK_COUNT_ODD)
+
+#define S_TF_TX_CHANNEL    44
+#define V_TF_TX_CHANNEL(x) ((x) << S_TF_TX_CHANNEL)
+
+#define S_TF_RX_CHANNEL    45
+#define V_TF_RX_CHANNEL(x) ((x) << S_TF_RX_CHANNEL)
+
+#define S_TF_TX_PACE_FIXED    46
+#define V_TF_TX_PACE_FIXED(x) ((x) << S_TF_TX_PACE_FIXED)
+
+#define S_TF_RDMA_FLM_ERROR    47
+#define V_TF_RDMA_FLM_ERROR(x) ((x) << S_TF_RDMA_FLM_ERROR)
+
+#define S_TF_RX_FLOW_CONTROL_DISABLE    48
+#define V_TF_RX_FLOW_CONTROL_DISABLE(x) ((x) << S_TF_RX_FLOW_CONTROL_DISABLE)
+
+#endif /* _TCB_DEFS_H */
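
Note on reading the generated definitions above (not part of the patch). The W_/S_/M_ triples appear to follow the usual convention for this hardware: W_ is the index of the 32-bit TCB word in which a field starts, S_ is the bit offset inside that word, and M_ is the mask once the field has been shifted down. That reading is an assumption; the header itself does not say so. Some fields cross a word boundary (T_RXTSHIFT starts at bit 30 with a 4-bit mask), so an accessor has to look at two adjacent words. A minimal sketch, where tcb_get_field(), tcb_set_field() and the host-order word array are hypothetical and only illustrate the arithmetic:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the header above. */
#define W_TCB_L2T_IX        0
#define S_TCB_L2T_IX        7
#define M_TCB_L2T_IX        0x7ffULL

#define W_TCB_T_RXTSHIFT    0
#define S_TCB_T_RXTSHIFT    30
#define M_TCB_T_RXTSHIFT    0xfULL

/*
 * Hypothetical helper: read a field from a TCB image held as host-order
 * 32-bit words.  Two adjacent words are joined into a 64-bit window so
 * that a field crossing a word boundary comes out whole.
 */
static uint64_t tcb_get_field(const uint32_t *tcb, size_t nwords,
			      unsigned int word, unsigned int shift,
			      uint64_t mask)
{
	uint64_t lo = word < nwords ? tcb[word] : 0;
	uint64_t hi = word + 1 < nwords ? tcb[word + 1] : 0;

	assert(shift < 32);
	return (((hi << 32) | lo) >> shift) & mask;
}

/* Hypothetical counterpart: store a field through the same window. */
static void tcb_set_field(uint32_t *tcb, size_t nwords, unsigned int word,
			  unsigned int shift, uint64_t mask, uint64_t val)
{
	uint64_t m = mask << shift;
	uint64_t v = (val & mask) << shift;

	assert(shift < 32);
	if (word < nwords)
		tcb[word] = (tcb[word] & ~(uint32_t)m) | (uint32_t)v;
	if (word + 1 < nwords)
		tcb[word + 1] = (tcb[word + 1] & ~(uint32_t)(m >> 32)) |
				(uint32_t)(v >> 32);
}
```

The masks carry a ULL suffix, which keeps the shifted mask 64 bits wide. The same care is needed with the V_TF_* flag macros whose shift is 32 or more: the argument should be a 64-bit value such as 1ULL, since shifting a 32-bit int that far is undefined in C.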


From swise at opengridcomputing.com  Wed Dec 20 11:20:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:20:25 -0600
Subject: [openib-general] [PATCH  v5 05/13] iw_cxgb3 Queue Pairs
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192025.19316.13831.stgit@dell3.ogc.int>


Code to manipulate the QP, including posting send and receive work
requests and binding memory windows.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
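
Note outside the commit message: the send and write builders in this patch add up the scatter/gather lengths in a u32 and fail the request with -EMSGSIZE when the sum wraps. A minimal standalone sketch of that check, together with the flit count used for a plain send (4 header flits plus two per SGE, as in iwch_build_rdma_send). struct sge, MAX_SGE and sum_sgl() are hypothetical stand-ins for the kernel types, and the value 4 for the SGE limit is an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_SGE 4	/* assumed stand-in for T3_MAX_SGE */

struct sge {		/* hypothetical, mirrors the fields used from ib_sge */
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Sum the SGE lengths into *plen and compute the send flit count.
 * Returns 0, -EINVAL for too many SGEs, or -EMSGSIZE when the 32-bit
 * total wraps around.
 */
static int sum_sgl(const struct sge *sgl, int num_sge,
		   uint32_t *plen, uint8_t *flit_cnt)
{
	uint32_t total = 0;
	int i;

	assert(plen && flit_cnt);
	if (num_sge < 0 || num_sge > MAX_SGE)
		return -EINVAL;
	for (i = 0; i < num_sge; i++) {
		/* unsigned addition wraps; a smaller result means overflow */
		if (total + sgl[i].length < total)
			return -EMSGSIZE;
		total += sgl[i].length;
	}
	*plen = total;
	*flit_cnt = 4 + (num_sge << 1);
	return 0;
}
```

The sketch checks the SGE count before touching the list, so an oversized request is rejected without reading past the array.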

 drivers/infiniband/hw/cxgb3/iwch_qp.c | 1007 +++++++++++++++++++++++++++++++++
 1 files changed, 1007 insertions(+), 0 deletions(-)
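
Also outside the commit proper: iwch_sgl2pbl_map() in this patch turns an SGE address into an index into the adapter's page buffer list (PBL). Page sizes are encoded as a shift relative to 4 KB, so a page is 1 << (12 + page_size) bytes. A minimal sketch of the bounds checks and the index arithmetic, using hypothetical names (struct mr_attr, pbl_index()) and assuming 8-byte PBL entries, which is what the >> 3 in the patch seems to imply:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct mr_attr {		/* hypothetical subset of the MR attributes */
	uint64_t va_fbo;	/* virtual address of the first byte */
	uint32_t len;		/* region length in bytes */
	uint8_t page_size;	/* page is 1 << (12 + page_size) bytes */
	uint32_t pbl_addr;	/* byte address of the first PBL entry */
};

/*
 * Map [addr, addr + length) inside the region to the PBL entry that
 * covers addr.  Returns 0 and sets *index, or -EINVAL if the range
 * wraps or falls outside the region.
 */
static int pbl_index(const struct mr_attr *mr, uint32_t pbl_base,
		     uint64_t addr, uint32_t length, uint32_t *index)
{
	unsigned int page_shift = 12 + mr->page_size;
	uint64_t offset;

	assert(index);
	if (addr < mr->va_fbo)
		return -EINVAL;
	if (addr + length < addr)		/* 64-bit wraparound */
		return -EINVAL;
	if (addr + length > mr->va_fbo + mr->len)
		return -EINVAL;

	/* byte offset from the start of the first page of the region */
	offset = addr - mr->va_fbo;
	offset += mr->va_fbo & ((1ULL << page_shift) - 1);

	*index = ((mr->pbl_addr - pbl_base) >> 3) +
		 (uint32_t)(offset >> page_shift);
	return 0;
}
```

Adding the first-byte offset of va_fbo before shifting matters when the region does not start on a page boundary; without it the page index would be one too small for addresses near the end of a page.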

diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c b/drivers/infiniband/hw/cxgb3/iwch_qp.c
new file mode 100644
index 0000000..ad044bd
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -0,0 +1,1007 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+
+#define NO_SUPPORT -1
+
+static inline int iwch_build_rdma_send(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+
+	switch (wr->opcode) {
+	case IB_WR_SEND:
+	case IB_WR_SEND_WITH_IMM:
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			wqe->send.rdmaop = T3_SEND_WITH_SE;
+		else
+			wqe->send.rdmaop = T3_SEND;
+		wqe->send.rem_stag = 0;
+		break;
+#if 0				/* Not currently supported */
+	case TYPE_SEND_INVALIDATE:
+	case TYPE_SEND_INVALIDATE_IMMEDIATE:
+		wqe->send.rdmaop = T3_SEND_WITH_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+	case TYPE_SEND_SE_INVALIDATE:
+		wqe->send.rdmaop = T3_SEND_WITH_SE_INV;
+		wqe->send.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+		break;
+#endif
+	default:
+		break;
+	}
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->send.reserved[0] = 0;
+	wqe->send.reserved[1] = 0;
+	wqe->send.reserved[2] = 0;
+	if (wr->opcode == IB_WR_SEND_WITH_IMM) {
+		plen = 4;
+		wqe->send.sgl[0].stag = wr->imm_data;
+		wqe->send.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->send.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 5;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->send.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->send.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->send.sgl[i].to = cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->send.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 4 + ((wr->num_sge) << 1);
+	}
+	wqe->send.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_write(union t3_wr *wqe, struct ib_send_wr *wr,
+					u8 *flit_cnt)
+{
+	int i;
+	u32 plen;
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	wqe->write.rdmaop = T3_RDMA_WRITE;
+	wqe->write.reserved[0] = 0;
+	wqe->write.reserved[1] = 0;
+	wqe->write.reserved[2] = 0;
+	wqe->write.stag_sink = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->write.to_sink = cpu_to_be64(wr->wr.rdma.remote_addr);
+
+	if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM) {
+		plen = 4;
+		wqe->write.sgl[0].stag = wr->imm_data;
+		wqe->write.sgl[0].len = __constant_cpu_to_be32(0);
+		wqe->write.num_sgle = __constant_cpu_to_be32(0);
+		*flit_cnt = 6;
+	} else {
+		plen = 0;
+		for (i = 0; i < wr->num_sge; i++) {
+			if ((plen + wr->sg_list[i].length) < plen) {
+				return -EMSGSIZE;
+			}
+			plen += wr->sg_list[i].length;
+			wqe->write.sgl[i].stag =
+			    cpu_to_be32(wr->sg_list[i].lkey);
+			wqe->write.sgl[i].len =
+			    cpu_to_be32(wr->sg_list[i].length);
+			wqe->write.sgl[i].to =
+			    cpu_to_be64(wr->sg_list[i].addr);
+		}
+		wqe->write.num_sgle = cpu_to_be32(wr->num_sge);
+		*flit_cnt = 5 + ((wr->num_sge) << 1);
+	}
+	wqe->write.plen = cpu_to_be32(plen);
+	return 0;
+}
+
+static inline int iwch_build_rdma_read(union t3_wr *wqe, struct ib_send_wr *wr,
+				       u8 *flit_cnt)
+{
+	if (wr->num_sge > 1)
+		return -EINVAL;
+	wqe->read.rdmaop = T3_READ_REQ;
+	wqe->read.reserved[0] = 0;
+	wqe->read.reserved[1] = 0;
+	wqe->read.reserved[2] = 0;
+	wqe->read.rem_stag = cpu_to_be32(wr->wr.rdma.rkey);
+	wqe->read.rem_to = cpu_to_be64(wr->wr.rdma.remote_addr);
+	wqe->read.local_stag = cpu_to_be32(wr->sg_list[0].lkey);
+	wqe->read.local_len = cpu_to_be32(wr->sg_list[0].length);
+	wqe->read.local_to = cpu_to_be64(wr->sg_list[0].addr);
+	*flit_cnt = sizeof(struct t3_rdma_read_wr) >> 3;
+	return 0;
+}
+
+/*
+ * TBD: this is going to be moved to firmware. Missing pdid/qpid check for now.
+ */
+static inline int iwch_sgl2pbl_map(struct iwch_dev *rhp,
+				   struct ib_sge *sg_list, u32 num_sgle,
+				   u32 *pbl_addr, u8 *page_size)
+{
+	int i;
+	struct iwch_mr *mhp;
+	u32 offset;
+	for (i = 0; i < num_sgle; i++) {
+
+		mhp = get_mhp(rhp, (sg_list[i].lkey) >> 8);
+		if (!mhp) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (!mhp->attr.state) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+		if (mhp->attr.zbva) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EIO;
+		}
+
+		if (sg_list[i].addr < mhp->attr.va_fbo) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) <
+		    sg_list[i].addr) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		if (sg_list[i].addr + ((u64) sg_list[i].length) >
+		    mhp->attr.va_fbo + ((u64) mhp->attr.len)) {
+			PDBG("%s %d\n", __FUNCTION__, __LINE__);
+			return -EINVAL;
+		}
+		offset = sg_list[i].addr - mhp->attr.va_fbo;
+		offset += ((u32) mhp->attr.va_fbo) %
+		          (1UL << (12 + mhp->attr.page_size));
+		pbl_addr[i] = ((mhp->attr.pbl_addr -
+			        rhp->rdev.rnic_info.pbl_base) >> 3) +
+			      (offset >> (12 + mhp->attr.page_size));
+		page_size[i] = mhp->attr.page_size;
+	}
+	return 0;
+}
+
+static inline int iwch_build_rdma_recv(struct iwch_dev *rhp,
+						    union t3_wr *wqe,
+						    struct ib_recv_wr *wr)
+{
+	int i, err = 0;
+	u32 pbl_addr[4];
+	u8 page_size[4] = { 0 };	/* entries past num_sge stay 0 */
+	if (wr->num_sge > T3_MAX_SGE)
+		return -EINVAL;
+	err = iwch_sgl2pbl_map(rhp, wr->sg_list, wr->num_sge, pbl_addr,
+			       page_size);
+	if (err)
+		return err;
+	wqe->recv.pagesz[0] = page_size[0];
+	wqe->recv.pagesz[1] = page_size[1];
+	wqe->recv.pagesz[2] = page_size[2];
+	wqe->recv.pagesz[3] = page_size[3];
+	wqe->recv.num_sgle = cpu_to_be32(wr->num_sge);
+	for (i = 0; i < wr->num_sge; i++) {
+		wqe->recv.sgl[i].stag = cpu_to_be32(wr->sg_list[i].lkey);
+		wqe->recv.sgl[i].len = cpu_to_be32(wr->sg_list[i].length);
+
+		/* 'to' in the WQE is the offset into the page */
+		wqe->recv.sgl[i].to = cpu_to_be64(((u32) wr->sg_list[i].addr) %
+				(1UL << (12 + page_size[i])));
+
+		/* pbl_addr is the adapter's address in the PBL */
+		wqe->recv.pbl_addr[i] = cpu_to_be32(pbl_addr[i]);
+	}
+	for (; i < T3_MAX_SGE; i++) {
+		wqe->recv.sgl[i].stag = 0;
+		wqe->recv.sgl[i].len = 0;
+		wqe->recv.sgl[i].to = 0;
+		wqe->recv.pbl_addr[i] = 0;
+	}
+	return 0;
+}
+
+int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
+		      struct ib_send_wr **bad_wr)
+{
+	int err = 0;
+	u8 t3_wr_flit_cnt;
+	enum t3_wr_opcode t3_wr_opcode = 0;
+	enum t3_wr_flags t3_wr_flags;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr,
+		  qhp->wq.sq_size_log2);
+	if (num_wrs == 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	while (wr) {
+		if (num_wrs == 0) {
+			err = -ENOMEM;
+			*bad_wr = wr;
+			break;
+		}
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		t3_wr_flags = 0;
+		if (wr->send_flags & IB_SEND_SOLICITED)
+			t3_wr_flags |= T3_SOLICITED_EVENT_FLAG;
+		if (wr->send_flags & IB_SEND_FENCE)
+			t3_wr_flags |= T3_READ_FENCE_FLAG;
+		if (wr->send_flags & IB_SEND_SIGNALED)
+			t3_wr_flags |= T3_COMPLETION_FLAG;
+		sqp = qhp->wq.sq +
+		      Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+		switch (wr->opcode) {
+		case IB_WR_SEND:
+		case IB_WR_SEND_WITH_IMM:
+			t3_wr_opcode = T3_WR_SEND;
+			err = iwch_build_rdma_send(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_WRITE:
+		case IB_WR_RDMA_WRITE_WITH_IMM:
+			t3_wr_opcode = T3_WR_WRITE;
+			err = iwch_build_rdma_write(wqe, wr, &t3_wr_flit_cnt);
+			break;
+		case IB_WR_RDMA_READ:
+			t3_wr_opcode = T3_WR_READ;
+			t3_wr_flags = 0; /* T3 reads are always signaled */
+			err = iwch_build_rdma_read(wqe, wr, &t3_wr_flit_cnt);
+			if (err)
+				break;
+			sqp->read_len = wqe->read.local_len;
+			if (!qhp->wq.oldest_read)
+				qhp->wq.oldest_read = sqp;
+			break;
+		default:
+			PDBG("%s post of type=%d TBD!\n", __FUNCTION__,
+			     wr->opcode);
+			err = -EINVAL;
+		}
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+		sqp->wr_id = wr->wr_id;
+		sqp->opcode = wr2opcode(t3_wr_opcode);
+		sqp->sq_wptr = qhp->wq.sq_wptr;
+		sqp->complete = 0;
+		sqp->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+
+		build_fw_riwrh((void *) wqe, t3_wr_opcode, t3_wr_flags,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, t3_wr_flit_cnt);
+		PDBG("%s cookie 0x%llx wq idx 0x%x swsq idx %ld opcode %d\n",
+		     __FUNCTION__, wr->wr_id, idx,
+		     Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2),
+		     sqp->opcode);
+		wr = wr->next;
+		num_wrs--;
+		++(qhp->wq.wptr);
+		++(qhp->wq.sq_wptr);
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_post_receive(struct ib_qp *ibqp, struct ib_recv_wr *wr,
+		      struct ib_recv_wr **bad_wr)
+{
+	int err = 0;
+	struct iwch_qp *qhp;
+	u32 idx;
+	union t3_wr *wqe;
+	u32 num_wrs;
+	unsigned long flag;
+
+	qhp = to_iwch_qp(ibqp);
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.rq_rptr, qhp->wq.rq_wptr,
+			    qhp->wq.rq_size_log2) - 1;
+	if (!wr) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	while (wr) {
+		idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+		wqe = (union t3_wr *) (qhp->wq.queue + idx);
+		if (num_wrs)
+			err = iwch_build_rdma_recv(qhp->rhp, wqe, wr);
+		else
+			err = -ENOMEM;
+		if (err) {
+			*bad_wr = wr;
+			break;
+		}
+		qhp->wq.rq[Q_PTR2IDX(qhp->wq.rq_wptr, qhp->wq.rq_size_log2)] =
+			wr->wr_id;
+		build_fw_riwrh((void *) wqe, T3_WR_RCV, T3_COMPLETION_FLAG,
+			       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2),
+			       0, sizeof(struct t3_receive_wr) >> 3);
+		PDBG("%s cookie 0x%llx idx 0x%x rq_wptr 0x%x rq_rptr 0x%x "
+		     "wqe %p \n", __FUNCTION__, wr->wr_id, idx,
+		     qhp->wq.rq_wptr, qhp->wq.rq_rptr, wqe);
+		++(qhp->wq.rq_wptr);
+		++(qhp->wq.wptr);
+		wr = wr->next;
+		num_wrs--;
+	}
+	spin_unlock_irqrestore(&qhp->lock, flag);
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+	return err;
+}
+
+int iwch_bind_mw(struct ib_qp *qp,
+			     struct ib_mw *mw,
+			     struct ib_mw_bind *mw_bind)
+{
+	struct iwch_dev *rhp;
+	struct iwch_mw *mhp;
+	struct iwch_qp *qhp;
+	union t3_wr *wqe;
+	u32 pbl_addr;
+	u8 page_size;
+	u32 num_wrs;
+	unsigned long flag;
+	struct ib_sge sgl;
+	int err=0;
+	enum t3_wr_flags t3_wr_flags;
+	u32 idx;
+	struct t3_swsq *sqp;
+
+	qhp = to_iwch_qp(qp);
+	mhp = to_iwch_mw(mw);
+	rhp = qhp->rhp;
+
+	spin_lock_irqsave(&qhp->lock, flag);
+	if (qhp->attr.state > IWCH_QP_STATE_RTS) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -EINVAL;
+	}
+	num_wrs = Q_FREECNT(qhp->wq.sq_rptr, qhp->wq.sq_wptr,
+			    qhp->wq.sq_size_log2);
+	if ((num_wrs) <= 0) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return -ENOMEM;
+	}
+	idx = Q_PTR2IDX(qhp->wq.wptr, qhp->wq.size_log2);
+	PDBG("%s: idx 0x%0x, mw 0x%p, mw_bind 0x%p\n", __FUNCTION__, idx,
+	     mw, mw_bind);
+	wqe = (union t3_wr *) (qhp->wq.queue + idx);
+
+	t3_wr_flags = 0;
+	if (mw_bind->send_flags & IB_SEND_SIGNALED)
+		t3_wr_flags = T3_COMPLETION_FLAG;
+
+	sgl.addr = mw_bind->addr;
+	sgl.lkey = mw_bind->mr->lkey;
+	sgl.length = mw_bind->length;
+	wqe->bind.reserved = 0;
+	wqe->bind.type = T3_VA_BASED_TO;
+
+	/* TBD: check perms */
+	wqe->bind.perms = iwch_convert_access(mw_bind->mw_access_flags);
+	wqe->bind.mr_stag = cpu_to_be32(mw_bind->mr->lkey);
+	wqe->bind.mw_stag = cpu_to_be32(mw->rkey);
+	wqe->bind.mw_len = cpu_to_be32(mw_bind->length);
+	wqe->bind.mw_va = cpu_to_be64(mw_bind->addr);
+	err = iwch_sgl2pbl_map(rhp, &sgl, 1, &pbl_addr, &page_size);
+	if (err) {
+		spin_unlock_irqrestore(&qhp->lock, flag);
+		return err;
+	}
+	wqe->send.wrid.id0.hi = qhp->wq.sq_wptr;
+	sqp = qhp->wq.sq + Q_PTR2IDX(qhp->wq.sq_wptr, qhp->wq.sq_size_log2);
+	sqp->wr_id = mw_bind->wr_id;
+	sqp->opcode = T3_BIND_MW;
+	sqp->sq_wptr = qhp->wq.sq_wptr;
+	sqp->complete = 0;
+	sqp->signaled = (mw_bind->send_flags & IB_SEND_SIGNALED);
+	wqe->bind.mr_pbl_addr = cpu_to_be32(pbl_addr);
+	wqe->bind.mr_pagesz = page_size;
+	wqe->flit[T3_SQ_COOKIE_FLIT] = mw_bind->wr_id;
+	build_fw_riwrh((void *)wqe, T3_WR_BIND, t3_wr_flags,
+		       Q_GENBIT(qhp->wq.wptr, qhp->wq.size_log2), 0,
+			        sizeof(struct t3_bind_mw_wr) >> 3);
+	++(qhp->wq.wptr);
+	++(qhp->wq.sq_wptr);
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	ring_doorbell(qhp->wq.doorbell, qhp->wq.qpid);
+
+	return err;
+}
+
+static inline void build_term_codes(int t3err, u8 *layer_type, u8 *ecode,
+				    int tagged)
+{
+	switch (t3err) {
+	case TPT_ERR_STAG:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_STAG;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_INV_STAG;
+		}
+		break;
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_STAG_NOT_ASSOC;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_STAG_NOT_ASSOC;
+		}
+		break;
+	case TPT_ERR_WRAP:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+		*ecode = RDMAP_TO_WRAP;
+		break;
+	case TPT_ERR_BOUND:
+		if (tagged == 1) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_BASE_BOUNDS;
+		} else if (tagged == 2) {
+			*layer_type = LAYER_RDMAP|RDMAP_REMOTE_PROT;
+			*ecode = RDMAP_BASE_BOUNDS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_MSG_TOOBIG;
+		}
+		break;
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_CANT_INV_STAG;
+		break;
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR:
+		*layer_type = LAYER_RDMAP|RDMAP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_OUT_OF_RQE:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_NOBUF;
+		break;
+	case TPT_ERR_PBL_ADDR_BOUND:
+		*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+		*ecode = DDPT_BASE_BOUNDS;
+		break;
+	case TPT_ERR_CRC:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_CRC_ERR;
+		break;
+	case TPT_ERR_MARKER:
+		*layer_type = LAYER_MPA|DDP_LLP;
+		*ecode = MPA_MARKER_ERR;
+		break;
+	case TPT_ERR_PDU_LEN_ERR:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_MSG_TOOBIG;
+		break;
+	case TPT_ERR_DDP_VERSION:
+		if (tagged) {
+			*layer_type = LAYER_DDP|DDP_TAGGED_ERR;
+			*ecode = DDPT_INV_VERS;
+		} else {
+			*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+			*ecode = DDPU_INV_VERS;
+		}
+		break;
+	case TPT_ERR_RDMA_VERSION:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_VERS;
+		break;
+	case TPT_ERR_OPCODE:
+		*layer_type = LAYER_RDMAP|RDMAP_REMOTE_OP;
+		*ecode = RDMAP_INV_OPCODE;
+		break;
+	case TPT_ERR_DDP_QUEUE_NUM:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_QN;
+		break;
+	case TPT_ERR_MSN:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_IRD_OVERFLOW:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MSN_RANGE;
+		break;
+	case TPT_ERR_TBIT:
+		*layer_type = LAYER_DDP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	case TPT_ERR_MO:
+		*layer_type = LAYER_DDP|DDP_UNTAGGED_ERR;
+		*ecode = DDPU_INV_MO;
+		break;
+	default:
+		*layer_type = LAYER_RDMAP|DDP_LOCAL_CATA;
+		*ecode = 0;
+		break;
+	}
+}
+
+/*
+ * This posts a TERMINATE with layer=RDMA, type=catastrophic.
+ */
+int iwch_post_terminate(struct iwch_qp *qhp, struct respQ_msg_t *rsp_msg)
+{
+	union t3_wr *wqe;
+	struct terminate_message *term;
+	int status;
+	int tagged = 0;
+	struct sk_buff *skb;
+
+	PDBG("%s %d\n", __FUNCTION__, __LINE__);
+	skb = alloc_skb(40, GFP_ATOMIC);
+	if (!skb) {
+		printk(KERN_ERR "%s cannot send TERMINATE!\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (union t3_wr *)skb_put(skb, 40);
+	memset(wqe, 0, 40);
+	wqe->send.rdmaop = T3_TERMINATE;
+	
+	/* immediate data length */
+	wqe->send.plen = htonl(4);
+
+	/* immediate data starts here. */
+	term = (struct terminate_message *)wqe->send.sgl;
+	if (rsp_msg) {
+		status = CQE_STATUS(rsp_msg->cqe);
+		if (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)
+			tagged = 1;
+		if ((CQE_OPCODE(rsp_msg->cqe) == T3_READ_REQ) ||
+		    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP))
+			tagged = 2;
+	} else {
+		status = TPT_ERR_INTERNAL_ERR;
+	}
+	build_term_codes(status, &term->layer_etype, &term->ecode, tagged);
+	build_fw_riwrh((void *)wqe, T3_WR_SEND,
+		       T3_COMPLETION_FLAG | T3_NOTIFY_FLAG, 1,
+		       qhp->ep->hwtid, 5);
+	skb->priority = CPL_PRIORITY_DATA;
+	return (cxgb3_ofld_send(qhp->rhp->rdev.t3cdev_p, skb));
+}
+
+/*
+ * Assumes qhp lock is held.
+ */
+static void __flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	struct iwch_cq *rchp, *schp;
+	int count;
+
+	rchp = get_chp(qhp->rhp, qhp->attr.rcq);
+	schp = get_chp(qhp->rhp, qhp->attr.scq);
+	
+	PDBG("%s qhp %p rchp %p schp %p\n", __FUNCTION__, qhp, rchp, schp);
+	/* take a ref on the qhp since we must release the lock */
+	atomic_inc(&qhp->refcnt);
+	spin_unlock_irqrestore(&qhp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&rchp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&rchp->cq);
+	cxio_count_rcqes(&rchp->cq, &qhp->wq, &count);
+	cxio_flush_rq(&qhp->wq, &rchp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&rchp->lock, *flag);
+
+	/* locking hierarchy: cq lock first, then qp lock. */
+	spin_lock_irqsave(&schp->lock, *flag);
+	spin_lock(&qhp->lock);
+	cxio_flush_hw_cq(&schp->cq);
+	cxio_count_scqes(&schp->cq, &qhp->wq, &count);
+	cxio_flush_sq(&qhp->wq, &schp->cq, count);
+	spin_unlock(&qhp->lock);
+	spin_unlock_irqrestore(&schp->lock, *flag);
+
+	/* deref */
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+
+	spin_lock_irqsave(&qhp->lock, *flag);
+}
+
+static inline void flush_qp(struct iwch_qp *qhp, unsigned long *flag)
+{
+	if (t3b_device(qhp->rhp))
+		cxio_set_wq_in_error(&qhp->wq);
+	else
+		__flush_qp(qhp, flag);
+}
+
+
+/*
+ * Return non zero if at least one RECV was pre-posted.
+ */
+static inline int rqes_posted(struct iwch_qp *qhp)
+{
+	return (fw_riwrh_opcode((struct fw_riwrh *)qhp->wq.queue) == T3_WR_RCV);
+}
+
+static int rdma_init(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs)
+{
+	struct t3_rdma_init_attr init_attr;
+	int ret;
+
+	init_attr.tid = qhp->ep->hwtid;
+	init_attr.qpid = qhp->wq.qpid;
+	init_attr.pdid = qhp->attr.pd;
+	init_attr.scqid = qhp->attr.scq;
+	init_attr.rcqid = qhp->attr.rcq;
+	init_attr.rq_addr = qhp->wq.rq_addr;
+	init_attr.rq_size = 1 << qhp->wq.rq_size_log2;
+	init_attr.mpaattrs = uP_RI_MPA_IETF_ENABLE |
+		qhp->attr.mpa_attr.recv_marker_enabled |
+		(qhp->attr.mpa_attr.xmit_marker_enabled << 1) |
+		(qhp->attr.mpa_attr.crc_enabled << 2);
+
+	/*
+	 * XXX - The IWCM doesn't quite handle getting these
+ 	 * attrs set before going into RTS.  For now, just turn
+	 * them on always...
+	 */
+#if 0
+	init_attr.qpcaps = qhp->attr.enable_rdma_read |
+		(qhp->attr.enable_rdma_write << 1) |
+		(qhp->attr.enable_bind << 2) |
+		(qhp->attr.enable_stag0_fastreg << 3) |
+		(qhp->attr.enable_stag0_fastreg << 4);
+#else
+	init_attr.qpcaps = 0x1f;
+#endif
+	init_attr.tcp_emss = qhp->ep->emss;
+	init_attr.ord = qhp->attr.max_ord;
+	init_attr.ird = qhp->attr.max_ird;
+	init_attr.qp_dma_addr = qhp->wq.dma_addr;
+	init_attr.qp_dma_size = (1UL << qhp->wq.size_log2);
+	init_attr.flags = rqes_posted(qhp) ? RECVS_POSTED : 0;
+	PDBG("%s init_attr.rq_addr 0x%x init_attr.rq_size = %d "
+	     "flags 0x%x qpcaps 0x%x\n", __FUNCTION__,
+	     init_attr.rq_addr, init_attr.rq_size,
+	     init_attr.flags, init_attr.qpcaps);
+	ret = cxio_rdma_init(&rhp->rdev, &init_attr);
+	PDBG("%s ret %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+int iwch_modify_qp(struct iwch_dev *rhp, struct iwch_qp *qhp,
+				enum iwch_qp_attr_mask mask,
+				struct iwch_qp_attributes *attrs,
+				int internal)
+{
+	int ret = 0;
+	struct iwch_qp_attributes newattr = qhp->attr;
+	unsigned long flag;
+	int disconnect = 0;
+	int terminate = 0;
+	int abort = 0;
+	int free = 0;
+	struct iwch_ep *ep = NULL;
+
+	PDBG("%s qhp %p qpid 0x%x ep %p state %d -> %d\n", __FUNCTION__,
+	     qhp, qhp->wq.qpid, qhp->ep, qhp->attr.state,
+	     (mask & IWCH_QP_ATTR_NEXT_STATE) ? attrs->next_state : -1);
+
+	spin_lock_irqsave(&qhp->lock, flag);
+
+	/* Process attr changes if in IDLE */
+	if (mask & IWCH_QP_ATTR_VALID_MODIFY) {
+		if (qhp->attr.state != IWCH_QP_STATE_IDLE) {
+			ret = -EIO;
+			goto out;
+		}
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_READ)
+			newattr.enable_rdma_read = attrs->enable_rdma_read;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_WRITE)
+			newattr.enable_rdma_write = attrs->enable_rdma_write;
+		if (mask & IWCH_QP_ATTR_ENABLE_RDMA_BIND)
+			newattr.enable_bind = attrs->enable_bind;
+		if (mask & IWCH_QP_ATTR_MAX_ORD) {
+			if (attrs->max_ord >
+			    rhp->attr.max_rdma_read_qp_depth) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ord = attrs->max_ord;
+		}
+		if (mask & IWCH_QP_ATTR_MAX_IRD) {
+			if (attrs->max_ird >
+		  	    rhp->attr.max_rdma_reads_per_qp) {
+				ret = -EINVAL;
+				goto out;
+			}
+			newattr.max_ird = attrs->max_ird;
+		}
+		qhp->attr = newattr;
+	}
+	
+	if (!(mask & IWCH_QP_ATTR_NEXT_STATE))
+		goto out;
+	if (qhp->attr.state == attrs->next_state)
+		goto out;
+
+	switch (qhp->attr.state) {
+	case IWCH_QP_STATE_IDLE:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_RTS:
+			if (!(mask & IWCH_QP_ATTR_LLP_STREAM_HANDLE)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			if (!(mask & IWCH_QP_ATTR_MPA_ATTR)) {
+				ret = -EINVAL;
+				goto out;
+			}
+			qhp->attr.mpa_attr = attrs->mpa_attr;
+			qhp->attr.llp_stream_handle = attrs->llp_stream_handle;
+			qhp->ep = qhp->attr.llp_stream_handle;
+			qhp->attr.state = IWCH_QP_STATE_RTS;
+
+			/*
+			 * Ref the endpoint here and deref when we
+	 		 * disassociate the endpoint from the QP.  This
+			 * happens in CLOSING->IDLE transition or *->ERROR
+			 * transition.
+			 */
+			get_ep(&qhp->ep->com);
+			spin_unlock_irqrestore(&qhp->lock, flag);
+			ret = rdma_init(rhp, qhp, mask, attrs);
+			spin_lock_irqsave(&qhp->lock, flag);
+			if (ret)
+				goto err;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			flush_qp(qhp, &flag);
+			break;
+		default:
+			ret = -EINVAL;	
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_RTS:
+		switch (attrs->next_state) {
+		case IWCH_QP_STATE_CLOSING:
+			BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+			qhp->attr.state = IWCH_QP_STATE_CLOSING;
+			if (!internal) {
+				abort=0;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			break;
+		case IWCH_QP_STATE_TERMINATE:
+			qhp->attr.state = IWCH_QP_STATE_TERMINATE;
+			if (!internal)
+				terminate = 1;
+			break;
+		case IWCH_QP_STATE_ERROR:
+			qhp->attr.state = IWCH_QP_STATE_ERROR;
+			if (!internal) {
+				abort=1;
+				disconnect = 1;
+				ep = qhp->ep;
+			}
+			goto err;
+			break;
+		default:
+			ret = -EINVAL;
+			goto out;
+		}
+		break;
+	case IWCH_QP_STATE_CLOSING:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		switch (attrs->next_state) {
+			case IWCH_QP_STATE_IDLE:
+				qhp->attr.state = IWCH_QP_STATE_IDLE;
+				qhp->attr.llp_stream_handle = NULL;
+				put_ep(&qhp->ep->com);
+				qhp->ep = NULL;
+				wake_up(&qhp->wait);
+				break;
+			case IWCH_QP_STATE_ERROR:
+				goto err;
+			default:
+				ret = -EINVAL;
+				goto err;
+		}
+		break;
+	case IWCH_QP_STATE_ERROR:
+		if (attrs->next_state != IWCH_QP_STATE_IDLE) {
+			ret = -EINVAL;
+			goto out;
+		}
+		
+		if (!Q_EMPTY(qhp->wq.sq_rptr, qhp->wq.sq_wptr) ||
+		    !Q_EMPTY(qhp->wq.rq_rptr, qhp->wq.rq_wptr)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		qhp->attr.state = IWCH_QP_STATE_IDLE;
+		memset(&qhp->attr, 0, sizeof(qhp->attr));
+		break;
+	case IWCH_QP_STATE_TERMINATE:
+		if (!internal) {
+			ret = -EINVAL;
+			goto out;
+		}
+		goto err;
+		break;
+	default:
+		printk(KERN_ERR "%s in a bad state %d\n",
+		       __FUNCTION__, qhp->attr.state);
+		ret = -EINVAL;
+		goto err;
+		break;
+	}
+	goto out;
+err:
+	PDBG("%s disassociating ep %p qpid 0x%x\n", __FUNCTION__, qhp->ep,
+	     qhp->wq.qpid);
+
+	/* disassociate the LLP connection */
+	qhp->attr.llp_stream_handle = NULL;
+	ep = qhp->ep;
+	qhp->ep = NULL;
+	qhp->attr.state = IWCH_QP_STATE_ERROR;
+	free=1;
+	wake_up(&qhp->wait);
+	BUG_ON(!ep);
+	flush_qp(qhp, &flag);
+out:
+	spin_unlock_irqrestore(&qhp->lock, flag);
+
+	if (terminate)
+		iwch_post_terminate(qhp, NULL);
+
+	/*
+	 * If disconnect is 1, then we need to initiate a disconnect
+	 * on the EP.  This can be a normal close (RTS->CLOSING) or
+	 * an abnormal close (RTS/CLOSING->ERROR).
+	 */
+	if (disconnect)
+		iwch_ep_disconnect(ep, abort, GFP_KERNEL);
+
+	/*
+	 * If free is 1, then we've disassociated the EP from the QP
+	 * and we need to dereference the EP.
+	 */
+	if (free)
+		put_ep(&ep->com);
+
+	PDBG("%s exit state %d\n", __FUNCTION__, qhp->attr.state);
+	return ret;
+}
+
+static int quiesce_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_quiesce_tid(qhp->ep);
+	qhp->flags |= QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+static int resume_qp(struct iwch_qp *qhp)
+{
+	spin_lock_irq(&qhp->lock);
+	iwch_resume_tid(qhp->ep);
+	qhp->flags &= ~QP_QUIESCED;
+	spin_unlock_irq(&qhp->lock);
+	return 0;
+}
+
+int iwch_quiesce_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && !qp_quiesced(qhp)) {
+			quiesce_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && !qp_quiesced(qhp))
+			quiesce_qp(qhp);
+	}
+	return 0;
+}
+
+int iwch_resume_qps(struct iwch_cq *chp)
+{
+	int i;
+	struct iwch_qp *qhp;
+
+	for (i=0; i < T3_MAX_NUM_QP; i++) {
+		qhp = get_qhp(chp->rhp, i);
+		if (!qhp)
+			continue;
+		if ((qhp->attr.rcq == chp->cq.cqid) && qp_quiesced(qhp)) {
+			resume_qp(qhp);
+			continue;
+		}
+		if ((qhp->attr.scq == chp->cq.cqid) && qp_quiesced(qhp))
+			resume_qp(qhp);
+	}
+	return 0;
+}
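
A note on the queue arithmetic in iwch_post_send(), iwch_post_receive()
and iwch_bind_mw() above. The read and write pointers look like
free-running counters that are only reduced to a slot index when the
queue is touched, with the generation bit derived from the same counter.
The macro bodies are in cxio_wr.h, which is not part of this message, so
the definitions below are assumptions made for illustration and every
demo_ name is hypothetical. A minimal userspace sketch:

```c
#include <stdint.h>

/*
 * Assumed stand-ins for Q_PTR2IDX, Q_FREECNT and Q_GENBIT.
 * The real macros are in cxio_wr.h and may differ.
 */
static uint32_t demo_ptr2idx(uint32_t ptr, uint32_t size_log2)
{
	return ptr & ((1U << size_log2) - 1);	/* free-running ptr -> slot */
}

static uint32_t demo_freecnt(uint32_t rptr, uint32_t wptr, uint32_t size_log2)
{
	/* unsigned subtraction keeps working after the counters wrap */
	return (1U << size_log2) - (wptr - rptr);
}

static int demo_genbit(uint32_t ptr, uint32_t size_log2)
{
	return !((ptr >> size_log2) & 0x1);	/* flips on each pass over the ring */
}

/*
 * Post up to 'want' entries, advancing *wptr the way the while (wr)
 * loop does.  Returns how many were accepted; a real caller would
 * report -ENOMEM for the remainder.
 */
static uint32_t demo_post(uint32_t rptr, uint32_t *wptr, uint32_t size_log2,
			  uint32_t want)
{
	uint32_t posted = 0;

	while (posted < want && demo_freecnt(rptr, *wptr, size_log2) > 0) {
		++(*wptr);
		++posted;
	}
	return posted;
}
```

With size_log2 = 2 and an empty queue, asking demo_post() for six
entries accepts four. The next slot index is then back at 0 and the
generation bit for that slot has flipped, which is how the consumer can
tell a fresh entry from a stale one under these assumptions.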


From swise at opengridcomputing.com  Wed Dec 20 11:20:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:20:55 -0600
Subject: [openib-general] [PATCH  v5 06/13] iw_cxgb3 Completion Queues
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192055.19316.62329.stgit@dell3.ogc.int>


Functions to manipulate CQs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_cq.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c b/drivers/infiniband/hw/cxgb3/iwch_cq.c
new file mode 100644
index 0000000..ff09509
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include "iwch_provider.h"
+#include "iwch.h"
+
+/*
+ * Get one cq entry from cxio and map it to openib.
+ *
+ * Returns:
+ * 	0 			EMPTY;
+ *	1			cqe returned
+ *	-EAGAIN 		caller must try again
+ * 	any other -errno	fatal error
+ */
+int iwch_poll_cq_one(struct iwch_dev *rhp, struct iwch_cq *chp,
+		     struct ib_wc *wc)
+{
+	struct iwch_qp *qhp = NULL;
+	struct t3_cqe cqe, *rd_cqe;
+	struct t3_wq *wq;
+	u32 credit = 0;
+	u8 cqe_flushed;
+	u64 cookie;
+	int ret = 1;
+
+	rd_cqe = cxio_next_cqe(&chp->cq);
+
+	if (!rd_cqe)
+		return 0;
+
+	qhp = get_qhp(rhp, CQE_QPID(*rd_cqe));
+	if (!qhp)
+		wq = NULL;
+	else {
+		spin_lock(&qhp->lock);
+		wq = &(qhp->wq);
+	}
+	ret = cxio_poll_cq(wq, &(chp->cq), &cqe, &cqe_flushed, &cookie,
+				   &credit);
+	if (t3a_device(chp->rhp) && credit) {
+		PDBG("%s updating %d cq credits on id %d\n", __FUNCTION__,
+		     credit, chp->cq.cqid);
+		cxio_hal_cq_op(&rhp->rdev, &chp->cq, CQ_CREDIT_UPDATE, credit);
+	}
+
+	if (ret) {
+		ret = -EAGAIN;
+		goto out;
+	}
+	ret = 1;
+
+	wc->wr_id = cookie;
+	wc->qp_num = qhp->wq.qpid;
+	wc->vendor_err = CQE_STATUS(cqe);
+
+	PDBG("%s qpid 0x%x type %d opcode %d status 0x%x wrid hi 0x%x "
+	     "lo 0x%x cookie 0x%llx\n", __FUNCTION__,
+	     CQE_QPID(cqe), CQE_TYPE(cqe),
+	     CQE_OPCODE(cqe), CQE_STATUS(cqe), CQE_WRID_HI(cqe),
+	     CQE_WRID_LOW(cqe), cookie);
+
+	if (CQE_TYPE(cqe) == 0) {
+		if (!CQE_STATUS(cqe))
+			wc->byte_len = CQE_LEN(cqe);
+		else
+			wc->byte_len = 0;
+		wc->opcode = IB_WC_RECV;
+	} else {
+		switch (CQE_OPCODE(cqe)) {
+		case T3_RDMA_WRITE:
+			wc->opcode = IB_WC_RDMA_WRITE;
+			break;
+		case T3_READ_REQ:
+			wc->opcode = IB_WC_RDMA_READ;
+			wc->byte_len = CQE_LEN(cqe);
+			break;
+		case T3_SEND:
+		case T3_SEND_WITH_SE:
+			wc->opcode = IB_WC_SEND;
+			break;
+		case T3_BIND_MW:
+			wc->opcode = IB_WC_BIND_MW;
+			break;
+
+		/* these aren't supported yet */
+		case T3_SEND_WITH_INV:
+		case T3_SEND_WITH_SE_INV:
+		case T3_LOCAL_INV:
+		case T3_FAST_REGISTER:
+		default:
+			printk(KERN_ERR MOD "Unexpected opcode %d "
+			       "in the CQE received for QPID=0x%0x\n",
+			       CQE_OPCODE(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+			goto out;
+		}
+	}
+
+	if (cqe_flushed)
+		wc->status = IB_WC_WR_FLUSH_ERR;
+	else {
+		
+		switch (CQE_STATUS(cqe)) {
+		case TPT_ERR_SUCCESS:
+			wc->status = IB_WC_SUCCESS;
+			break;
+		case TPT_ERR_STAG:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_PDID:
+			wc->status = IB_WC_LOC_PROT_ERR;
+			break;
+		case TPT_ERR_QPID:
+		case TPT_ERR_ACCESS:
+			wc->status = IB_WC_LOC_ACCESS_ERR;
+			break;
+		case TPT_ERR_WRAP:
+			wc->status = IB_WC_GENERAL_ERR;
+			break;
+		case TPT_ERR_BOUND:
+			wc->status = IB_WC_LOC_LEN_ERR;
+			break;
+		case TPT_ERR_INVALIDATE_SHARED_MR:
+		case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+			wc->status = IB_WC_MW_BIND_ERR;
+			break;
+		case TPT_ERR_CRC:
+		case TPT_ERR_MARKER:
+		case TPT_ERR_PDU_LEN_ERR:
+		case TPT_ERR_OUT_OF_RQE:
+		case TPT_ERR_DDP_VERSION:
+		case TPT_ERR_RDMA_VERSION:
+		case TPT_ERR_DDP_QUEUE_NUM:
+		case TPT_ERR_MSN:
+		case TPT_ERR_TBIT:
+		case TPT_ERR_MO:
+		case TPT_ERR_MSN_RANGE:
+		case TPT_ERR_IRD_OVERFLOW:
+		case TPT_ERR_OPCODE:
+			wc->status = IB_WC_FATAL_ERR;
+			break;
+		case TPT_ERR_SWFLUSH:
+			wc->status = IB_WC_WR_FLUSH_ERR;
+			break;
+		default:
+			printk(KERN_ERR MOD "Unexpected cqe_status 0x%x for "
+			       "QPID=0x%0x\n", CQE_STATUS(cqe), CQE_QPID(cqe));
+			ret = -EINVAL;
+		}
+	}
+out:
+	if (wq)
+		spin_unlock(&qhp->lock);
+	return ret;
+}
+
+int iwch_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct iwch_dev *rhp;
+	struct iwch_cq *chp;
+	unsigned long flags;
+	int npolled;
+	int err = 0;
+
+	chp = to_iwch_cq(ibcq);
+	rhp = chp->rhp;
+
+	spin_lock_irqsave(&chp->lock, flags);
+	for (npolled = 0; npolled < num_entries; ++npolled) {
+#ifdef DEBUG
+		int i=0;
+#endif
+
+		/*
+	 	 * Because T3 can post CQEs that are _not_ associated
+	 	 * with a WR, we might have to poll again after removing
+	 	 * one of these.
+		 */
+		do {
+			err = iwch_poll_cq_one(rhp, chp, wc + npolled);
+#ifdef DEBUG
+			BUG_ON(++i > 1000);
+#endif
+		} while (err == -EAGAIN);
+		if (err <= 0)
+			break;
+	}
+	spin_unlock_irqrestore(&chp->lock, flags);
+
+	if (err < 0)
+		return err;
+	else {
+		return npolled;
+	}
+}
+
+int iwch_modify_cq(struct ib_cq *cq, int cqe)
+{
+	PDBG("iwch_modify_cq: TBD\n");
+	return 0;
+}
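
A note on the return convention used by iwch_poll_cq_one() and the retry
loop in iwch_poll_cq() above: 0 means the CQ is empty, 1 means a work
completion was filled in, and -EAGAIN means a CQE was consumed but had
nothing to report, so the caller polls again. The sketch below models
only that control flow in userspace. The demo_ types and the has_wr
field are hypothetical stand-ins, not the driver's structures.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical CQE: has_wr == 0 models a CQE not tied to any WR. */
struct demo_cqe {
	int has_wr;
	unsigned long long cookie;
};

struct demo_cq {
	const struct demo_cqe *ring;
	size_t len;
	size_t next;
};

/* Returns 0 if empty, 1 if *wr_id was filled in, -EAGAIN to retry. */
static int demo_poll_one(struct demo_cq *cq, unsigned long long *wr_id)
{
	const struct demo_cqe *cqe;

	if (cq->next == cq->len)
		return 0;
	cqe = &cq->ring[cq->next++];
	if (!cqe->has_wr)
		return -EAGAIN;		/* consumed, nothing to report */
	*wr_id = cqe->cookie;
	return 1;
}

/* Same shape as iwch_poll_cq(): retry on -EAGAIN, stop on empty/error. */
static int demo_poll(struct demo_cq *cq, int num_entries,
		     unsigned long long *wr_ids)
{
	int npolled, err = 0;

	for (npolled = 0; npolled < num_entries; ++npolled) {
		do {
			err = demo_poll_one(cq, wr_ids + npolled);
		} while (err == -EAGAIN);
		if (err <= 0)
			break;
	}
	return err < 0 ? err : npolled;
}
```

Entries that return -EAGAIN never occupy a slot in the caller's array,
so the count handed back is the number of real completions, not the
number of CQEs consumed.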


From swise at opengridcomputing.com  Wed Dec 20 11:21:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:21:25 -0600
Subject: [openib-general] [PATCH  v5 07/13] iw_cxgb3 Async Event Handler
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192125.19316.92319.stgit@dell3.ogc.int>


Code to handle async events coming from the T3 RDMA Core.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/iwch_ev.c |  231 +++++++++++++++++++++++++++++++++
 1 files changed, 231 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_ev.c b/drivers/infiniband/hw/cxgb3/iwch_ev.c
new file mode 100644
index 0000000..646f612
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_ev.c
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/slab.h>
+#include <linux/mman.h>
+#include <net/sock.h>
+#include "iwch_provider.h"
+#include "iwch.h"
+#include "iwch_cm.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+static void post_qp_event(struct iwch_dev *rnicp, struct iwch_cq *chp,
+			  struct respQ_msg_t *rsp_msg,
+			  enum ib_event_type ib_event,
+			  int send_term)
+{
+	struct ib_event event;
+	struct iwch_qp_attributes attrs;
+	struct iwch_qp *qhp;
+
+	printk(KERN_ERR "%s - AE qpid 0x%x opcode %d status 0x%x "
+	       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__,
+	       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe),
+	       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+	       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+
+	spin_lock(&rnicp->lock);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+
+	if (!qhp) {
+		printk(KERN_ERR "%s unaffiliated error 0x%x qpid 0x%x\n",
+		       __FUNCTION__, CQE_STATUS(rsp_msg->cqe),
+		       CQE_QPID(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	if ((qhp->attr.state == IWCH_QP_STATE_ERROR) ||
+	    (qhp->attr.state == IWCH_QP_STATE_TERMINATE)) {
+		PDBG("%s AE received after RTS - "
+		     "qp state %d qpid 0x%x status 0x%x\n", __FUNCTION__,
+		     qhp->attr.state, qhp->wq.qpid, CQE_STATUS(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		return;
+	}
+
+	atomic_inc(&qhp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	event.event = ib_event;
+	event.device = chp->ibcq.device;
+	if (ib_event == IB_EVENT_CQ_ERR)
+		event.element.cq = &chp->ibcq;
+	else
+		event.element.qp = &qhp->ibqp;
+
+	if (qhp->ibqp.event_handler)
+		(*qhp->ibqp.event_handler)(&event, qhp->ibqp.qp_context);
+
+	if (qhp->attr.state == IWCH_QP_STATE_RTS) {
+		attrs.next_state = IWCH_QP_STATE_TERMINATE;
+		iwch_modify_qp(qhp->rhp, qhp, IWCH_QP_ATTR_NEXT_STATE,
+			       &attrs, 1);
+		if (send_term)
+			iwch_post_terminate(qhp, rsp_msg);
+	}
+
+	if (atomic_dec_and_test(&qhp->refcnt))
+		wake_up(&qhp->wait);
+}
+
+void iwch_ev_dispatch(struct cxio_rdev *rdev_p, struct sk_buff *skb)
+{
+	struct iwch_dev *rnicp;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	struct iwch_cq *chp;
+	struct iwch_qp *qhp;
+	u32 cqid = RSPQ_CQID(rsp_msg);
+
+	rnicp = (struct iwch_dev *) rdev_p->ulp;
+	spin_lock(&rnicp->lock);
+	chp = get_chp(rnicp, cqid);
+	qhp = get_qhp(rnicp, CQE_QPID(rsp_msg->cqe));
+	if (!chp || !qhp) {
+		printk(KERN_ERR MOD "BAD AE cqid 0x%x qpid 0x%x opcode %d "
+		       "status 0x%x type %d wrid.hi 0x%x wrid.lo 0x%x \n",
+		       cqid, CQE_QPID(rsp_msg->cqe),
+		       CQE_OPCODE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe),
+		       CQE_TYPE(rsp_msg->cqe), CQE_WRID_HI(rsp_msg->cqe),
+		       CQE_WRID_LOW(rsp_msg->cqe));
+		spin_unlock(&rnicp->lock);
+		goto out;
+	}
+	iwch_qp_add_ref(&qhp->ibqp);
+	atomic_inc(&chp->refcnt);
+	spin_unlock(&rnicp->lock);
+
+	/*
+	 * 1) completion of our sending a TERMINATE.
+	 * 2) incoming TERMINATE message.
+	 */
+	if ((CQE_OPCODE(rsp_msg->cqe) == T3_TERMINATE) &&
+	    (CQE_STATUS(rsp_msg->cqe) == 0)) {
+		if (SQ_TYPE(rsp_msg->cqe)) {
+			PDBG("%s QPID 0x%x ep %p disconnecting\n",
+			     __FUNCTION__, qhp->wq.qpid, qhp->ep);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		} else {
+			PDBG("%s post REQ_ERR AE QPID 0x%x\n", __FUNCTION__,
+			     qhp->wq.qpid);
+			post_qp_event(rnicp, chp, rsp_msg,
+				      IB_EVENT_QP_REQ_ERR, 0);
+			iwch_ep_disconnect(qhp->ep, 0, GFP_ATOMIC);
+		}
+		goto done;
+	}
+
+	/* Bad incoming Read request */
+	if (SQ_TYPE(rsp_msg->cqe) &&
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_READ_RESP)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	/* Bad incoming write */
+	if (RQ_TYPE(rsp_msg->cqe) &&
+	    (CQE_OPCODE(rsp_msg->cqe) == T3_RDMA_WRITE)) {
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_REQ_ERR, 1);
+		goto done;
+	}
+
+	switch (CQE_STATUS(rsp_msg->cqe)) {
+
+	/* Completion Events */
+	case TPT_ERR_SUCCESS:
+
+		/*
+		 * Confirm the destination entry if this is a RECV completion.
+		 */
+		if (qhp->ep && SQ_TYPE(rsp_msg->cqe))
+			dst_confirm(qhp->ep->dst);
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		break;
+
+	case TPT_ERR_STAG:
+	case TPT_ERR_PDID:
+	case TPT_ERR_QPID:
+	case TPT_ERR_ACCESS:
+	case TPT_ERR_WRAP:
+	case TPT_ERR_BOUND:
+	case TPT_ERR_INVALIDATE_SHARED_MR:
+	case TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND:
+		printk(KERN_ERR "%s - CQE Err qpid 0x%x opcode %d status 0x%x "
+		       "type %d wrid.hi 0x%x wrid.lo 0x%x \n", __FUNCTION__,
+		       CQE_QPID(rsp_msg->cqe), CQE_OPCODE(rsp_msg->cqe),
+		       CQE_STATUS(rsp_msg->cqe), CQE_TYPE(rsp_msg->cqe),
+		       CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+		(*chp->ibcq.comp_handler)(&chp->ibcq, chp->ibcq.cq_context);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_ACCESS_ERR, 1);
+		break;
+
+	/* Device Fatal Errors */
+	case TPT_ERR_ECC:
+	case TPT_ERR_ECC_PSTAG:
+	case TPT_ERR_INTERNAL_ERR:
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_DEVICE_FATAL, 1);
+		break;
+
+	/* QP Fatal Errors */
+	case TPT_ERR_OUT_OF_RQE:
+	case TPT_ERR_PBL_ADDR_BOUND:
+	case TPT_ERR_CRC:
+	case TPT_ERR_MARKER:
+	case TPT_ERR_PDU_LEN_ERR:
+	case TPT_ERR_DDP_VERSION:
+	case TPT_ERR_RDMA_VERSION:
+	case TPT_ERR_OPCODE:
+	case TPT_ERR_DDP_QUEUE_NUM:
+	case TPT_ERR_MSN:
+	case TPT_ERR_TBIT:
+	case TPT_ERR_MO:
+	case TPT_ERR_MSN_GAP:
+	case TPT_ERR_MSN_RANGE:
+	case TPT_ERR_RQE_ADDR_BOUND:
+	case TPT_ERR_IRD_OVERFLOW:
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+
+	default:
+		printk(KERN_ERR MOD "Unknown T3 status 0x%x QPID 0x%x\n",
+		       CQE_STATUS(rsp_msg->cqe), qhp->wq.qpid);
+		post_qp_event(rnicp, chp, rsp_msg, IB_EVENT_QP_FATAL, 1);
+		break;
+	}
+done:
+	if (atomic_dec_and_test(&chp->refcnt))
+		wake_up(&chp->wait);
+	iwch_qp_rem_ref(&qhp->ibqp);
+out:
+	dev_kfree_skb_irq(skb);
+}


From swise at opengridcomputing.com  Wed Dec 20 11:21:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:21:55 -0600
Subject: [openib-general] [PATCH  v5 08/13] iw_cxgb3 Memory Registration
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192155.19316.73702.stgit@dell3.ogc.int>


Functions to register memory regions.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
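
For reviewers, here is a minimal user-space sketch of the page-shift
selection and page counting that build_phys_page_list() performs in this
patch. It is an illustration only: struct phys_buf, SKETCH_PAGE_SHIFT,
SKETCH_MAX_SHIFT, pick_page_shift() and count_pages() are hypothetical
names, a 4 KB base page is assumed, and unlike the driver code the sketch
does not modify the buffer list in place.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct ib_phys_buf; not the kernel definition. */
struct phys_buf {
	uint64_t addr;
	uint64_t size;
};

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KB base pages */
#define SKETCH_MAX_SHIFT  27	/* same upper bound as the loop in the patch */

/*
 * Pick a page shift for the buffer list.  With several buffers, stop at
 * the lowest address bit set in any buffer after the first.  With one
 * buffer, stop at the first page size that covers the whole buffer
 * including its offset into the page.
 */
static int pick_page_shift(const struct phys_buf *bufs, int nbufs)
{
	uint64_t mask = 0;
	int shift, i;

	for (i = 1; i < nbufs; i++)
		mask |= bufs[i].addr;

	for (shift = SKETCH_PAGE_SHIFT; shift < SKETCH_MAX_SHIFT; shift++) {
		if (nbufs > 1) {
			if ((1ULL << shift) & mask)
				break;
		} else {
			uint64_t off = bufs[0].addr & ((1ULL << shift) - 1);

			if ((1ULL << shift) >= bufs[0].size + off)
				break;
		}
	}
	return shift;
}

/*
 * Count the pages needed at the chosen shift.  The first buffer is
 * extended backwards to its page boundary, as the patch does by
 * growing buffer_list[0].size.
 */
static uint64_t count_pages(const struct phys_buf *bufs, int nbufs, int shift)
{
	uint64_t n = 0;
	int i;

	for (i = 0; i < nbufs; i++) {
		uint64_t size = bufs[i].size;

		if (i == 0)
			size += bufs[0].addr & ((1ULL << shift) - 1);
		n += (size + (1ULL << shift) - 1) >> shift;
	}
	return n;
}
```

With a single 12 KB buffer at address 0x1000, the sketch settles on a
16 KB page (shift 14) and one page-list entry.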

 drivers/infiniband/hw/cxgb3/iwch_mem.c |  170 ++++++++++++++++++++++++++++++++
 1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_mem.c b/drivers/infiniband/hw/cxgb3/iwch_mem.c
new file mode 100644
index 0000000..5909ec5
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/iwch_mem.c
@@ -0,0 +1,170 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/byteorder.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+
+#include "cxio_hal.h"
+#include "iwch.h"
+#include "iwch_provider.h"
+
+int iwch_register_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	if (cxio_register_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int iwch_reregister_mem(struct iwch_dev *rhp, struct iwch_pd *php,
+					struct iwch_mr *mhp,
+					int shift,
+					__be64 *page_list,
+					int npages)
+{
+	u32 stag;
+	u32 mmid;
+
+
+	/* We could support this... */
+	if (npages > mhp->attr.pbl_size)
+		return -ENOMEM;
+
+	stag = mhp->attr.stag;
+	if (cxio_reregister_phys_mem(&rhp->rdev,
+				   &stag, mhp->attr.pdid,
+				   mhp->attr.perms,
+				   mhp->attr.zbva,
+				   mhp->attr.va_fbo,
+				   mhp->attr.len,
+				   shift-12,
+				   page_list,
+				   &mhp->attr.pbl_size, &mhp->attr.pbl_addr))
+		return -ENOMEM;
+	mhp->attr.state = 1;
+	mhp->attr.stag = stag;
+	mmid = stag >> 8;
+	mhp->ibmr.rkey = mhp->ibmr.lkey = stag;
+	insert_handle(rhp, &rhp->mmidr, mhp, mmid);
+	PDBG("%s mmid 0x%x mhp %p\n", __FUNCTION__, mmid, mhp);
+	return 0;
+}
+
+int build_phys_page_list(struct ib_phys_buf *buffer_list,
+					int num_phys_buf,
+					u64 *iova_start,
+					u64 *total_size,
+					int *npages,
+					int *shift,
+					__be64 **page_list)
+{
+	u64 mask;
+	int i, j, n;
+
+	mask = 0;
+	*total_size = 0;
+	for (i = 0; i < num_phys_buf; ++i) {
+		if (i != 0 && buffer_list[i].addr & ~PAGE_MASK)
+			return -EINVAL;
+		if (i != 0 && i != num_phys_buf - 1 &&
+		    (buffer_list[i].size & ~PAGE_MASK))
+			return -EINVAL;
+		*total_size += buffer_list[i].size;
+		if (i > 0)
+			mask |= buffer_list[i].addr;
+	}
+
+	if (*total_size > 0xFFFFFFFFULL)
+		return -ENOMEM;
+
+	/* Find largest page shift we can use to cover buffers */
+	for (*shift = PAGE_SHIFT; *shift < 27; ++(*shift))
+		if (num_phys_buf > 1) {
+			if ((1ULL << *shift) & mask)
+				break;
+		} else
+			if (1ULL << *shift >=
+			    buffer_list[0].size +
+			    (buffer_list[0].addr & ((1ULL << *shift) - 1)))
+				break;
+
+	buffer_list[0].size += buffer_list[0].addr & ((1ULL << *shift) - 1);
+	buffer_list[0].addr &= ~0ull << *shift;
+
+	*npages = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		*npages += (buffer_list[i].size +
+			(1ULL << *shift) - 1) >> *shift;
+
+	if (!*npages)
+		return -EINVAL;
+
+	*page_list = kmalloc(sizeof(u64) * *npages, GFP_KERNEL);
+	if (!*page_list)
+		return -ENOMEM;
+
+	n = 0;
+	for (i = 0; i < num_phys_buf; ++i)
+		for (j = 0;
+		     j < (buffer_list[i].size + (1ULL << *shift) - 1) >> *shift;
+		     ++j)
+			(*page_list)[n++] = cpu_to_be64(buffer_list[i].addr +
+			    ((u64) j << *shift));
+
+	PDBG("%s va 0x%llx mask 0x%llx shift %d len %lld pbl_size %d\n",
+	     __FUNCTION__, *iova_start, mask, *shift, *total_size, *npages);
+
+	return 0;
+
+}


From swise at opengridcomputing.com  Wed Dec 20 11:22:25 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:22:25 -0600
Subject: [openib-general] [PATCH  v5 09/13] iw_cxgb3 Core WQE/CQE Types
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192225.19316.33284.stgit@dell3.ogc.int>


T3 WQE and CQE structures, defines, etc...

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
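
To make the queue and CQE macros in this header easier to review, here is
a small user-space sketch of the same arithmetic written as functions.
The function names are hypothetical; the shifts and masks are copied from
the S_CQE_* / M_CQE_* defines in the patch, and the header word is assumed
to be already converted to host byte order.

```c
#include <stdint.h>

/* Free-running queue pointer to ring index (mirrors Q_PTR2IDX). */
static uint32_t q_ptr2idx(uint32_t ptr, uint32_t size_log2)
{
	return ptr & ((1U << size_log2) - 1);
}

/*
 * Generation bit expected at a given pointer (mirrors Q_GENBIT):
 * 1 on the first pass through the ring, flipping on every wrap.
 */
static uint32_t q_genbit(uint32_t ptr, uint32_t size_log2)
{
	return !((ptr >> size_log2) & 0x1);
}

/* Entries outstanding; unsigned subtraction handles pointer wraparound. */
static uint32_t q_count(uint32_t rptr, uint32_t wptr)
{
	return wptr - rptr;
}

/* CQE header fields, using the shift/mask values from the patch. */
static uint32_t cqe_qpid(uint32_t hdr)   { return (hdr >> 12) & 0x7FFFF; }
static uint32_t cqe_genbit(uint32_t hdr) { return (hdr >> 10) & 0x1; }
static uint32_t cqe_status(uint32_t hdr) { return (hdr >> 5) & 0x1F; }
static uint32_t cqe_type(uint32_t hdr)   { return (hdr >> 4) & 0x1; }
static uint32_t cqe_opcode(uint32_t hdr) { return hdr & 0xF; }
```

A CQE is treated as valid when cqe_genbit() of its header equals
q_genbit() of the consumer's read pointer, which is what CQ_VLD_ENTRY
checks.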

 drivers/infiniband/hw/cxgb3/core/cxio_wr.h |  685 ++++++++++++++++++++++++++++
 1 files changed, 685 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_wr.h b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
new file mode 100644
index 0000000..234a084
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_wr.h
@@ -0,0 +1,685 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_WR_H__
+#define __CXIO_WR_H__
+
+#include <asm/io.h>
+#include <linux/pci.h>
+#include <linux/timer.h>
+#include "firmware_exports.h"
+
+#define T3_MAX_SGE      4
+
+#define Q_EMPTY(rptr,wptr) ((rptr)==(wptr))
+#define Q_FULL(rptr,wptr,size_log2)  ( (((wptr)-(rptr))>>(size_log2)) && \
+				       ((rptr)!=(wptr)) )
+#define Q_GENBIT(ptr,size_log2) (!(((ptr)>>(size_log2))&0x1))
+#define Q_FREECNT(rptr,wptr,size_log2) ((1UL<<(size_log2))-((wptr)-(rptr)))
+#define Q_COUNT(rptr,wptr) ((wptr)-(rptr))
+#define Q_PTR2IDX(ptr,size_log2) ((ptr) & ((1UL<<(size_log2))-1))
+
+static inline void ring_doorbell(void __iomem *doorbell, u32 qpid)
+{
+	writel(((1U<<31) | qpid), doorbell);
+}
+
+#define SEQ32_GE(x,y) (!( (((u32) (x)) - ((u32) (y))) & 0x80000000 ))
+
+enum t3_wr_flags {
+	T3_COMPLETION_FLAG = 0x01,
+	T3_NOTIFY_FLAG = 0x02,
+	T3_SOLICITED_EVENT_FLAG = 0x04,
+	T3_READ_FENCE_FLAG = 0x08,
+	T3_LOCAL_FENCE_FLAG = 0x10
+} __attribute__ ((packed));
+
+enum t3_wr_opcode {
+	T3_WR_BP = FW_WROPCODE_RI_BYPASS,
+	T3_WR_SEND = FW_WROPCODE_RI_SEND,
+	T3_WR_WRITE = FW_WROPCODE_RI_RDMA_WRITE,
+	T3_WR_READ = FW_WROPCODE_RI_RDMA_READ,
+	T3_WR_INV_STAG = FW_WROPCODE_RI_LOCAL_INV,
+	T3_WR_BIND = FW_WROPCODE_RI_BIND_MW,
+	T3_WR_RCV = FW_WROPCODE_RI_RECEIVE,
+	T3_WR_INIT = FW_WROPCODE_RI_RDMA_INIT,
+	T3_WR_QP_MOD = FW_WROPCODE_RI_MODIFY_QP
+} __attribute__ ((packed));
+
+enum t3_rdma_opcode {
+	T3_RDMA_WRITE,		/* IETF RDMAP v1.0 ... */
+	T3_READ_REQ,
+	T3_READ_RESP,
+	T3_SEND,
+	T3_SEND_WITH_INV,
+	T3_SEND_WITH_SE,
+	T3_SEND_WITH_SE_INV,
+	T3_TERMINATE,
+	T3_RDMA_INIT,		/* CHELSIO RI specific ... */
+	T3_BIND_MW,
+	T3_FAST_REGISTER,
+	T3_LOCAL_INV,
+	T3_QP_MOD,
+	T3_BYPASS
+} __attribute__ ((packed));
+
+static inline enum t3_rdma_opcode wr2opcode(enum t3_wr_opcode wrop)
+{
+	switch (wrop) {
+		case T3_WR_BP: return T3_BYPASS;
+		case T3_WR_SEND: return T3_SEND;
+		case T3_WR_WRITE: return T3_RDMA_WRITE;
+		case T3_WR_READ: return T3_READ_REQ;
+		case T3_WR_INV_STAG: return T3_LOCAL_INV;
+		case T3_WR_BIND: return T3_BIND_MW;
+		case T3_WR_INIT: return T3_RDMA_INIT;
+		case T3_WR_QP_MOD: return T3_QP_MOD;
+		default: break;
+	}
+	return -1;
+}
+
+
+/* Work request id */
+union t3_wrid {
+	struct {
+		u32 hi;
+		u32 low;
+	} id0;
+	u64 id1;
+};
+
+#define WRID(wrid)      	(wrid.id1)
+#define WRID_GEN(wrid)		(wrid.id0.wr_gen)
+#define WRID_IDX(wrid)		(wrid.id0.wr_idx)
+#define WRID_LO(wrid)		(wrid.id0.wr_lo)
+
+struct fw_riwrh {
+	__be32 op_seop_flags;
+	__be32 gen_tid_len;
+};
+
+#define S_FW_RIWR_OP		24
+#define M_FW_RIWR_OP		0xff
+#define V_FW_RIWR_OP(x)		((x) << S_FW_RIWR_OP)
+#define G_FW_RIWR_OP(x)   	((((x) >> S_FW_RIWR_OP)) & M_FW_RIWR_OP)
+
+#define S_FW_RIWR_SOPEOP	22
+#define M_FW_RIWR_SOPEOP	0x3
+#define V_FW_RIWR_SOPEOP(x)	((x) << S_FW_RIWR_SOPEOP)
+
+#define S_FW_RIWR_FLAGS		8
+#define M_FW_RIWR_FLAGS		0x3fffff
+#define V_FW_RIWR_FLAGS(x)	((x) << S_FW_RIWR_FLAGS)
+#define G_FW_RIWR_FLAGS(x)   	((((x) >> S_FW_RIWR_FLAGS)) & M_FW_RIWR_FLAGS)
+
+#define S_FW_RIWR_TID		8
+#define V_FW_RIWR_TID(x)	((x) << S_FW_RIWR_TID)
+
+#define S_FW_RIWR_LEN		0
+#define V_FW_RIWR_LEN(x)	((x) << S_FW_RIWR_LEN)
+
+#define S_FW_RIWR_GEN           31
+#define V_FW_RIWR_GEN(x)        ((x)  << S_FW_RIWR_GEN)
+
+struct t3_sge {
+	__be32 stag;
+	__be32 len;
+	__be64 to;
+};
+
+/* If num_sgle is zero, flit 5+ contains immediate data.*/
+struct t3_send_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;	
+	__be32 plen;		/* 3 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 4+ */
+};
+
+struct t3_local_inv_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 stag;		/* 2 */
+	__be32 reserved3;
+};
+
+struct t3_rdma_write_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 stag_sink;
+	__be64 to_sink;		/* 3 */
+	__be32 plen;		/* 4 */
+	__be32 num_sgle;
+	struct t3_sge sgl[T3_MAX_SGE];	/* 5+ */
+};
+
+struct t3_rdma_read_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 rdmaop;		/* 2 */
+	u8 reserved[3];
+	__be32 rem_stag;
+	__be64 rem_to;		/* 3 */
+	__be32 local_stag;	/* 4 */
+	__be32 local_len;
+	__be64 local_to;	/* 5 */
+};
+
+enum t3_addr_type {
+	T3_VA_BASED_TO = 0x0,
+	T3_ZERO_BASED_TO = 0x1
+} __attribute__ ((packed));
+
+enum t3_mem_perms {
+	T3_MEM_ACCESS_LOCAL_READ = 0x1,
+	T3_MEM_ACCESS_LOCAL_WRITE = 0x2,
+	T3_MEM_ACCESS_REM_READ = 0x4,
+	T3_MEM_ACCESS_REM_WRITE = 0x8
+} __attribute__ ((packed));
+
+struct t3_bind_mw_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u16 reserved;		/* 2 */
+	u8 type;
+	u8 perms;
+	__be32 mr_stag;
+	__be32 mw_stag;		/* 3 */
+	__be32 mw_len;
+	__be64 mw_va;		/* 4 */
+	__be32 mr_pbl_addr;	/* 5 */
+	u8 reserved2[3];
+	u8 mr_pagesz;
+};
+
+struct t3_receive_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	u8 pagesz[T3_MAX_SGE];
+	__be32 num_sgle;		/* 2 */
+	struct t3_sge sgl[T3_MAX_SGE];	/* 3+ */
+	__be32 pbl_addr[T3_MAX_SGE];
+};
+
+struct t3_bypass_wr {
+	struct fw_riwrh wrh;
+	union t3_wrid wrid;	/* 1 */
+};
+
+struct t3_modify_qp_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 flags;		/* 2 */
+	__be32 quiesce;		/* 2 */
+	__be32 max_ird;		/* 3 */
+	__be32 max_ord;		/* 3 */
+	__be64 sge_cmd;		/* 4 */
+	__be64 ctx1;		/* 5 */
+	__be64 ctx0;		/* 6 */
+};
+
+enum t3_modify_qp_flags {
+	MODQP_QUIESCE  = 0x01,
+	MODQP_MAX_IRD  = 0x02,
+	MODQP_MAX_ORD  = 0x04,
+	MODQP_WRITE_EC = 0x08,
+	MODQP_READ_EC  = 0x10,
+};
+
+
+enum t3_mpa_attrs {
+	uP_RI_MPA_RX_MARKER_ENABLE = 0x1,
+	uP_RI_MPA_TX_MARKER_ENABLE = 0x2,
+	uP_RI_MPA_CRC_ENABLE = 0x4,
+	uP_RI_MPA_IETF_ENABLE = 0x8
+} __attribute__ ((packed));
+
+enum t3_qp_caps {
+	uP_RI_QP_RDMA_READ_ENABLE = 0x01,
+	uP_RI_QP_RDMA_WRITE_ENABLE = 0x02,
+	uP_RI_QP_BIND_ENABLE = 0x04,
+	uP_RI_QP_FAST_REGISTER_ENABLE = 0x08,
+	uP_RI_QP_STAG0_ENABLE = 0x10
+} __attribute__ ((packed));
+
+struct t3_rdma_init_attr {
+	u32 tid;
+	u32 qpid;
+	u32 pdid;
+	u32 scqid;
+	u32 rcqid;
+	u32 rq_addr;
+	u32 rq_size;
+	enum t3_mpa_attrs mpaattrs;
+	enum t3_qp_caps qpcaps;
+	u16 tcp_emss;
+	u32 ord;
+	u32 ird;
+	u64 qp_dma_addr;
+	u32 qp_dma_size;
+	u32 flags;
+};
+
+struct t3_rdma_init_wr {
+	struct fw_riwrh wrh;	/* 0 */
+	union t3_wrid wrid;	/* 1 */
+	__be32 qpid;		/* 2 */
+	__be32 pdid;
+	__be32 scqid;		/* 3 */
+	__be32 rcqid;
+	__be32 rq_addr;		/* 4 */
+	__be32 rq_size;
+	u8 mpaattrs;		/* 5 */
+	u8 qpcaps;
+	__be16 ulpdu_size;
+	__be32 flags;		/* bits 31-1 - reserved */
+				/* bit     0 - set if RECV posted */
+	__be32 ord;		/* 6 */
+	__be32 ird;
+	__be64 qp_dma_addr;	/* 7 */
+	__be32 qp_dma_size;	/* 8 */
+	u32 rsvd;
+};
+
+struct t3_genbit {
+	u64 flit[15];
+	__be64 genbit;
+};
+
+enum rdma_init_wr_flags {
+	RECVS_POSTED = 1,
+};
+
+union t3_wr {
+	struct t3_send_wr send;
+	struct t3_rdma_write_wr write;
+	struct t3_rdma_read_wr read;
+	struct t3_receive_wr recv;
+	struct t3_local_inv_wr local_inv;
+	struct t3_bind_mw_wr bind;
+	struct t3_bypass_wr bypass;
+	struct t3_rdma_init_wr init;
+	struct t3_modify_qp_wr qp_mod;
+	struct t3_genbit genbit;
+	u64 flit[16];
+};
+
+#define T3_SQ_CQE_FLIT 	  13
+#define T3_SQ_COOKIE_FLIT 14
+
+#define T3_RQ_COOKIE_FLIT 13
+#define T3_RQ_CQE_FLIT 	  14
+
+static inline enum t3_wr_opcode fw_riwrh_opcode(struct fw_riwrh *wqe)
+{
+	return G_FW_RIWR_OP(be32_to_cpu(wqe->op_seop_flags));
+}
+
+static inline void build_fw_riwrh(struct fw_riwrh *wqe, enum t3_wr_opcode op,
+				  enum t3_wr_flags flags, u8 genbit, u32 tid,
+				  u8 len)
+{
+	wqe->op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(op) |
+					 V_FW_RIWR_SOPEOP(M_FW_RIWR_SOPEOP) |
+					 V_FW_RIWR_FLAGS(flags));
+	wmb();
+	wqe->gen_tid_len = cpu_to_be32(V_FW_RIWR_GEN(genbit) |
+				       V_FW_RIWR_TID(tid) |
+				       V_FW_RIWR_LEN(len));
+	/* 2nd gen bit... */
+	((union t3_wr *)wqe)->genbit.genbit = cpu_to_be64(genbit);
+}
+
+/*
+ * T3 ULP2_TX commands
+ */
+enum t3_utx_mem_op {
+	T3_UTX_MEM_READ = 2,
+	T3_UTX_MEM_WRITE = 3
+};
+
+/* T3 MC7 RDMA TPT entry format */
+
+enum tpt_mem_type {
+	TPT_NON_SHARED_MR = 0x0,
+	TPT_SHARED_MR = 0x1,
+	TPT_MW = 0x2,
+	TPT_MW_RELAXED_PROTECTION = 0x3
+};
+
+enum tpt_addr_type {
+	TPT_ZBTO = 0,
+	TPT_VATO = 1
+};
+
+enum tpt_mem_perm {
+	TPT_LOCAL_READ = 0x8,
+	TPT_LOCAL_WRITE = 0x4,
+	TPT_REMOTE_READ = 0x2,
+	TPT_REMOTE_WRITE = 0x1
+};
+
+struct tpt_entry {
+	__be32 valid_stag_pdid;
+	__be32 flags_pagesize_qpid;
+
+	__be32 rsvd_pbl_addr;
+	__be32 len;
+	__be32 va_hi;
+	__be32 va_low_or_fbo;
+
+	__be32 rsvd_bind_cnt_or_pstag;
+	__be32 rsvd_pbl_size;
+};
+
+#define S_TPT_VALID		31
+#define V_TPT_VALID(x)		((x) << S_TPT_VALID)
+#define F_TPT_VALID		V_TPT_VALID(1U)
+
+#define S_TPT_STAG_KEY		23
+#define M_TPT_STAG_KEY		0xFF
+#define V_TPT_STAG_KEY(x)	((x) << S_TPT_STAG_KEY)
+#define G_TPT_STAG_KEY(x)	(((x) >> S_TPT_STAG_KEY) & M_TPT_STAG_KEY)
+
+#define S_TPT_STAG_STATE	22
+#define V_TPT_STAG_STATE(x)	((x) << S_TPT_STAG_STATE)
+#define F_TPT_STAG_STATE	V_TPT_STAG_STATE(1U)
+
+#define S_TPT_STAG_TYPE		20
+#define M_TPT_STAG_TYPE		0x3
+#define V_TPT_STAG_TYPE(x)	((x) << S_TPT_STAG_TYPE)
+#define G_TPT_STAG_TYPE(x)	(((x) >> S_TPT_STAG_TYPE) & M_TPT_STAG_TYPE)
+
+#define S_TPT_PDID		0
+#define M_TPT_PDID		0xFFFFF
+#define V_TPT_PDID(x)		((x) << S_TPT_PDID)
+#define G_TPT_PDID(x)		(((x) >> S_TPT_PDID) & M_TPT_PDID)
+
+#define S_TPT_PERM		28
+#define M_TPT_PERM		0xF
+#define V_TPT_PERM(x)		((x) << S_TPT_PERM)
+#define G_TPT_PERM(x)		(((x) >> S_TPT_PERM) & M_TPT_PERM)
+
+#define S_TPT_REM_INV_DIS	27
+#define V_TPT_REM_INV_DIS(x)	((x) << S_TPT_REM_INV_DIS)
+#define F_TPT_REM_INV_DIS	V_TPT_REM_INV_DIS(1U)
+
+#define S_TPT_ADDR_TYPE		26
+#define V_TPT_ADDR_TYPE(x)	((x) << S_TPT_ADDR_TYPE)
+#define F_TPT_ADDR_TYPE		V_TPT_ADDR_TYPE(1U)
+
+#define S_TPT_MW_BIND_ENABLE	25
+#define V_TPT_MW_BIND_ENABLE(x)	((x) << S_TPT_MW_BIND_ENABLE)
+#define F_TPT_MW_BIND_ENABLE    V_TPT_MW_BIND_ENABLE(1U)
+
+#define S_TPT_PAGE_SIZE		20
+#define M_TPT_PAGE_SIZE		0x1F
+#define V_TPT_PAGE_SIZE(x)	((x) << S_TPT_PAGE_SIZE)
+#define G_TPT_PAGE_SIZE(x)	(((x) >> S_TPT_PAGE_SIZE) & M_TPT_PAGE_SIZE)
+
+#define S_TPT_PBL_ADDR		0
+#define M_TPT_PBL_ADDR		0x1FFFFFFF
+#define V_TPT_PBL_ADDR(x)	((x) << S_TPT_PBL_ADDR)
+#define G_TPT_PBL_ADDR(x)       (((x) >> S_TPT_PBL_ADDR) & M_TPT_PBL_ADDR)
+
+#define S_TPT_QPID		0
+#define M_TPT_QPID		0xFFFFF
+#define V_TPT_QPID(x)		((x) << S_TPT_QPID)
+#define G_TPT_QPID(x)		(((x) >> S_TPT_QPID) & M_TPT_QPID)
+
+#define S_TPT_PSTAG		0
+#define M_TPT_PSTAG		0xFFFFFF
+#define V_TPT_PSTAG(x)		((x) << S_TPT_PSTAG)
+#define G_TPT_PSTAG(x)		(((x) >> S_TPT_PSTAG) & M_TPT_PSTAG)
+
+#define S_TPT_PBL_SIZE		0
+#define M_TPT_PBL_SIZE		0xFFFFF
+#define V_TPT_PBL_SIZE(x)	((x) << S_TPT_PBL_SIZE)
+#define G_TPT_PBL_SIZE(x)	(((x) >> S_TPT_PBL_SIZE) & M_TPT_PBL_SIZE)
+
+/*
+ * CQE defs
+ */
+struct t3_cqe {
+	__be32 header;
+	__be32 len;
+	union {
+		struct {
+			__be32 stag;
+			__be32 msn;
+		} rcqe;
+		struct {
+			u32 wrid_hi;	
+			u32 wrid_low;
+		} scqe;
+	} u;
+};
+
+#define S_CQE_OOO	  31
+#define M_CQE_OOO	  0x1
+#define G_CQE_OOO(x)	  ((((x) >> S_CQE_OOO)) & M_CQE_OOO)
+#define V_CQE_OOO(x)	  ((x)<<S_CQE_OOO)
+
+#define S_CQE_QPID        12
+#define M_CQE_QPID        0x7FFFF
+#define G_CQE_QPID(x)     ((((x) >> S_CQE_QPID)) & M_CQE_QPID)
+#define V_CQE_QPID(x) 	  ((x)<<S_CQE_QPID)
+
+#define S_CQE_SWCQE       11
+#define M_CQE_SWCQE       0x1
+#define G_CQE_SWCQE(x)    ((((x) >> S_CQE_SWCQE)) & M_CQE_SWCQE)
+#define V_CQE_SWCQE(x) 	  ((x)<<S_CQE_SWCQE)
+
+#define S_CQE_GENBIT      10
+#define M_CQE_GENBIT      0x1
+#define G_CQE_GENBIT(x)   (((x) >> S_CQE_GENBIT) & M_CQE_GENBIT)
+#define V_CQE_GENBIT(x)	  ((x)<<S_CQE_GENBIT)
+
+#define S_CQE_STATUS      5
+#define M_CQE_STATUS      0x1F
+#define G_CQE_STATUS(x)   ((((x) >> S_CQE_STATUS)) & M_CQE_STATUS)
+#define V_CQE_STATUS(x)   ((x)<<S_CQE_STATUS)
+
+#define S_CQE_TYPE        4
+#define M_CQE_TYPE        0x1
+#define G_CQE_TYPE(x)     ((((x) >> S_CQE_TYPE)) & M_CQE_TYPE)
+#define V_CQE_TYPE(x)     ((x)<<S_CQE_TYPE)
+
+#define S_CQE_OPCODE      0
+#define M_CQE_OPCODE      0xF
+#define G_CQE_OPCODE(x)   ((((x) >> S_CQE_OPCODE)) & M_CQE_OPCODE)
+#define V_CQE_OPCODE(x)   ((x)<<S_CQE_OPCODE)
+
+#define SW_CQE(x)         (G_CQE_SWCQE(be32_to_cpu((x).header)))
+#define CQE_OOO(x)        (G_CQE_OOO(be32_to_cpu((x).header)))
+#define CQE_QPID(x)       (G_CQE_QPID(be32_to_cpu((x).header)))
+#define CQE_GENBIT(x)     (G_CQE_GENBIT(be32_to_cpu((x).header)))
+#define CQE_TYPE(x)       (G_CQE_TYPE(be32_to_cpu((x).header)))
+#define SQ_TYPE(x)	  (CQE_TYPE((x)))
+#define RQ_TYPE(x)	  (!CQE_TYPE((x)))
+#define CQE_STATUS(x)     (G_CQE_STATUS(be32_to_cpu((x).header)))
+#define CQE_OPCODE(x)     (G_CQE_OPCODE(be32_to_cpu((x).header)))
+
+#define CQE_LEN(x)        (be32_to_cpu((x).len))
+
+/* used for RQ completion processing */
+#define CQE_WRID_STAG(x)  (be32_to_cpu((x).u.rcqe.stag))
+#define CQE_WRID_MSN(x)   (be32_to_cpu((x).u.rcqe.msn))
+
+/* used for SQ completion processing */
+#define CQE_WRID_SQ_WPTR(x)	((x).u.scqe.wrid_hi)
+#define CQE_WRID_WPTR(x)   	((x).u.scqe.wrid_low)
+
+/* generic accessor macros */
+#define CQE_WRID_HI(x)		((x).u.scqe.wrid_hi)
+#define CQE_WRID_LOW(x) 	((x).u.scqe.wrid_low)
+
+#define TPT_ERR_SUCCESS                     0x0
+#define TPT_ERR_STAG                        0x1	 /* STAG invalid: either the */
+						 /* STAG is off-limit, being 0, */
+						 /* or STAG_key mismatch */
+#define TPT_ERR_PDID                        0x2	 /* PDID mismatch */
+#define TPT_ERR_QPID                        0x3	 /* QPID mismatch */
+#define TPT_ERR_ACCESS                      0x4	 /* Invalid access right */
+#define TPT_ERR_WRAP                        0x5	 /* Wrap error */
+#define TPT_ERR_BOUND                       0x6	 /* base and bounds violation */
+#define TPT_ERR_INVALIDATE_SHARED_MR        0x7	 /* attempt to invalidate a  */
+						 /* shared memory region */
+#define TPT_ERR_INVALIDATE_MR_WITH_MW_BOUND 0x8	 /* attempt to invalidate a  */
+						 /* MR with MWs bound to it */
+#define TPT_ERR_ECC                         0x9	 /* ECC error detected */
+#define TPT_ERR_ECC_PSTAG                   0xA	 /* ECC error detected when  */
+						 /* reading PSTAG for a MW  */
+						 /* Invalidate */
+#define TPT_ERR_PBL_ADDR_BOUND              0xB	 /* pbl addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_SWFLUSH			    0xC	 /* SW FLUSHED */
+#define TPT_ERR_CRC                         0x10 /* CRC error */
+#define TPT_ERR_MARKER                      0x11 /* Marker error */
+#define TPT_ERR_PDU_LEN_ERR                 0x12 /* invalid PDU length */
+#define TPT_ERR_OUT_OF_RQE                  0x13 /* out of RQE */
+#define TPT_ERR_DDP_VERSION                 0x14 /* wrong DDP version */
+#define TPT_ERR_RDMA_VERSION                0x15 /* wrong RDMA version */
+#define TPT_ERR_OPCODE                      0x16 /* invalid rdma opcode */
+#define TPT_ERR_DDP_QUEUE_NUM               0x17 /* invalid ddp queue number */
+#define TPT_ERR_MSN                         0x18 /* MSN error */
+#define TPT_ERR_TBIT                        0x19 /* tag bit not set correctly */
+#define TPT_ERR_MO                          0x1A /* MO not 0 for TERMINATE  */
+						 /* or READ_REQ */
+#define TPT_ERR_MSN_GAP                     0x1B
+#define TPT_ERR_MSN_RANGE                   0x1C
+#define TPT_ERR_IRD_OVERFLOW                0x1D
+#define TPT_ERR_RQE_ADDR_BOUND              0x1E /* RQE addr out of bounds:  */
+						 /* software error */
+#define TPT_ERR_INTERNAL_ERR                0x1F /* internal error (opcode  */
+						 /* mismatch) */
+
+struct t3_swsq {
+	__u64 			wr_id;
+	struct t3_cqe 		cqe;
+	__u32			sq_wptr;
+	__be32			read_len;
+	int 			opcode;
+	int			complete;
+	int			signaled;	
+};
+
+/*
+ * A T3 WQ implements both the SQ and RQ.
+ */
+struct t3_wq {
+	union t3_wr *queue;		/* DMA accessible memory */
+	dma_addr_t dma_addr;		/* DMA address for HW */
+	DECLARE_PCI_UNMAP_ADDR(mapping)	/* unmap cruft */
+	u32 error;			/* 1 once we go to ERROR */
+	u32 qpid;
+	u32 wptr;			/* idx to next available WR slot */
+	u32 size_log2;			/* total wq size */
+	struct t3_swsq *sq;		/* SW SQ */
+	struct t3_swsq *oldest_read;	/* tracks oldest pending read */
+	u32 sq_wptr;			/* sq_wptr - sq_rptr == count of */
+	u32 sq_rptr;			/* pending wrs */
+	u32 sq_size_log2;		/* sq size */
+	u64 *rq;			/* SW RQ (holds consumer wr_ids) */
+	u32 rq_wptr;			/* rq_wptr - rq_rptr == count of */
+	u32 rq_rptr;			/* pending wrs */
+	u64 *rq_oldest_wr;		/* oldest wr on the SW RQ */
+	u32 rq_size_log2;		/* rq size */
+	u32 rq_addr;			/* rq adapter address */
+	void __iomem *doorbell;		/* kernel db */
+	u64 udb;			/* user db if any */
+};
+
+struct t3_cq {
+	u32 cqid;
+	u32 rptr;
+	u32 wptr;
+	u32 size_log2;
+	dma_addr_t dma_addr;
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	struct t3_cqe *queue;
+	struct t3_cqe *sw_queue;
+	u32 sw_rptr;
+	u32 sw_wptr;
+};
+
+#define CQ_VLD_ENTRY(ptr,size_log2,cqe) (Q_GENBIT(ptr,size_log2) == \
+					 CQE_GENBIT(*cqe))
+
+static inline void cxio_set_wq_in_error(struct t3_wq *wq)
+{
+	wq->queue->flit[13] = 1;
+}
+
+static inline struct t3_cqe *cxio_next_hw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_sw_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	return NULL;
+}
+
+static inline struct t3_cqe *cxio_next_cqe(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe;
+
+	if (!Q_EMPTY(cq->sw_rptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(cq->sw_rptr, cq->size_log2));
+		return cqe;
+	}
+	cqe = cq->queue + (Q_PTR2IDX(cq->rptr, cq->size_log2));
+	if (CQ_VLD_ENTRY(cq->rptr, cq->size_log2, cqe))
+		return cqe;
+	return NULL;
+}
+
+#endif


From swise at opengridcomputing.com  Wed Dec 20 11:22:55 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:22:55 -0600
Subject: [openib-general] [PATCH  v5 10/13] iw_cxgb3 Core HAL
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192255.19316.19320.stgit@dell3.ogc.int>


The RDMA Core interfaces with the T3 HW and ULLD, providing a low-level
RDMA interface.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---
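
The rearm path in cxio_hal_cq_op() is the subtle part of this file, so
here is a small user-space sketch of just the pointer walk it performs
before spinning on the generation bit. The names last_inflight_ptr() and
expected_genbit() are hypothetical, and hw_idx stands for the index the
rearm operation returned; the sketch assumes hw_idx is smaller than the
ring size, otherwise the loop would not terminate.

```c
#include <stdint.h>

/*
 * Advance a free-running read pointer until the slot after it is the
 * index reported by the hardware.  The result addresses the last CQE
 * that may still be in flight.  Because the pointer is free-running,
 * walking past the end of the ring also flips the generation bit that
 * the caller must wait for.
 */
static uint32_t last_inflight_ptr(uint32_t rptr, uint32_t hw_idx,
				  uint32_t size_log2)
{
	uint32_t mask = (1U << size_log2) - 1;

	while (((rptr + 1) & mask) != hw_idx)
		rptr++;
	return rptr;
}

/* Generation bit a valid CQE must carry at this pointer (as Q_GENBIT). */
static uint32_t expected_genbit(uint32_t ptr, uint32_t size_log2)
{
	return !((ptr >> size_log2) & 0x1);
}
```

For an 8-entry ring with rptr 6 and a returned index of 2, the walk ends
at pointer 9 (slot 1), one wrap later, so the expected generation bit
changes from 1 to 0.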

 drivers/infiniband/hw/cxgb3/core/cxio_hal.c | 1302 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_hal.h |  201 ++++
 2 files changed, 1503 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.c b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
new file mode 100644
index 0000000..5e31816
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.c
@@ -0,0 +1,1302 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <asm/semaphore.h>
+#include <asm/delay.h>
+
+#include <linux/netdevice.h>
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/pci.h>
+
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+#include "cxgb3_offload.h"
+#include "sge_defs.h"
+
+static struct cxio_rdev *rdev_tbl[T3_MAX_NUM_RNIC];
+static cxio_hal_ev_callback_func_t cxio_ev_cb = NULL;
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_name(char *dev_name)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (!strcmp(rdev_tbl[i]->dev_name, dev_name))
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline struct cxio_rdev *cxio_hal_find_rdev_by_t3cdev(struct t3cdev
+							     *tdev)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i])
+			if (rdev_tbl[i]->t3cdev_p == tdev)
+				return rdev_tbl[i];
+	return NULL;
+}
+
+static inline int cxio_hal_add_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (!rdev_tbl[i]) {
+			rdev_tbl[i] = rdev_p;
+			break;
+		}
+	return (i == T3_MAX_NUM_RNIC);
+}
+
+static inline void cxio_hal_delete_rdev(struct cxio_rdev *rdev_p)
+{
+	int i;
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		if (rdev_tbl[i] == rdev_p) {
+			rdev_tbl[i] = NULL;
+			break;
+		}
+}
+
+int cxio_hal_cq_op(struct cxio_rdev *rdev_p, struct t3_cq *cq,
+		   enum t3_cq_opcode op, u32 credit)
+{
+	int ret;
+	struct t3_cqe *cqe;
+	u32 rptr;
+
+	struct rdma_cq_op setup;
+	setup.id = cq->cqid;
+	setup.credits = (op == CQ_CREDIT_UPDATE) ? credit : 0;
+	setup.op = op;
+	ret = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_OP, &setup);
+
+	if ((ret < 0) || (op == CQ_CREDIT_UPDATE))
+		return ret;
+
+	/*
+	 * If the rearm returned an index other than our current index,
+	 * then there might be CQE's in flight (being DMA'd).  We must wait
+	 * here for them to complete or the consumer can miss a notification.
+	 */
+	if (Q_PTR2IDX((cq->rptr), cq->size_log2) != ret) {
+		int i=0;
+
+		rptr = cq->rptr;
+
+		/*
+		 * Keep the generation correct by bumping rptr until it
+		 * matches the index returned by the rearm - 1.
+		 */
+		while (Q_PTR2IDX((rptr+1), cq->size_log2) != ret)
+			rptr++;
+
+		/*
+		 * Now rptr is the index for the (last) cqe that was
+		 * in-flight at the time the HW rearmed the CQ.  We
+		 * spin until that CQE is valid.
+		 */
+		cqe = cq->queue + Q_PTR2IDX(rptr, cq->size_log2);
+		while (!CQ_VLD_ENTRY(rptr, cq->size_log2, cqe)) {
+			udelay(1);
+			if (i++ > 1000000) {
+				printk(KERN_ERR "%s: stalled rnic\n",
+				       rdev_p->dev_name);
+				BUG_ON(1);
+				return -EIO;
+			}
+		}
+	}
+	return 0;
+}
+
+static inline int cxio_hal_clear_cq_ctx(struct cxio_rdev *rdev_p, u32 cqid)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cqid;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 0;		/* disable the CQ */
+	setup.credits = 0;
+	setup.credit_thres = 0;
+	setup.ovfl_mode = 0;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev_p, u32 qpid)
+{
+	u64 sge_cmd;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 3, 1, qpid, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = qpid << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+int cxio_create_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	int size = (1UL << (cq->size_log2)) * sizeof(struct t3_cqe);
+
+	cq->cqid = cxio_hal_get_cqid(rdev_p->rscp);
+	if (!cq->cqid)
+		return -ENOMEM;
+	cq->sw_queue = kzalloc(size, GFP_KERNEL);
+	if (!cq->sw_queue)
+		return -ENOMEM;
+	cq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     (1UL << (cq->size_log2)) *
+					     sizeof(struct t3_cqe),
+					     &(cq->dma_addr), GFP_KERNEL);
+	if (!cq->queue) {
+		kfree(cq->sw_queue);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(cq, mapping, cq->dma_addr);
+	memset(cq->queue, 0, size);
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = 65535;
+	setup.credit_thres = 1;
+	if (rdev_p->t3cdev_p->type == T3B)
+		setup.ovfl_mode = 0;
+	else
+		setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+int cxio_resize_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	struct rdma_cq_setup setup;
+	setup.id = cq->cqid;
+	setup.base_addr = (u64) (cq->dma_addr);
+	setup.size = 1UL << cq->size_log2;
+	setup.credits = setup.size;
+	setup.credit_thres = setup.size;	/* TBD: overflow recovery */
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static u32 get_qpid(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	u32 qpid;
+	int i;
+
+	mutex_lock(&uctx->lock);
+	if (!list_empty(&uctx->qpids)) {
+		entry = list_entry(uctx->qpids.next, struct cxio_qpid_list,
+				   entry);
+		list_del(&entry->entry);
+		qpid = entry->qpid;
+		kfree(entry);
+	} else {
+		qpid = cxio_hal_get_qpid(rdev_p->rscp);
+		if (!qpid)
+			goto out;
+		for (i = qpid+1; i & rdev_p->qpmask; i++) {
+			entry = kmalloc(sizeof *entry, GFP_KERNEL);
+			if (!entry)
+				break;
+			entry->qpid = i;
+			list_add_tail(&entry->entry, &uctx->qpids);
+		}
+	}
+out:
+	mutex_unlock(&uctx->lock);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+static void put_qpid(struct cxio_rdev *rdev_p, u32 qpid,
+		     struct cxio_ucontext *uctx)
+{
+	struct cxio_qpid_list *entry;
+	
+	entry = kmalloc(sizeof *entry, GFP_KERNEL);
+	if (!entry)
+		return;
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	entry->qpid = qpid;
+	mutex_lock(&uctx->lock);
+	list_add_tail(&entry->entry, &uctx->qpids);
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_release_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	struct list_head *pos, *nxt;
+	struct cxio_qpid_list *entry;
+
+	mutex_lock(&uctx->lock);
+	list_for_each_safe(pos, nxt, &uctx->qpids) {
+		entry = list_entry(pos, struct cxio_qpid_list, entry);
+		list_del_init(&entry->entry);
+		if (!(entry->qpid & rdev_p->qpmask))
+			cxio_hal_put_qpid(rdev_p->rscp, entry->qpid);
+		kfree(entry);
+	}
+	mutex_unlock(&uctx->lock);
+}
+
+void cxio_init_ucontext(struct cxio_rdev *rdev_p, struct cxio_ucontext *uctx)
+{
+	INIT_LIST_HEAD(&uctx->qpids);
+	mutex_init(&uctx->lock);
+}
+
+int cxio_create_qp(struct cxio_rdev *rdev_p, u32 kernel_domain,
+		   struct t3_wq *wq, struct cxio_ucontext *uctx)
+{
+	int depth = 1UL << wq->size_log2;
+	int rqsize = 1UL << wq->rq_size_log2;
+
+	wq->qpid = get_qpid(rdev_p, uctx);
+	if (!wq->qpid)
+		return -ENOMEM;
+
+	wq->rq = kzalloc(depth * sizeof(u64), GFP_KERNEL);
+	if (!wq->rq)
+		goto err1;
+
+	wq->rq_addr = cxio_hal_rqtpool_alloc(rdev_p, rqsize);
+	if (!wq->rq_addr)
+		goto err2;
+
+	wq->sq = kzalloc(depth * sizeof(struct t3_swsq), GFP_KERNEL);
+	if (!wq->sq)
+		goto err3;
+	
+	wq->queue = dma_alloc_coherent(&(rdev_p->rnic_info.pdev->dev),
+					     depth * sizeof(union t3_wr),
+					     &(wq->dma_addr), GFP_KERNEL);
+	if (!wq->queue)
+		goto err4;
+
+	memset(wq->queue, 0, depth * sizeof(union t3_wr));
+	pci_unmap_addr_set(wq, mapping, wq->dma_addr);
+	wq->doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	if (!kernel_domain)
+		wq->udb = (u64)rdev_p->rnic_info.udbell_physbase +
+					(wq->qpid << rdev_p->qpshift);
+	PDBG("%s qpid 0x%x doorbell 0x%p udb 0x%llx\n", __FUNCTION__,
+	     wq->qpid, wq->doorbell, wq->udb);
+	return 0;
+err4:
+	kfree(wq->sq);
+err3:
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, rqsize);
+err2:
+	kfree(wq->rq);
+err1:
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return -ENOMEM;
+}
+
+int cxio_destroy_cq(struct cxio_rdev *rdev_p, struct t3_cq *cq)
+{
+	int err;
+	err = cxio_hal_clear_cq_ctx(rdev_p, cq->cqid);
+	kfree(cq->sw_queue);
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (cq->size_log2))
+			  * sizeof(struct t3_cqe), cq->queue,
+			  pci_unmap_addr(cq, mapping));
+	cxio_hal_put_cqid(rdev_p->rscp, cq->cqid);
+	return err;
+}
+
+int cxio_destroy_qp(struct cxio_rdev *rdev_p, struct t3_wq *wq,
+		    struct cxio_ucontext *uctx)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << (wq->size_log2))
+			  * sizeof(union t3_wr), wq->queue,
+			  pci_unmap_addr(wq, mapping));
+	kfree(wq->sq);
+	cxio_hal_rqtpool_free(rdev_p, wq->rq_addr, (1UL << wq->rq_size_log2));
+	kfree(wq->rq);
+	put_qpid(rdev_p, wq->qpid, uctx);
+	return 0;
+}
+
+static void insert_recv_cqe(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__,
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) |
+			         V_CQE_OPCODE(T3_SEND) |
+		         	 V_CQE_TYPE(0) |
+		         	 V_CQE_SWCQE(1) |
+		         	 V_CQE_QPID(wq->qpid) |
+		         	 V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr,
+						       cq->size_log2)));
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	u32 ptr;
+
+	PDBG("%s wq %p cq %p\n", __FUNCTION__, wq, cq);
+
+	/* flush RQ */
+	PDBG("%s rq_rptr %u rq_wptr %u skip count %u\n", __FUNCTION__,
+	    wq->rq_rptr, wq->rq_wptr, count);
+	ptr = wq->rq_rptr + count;
+	while (ptr++ != wq->rq_wptr)
+		insert_recv_cqe(wq, cq);
+}
+
+static void insert_sq_cqe(struct t3_wq *wq, struct t3_cq *cq,
+		          struct t3_swsq *sqp)
+{
+	struct t3_cqe cqe;
+
+	PDBG("%s wq %p cq %p sw_rptr 0x%x sw_wptr 0x%x\n", __FUNCTION__,
+	     wq, cq, cq->sw_rptr, cq->sw_wptr);
+	memset(&cqe, 0, sizeof(cqe));
+	cqe.header = cpu_to_be32(V_CQE_STATUS(TPT_ERR_SWFLUSH) |
+			         V_CQE_OPCODE(sqp->opcode) |
+			         V_CQE_TYPE(1) |
+			         V_CQE_SWCQE(1) |
+			         V_CQE_QPID(wq->qpid) |
+			         V_CQE_GENBIT(Q_GENBIT(cq->sw_wptr,
+						       cq->size_log2)));
+	cqe.u.scqe.wrid_hi = sqp->sq_wptr;
+
+	*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2)) = cqe;
+	cq->sw_wptr++;
+}
+
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count)
+{
+	__u32 ptr;
+	struct t3_swsq *sqp = wq->sq + Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2);
+
+	ptr = wq->sq_rptr + count;
+	sqp += count;
+	while (ptr != wq->sq_wptr) {
+		insert_sq_cqe(wq, cq, sqp);
+		sqp++;
+		ptr++;
+	}
+}
+
+/*
+ * Move all CQEs from the HWCQ into the SWCQ.
+ */
+void cxio_flush_hw_cq(struct t3_cq *cq)
+{
+	struct t3_cqe *cqe, *swcqe;
+
+	PDBG("%s cq %p cqid 0x%x\n", __FUNCTION__, cq, cq->cqid);
+	cqe = cxio_next_hw_cqe(cq);
+	while (cqe) {
+		PDBG("%s flushing hwcq rptr 0x%x to swcq wptr 0x%x\n",
+		     __FUNCTION__, cq->rptr, cq->sw_wptr);
+		swcqe = cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2);
+		*swcqe = *cqe;
+		swcqe->header |= cpu_to_be32(V_CQE_SWCQE(1));
+		cq->sw_wptr++;
+		cq->rptr++;
+		cqe = cxio_next_hw_cqe(cq);
+	}
+}
+
+static inline int cqe_completes_wr(struct t3_cqe *cqe, struct t3_wq *wq)
+{
+	if (CQE_OPCODE(*cqe) == T3_TERMINATE)
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_RDMA_WRITE) && RQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_READ_RESP) && SQ_TYPE(*cqe))
+		return 0;
+
+	if ((CQE_OPCODE(*cqe) == T3_SEND) && RQ_TYPE(*cqe) &&
+	    Q_EMPTY(wq->rq_rptr, wq->rq_wptr))
+		return 0;
+
+	return 1;
+}
+
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if ((SQ_TYPE(*cqe) || (CQE_OPCODE(*cqe) == T3_READ_RESP)) &&
+		    (CQE_QPID(*cqe) == wq->qpid))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count)
+{
+	struct t3_cqe *cqe;
+	u32 ptr;
+
+	*count = 0;
+	PDBG("%s count zero %d\n", __FUNCTION__, *count);
+	ptr = cq->sw_rptr;
+	while (!Q_EMPTY(ptr, cq->sw_wptr)) {
+		cqe = cq->sw_queue + (Q_PTR2IDX(ptr, cq->size_log2));
+		if (RQ_TYPE(*cqe) && (CQE_OPCODE(*cqe) != T3_READ_RESP) &&
+		    (CQE_QPID(*cqe) == wq->qpid) && cqe_completes_wr(cqe, wq))
+			(*count)++;
+		ptr++;
+	}	
+	PDBG("%s cq %p count %d\n", __FUNCTION__, cq, *count);
+}
+
+static int cxio_hal_init_ctrl_cq(struct cxio_rdev *rdev_p)
+{
+	struct rdma_cq_setup setup;
+	setup.id = 0;
+	setup.base_addr = 0;	/* NULL address */
+	setup.size = 1;		/* enable the CQ */
+	setup.credits = 0;
+
+	/* force SGE to redirect to RspQ and interrupt */
+	setup.credit_thres = 0;	
+	setup.ovfl_mode = 1;
+	return (rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_CQ_SETUP, &setup));
+}
+
+static int cxio_hal_init_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	int err;
+	u64 sge_cmd, ctx0, ctx1;
+	u64 base_addr;
+	struct t3_modify_qp_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_KERNEL);
+
+
+	if (!skb) {
+		PDBG("%s alloc_skb failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	err = cxio_hal_init_ctrl_cq(rdev_p);
+	if (err) {
+		PDBG("%s err %d initializing ctrl_cq\n", __FUNCTION__, err);
+		return err;
+	}
+	rdev_p->ctrl_qp.workq = dma_alloc_coherent(
+					&(rdev_p->rnic_info.pdev->dev),
+					(1 << T3_CTRL_QP_SIZE_LOG2) *
+					sizeof(union t3_wr),
+					&(rdev_p->ctrl_qp.dma_addr),
+					GFP_KERNEL);
+	if (!rdev_p->ctrl_qp.workq) {
+		PDBG("%s dma_alloc_coherent failed\n", __FUNCTION__);
+		return -ENOMEM;
+	}
+	pci_unmap_addr_set(&rdev_p->ctrl_qp, mapping,
+			   rdev_p->ctrl_qp.dma_addr);
+	rdev_p->ctrl_qp.doorbell = (void __iomem *)rdev_p->rnic_info.kdb_addr;
+	memset(rdev_p->ctrl_qp.workq, 0,
+	       (1 << T3_CTRL_QP_SIZE_LOG2) * sizeof(union t3_wr));
+
+	init_MUTEX(&rdev_p->ctrl_qp.sem);
+	init_waitqueue_head(&rdev_p->ctrl_qp.waitq);
+
+	/* update HW Ctrl QP context */
+	base_addr = rdev_p->ctrl_qp.dma_addr;
+	base_addr >>= 12;
+	ctx0 = (V_EC_SIZE((1 << T3_CTRL_QP_SIZE_LOG2)) |
+		V_EC_BASE_LO((u32) base_addr & 0xffff));
+	ctx0 <<= 32;
+	ctx0 |= V_EC_CREDITS(FW_WR_NUM);
+	base_addr >>= 16;
+	ctx1 = (u32) base_addr;
+	base_addr >>= 32;
+	ctx1 |= ((u64) (V_EC_BASE_HI((u32) base_addr & 0xf) | V_EC_RESPQ(0) |
+			V_EC_TYPE(0) | V_EC_GEN(1) |
+			V_EC_UP_TOKEN(T3_CTL_QP_TID) | F_EC_VALID)) << 32;
+	wqe = (struct t3_modify_qp_wr *) skb_put(skb, sizeof(*wqe));
+	memset(wqe, 0, sizeof(*wqe));
+	build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_QP_MOD, 0, 1,
+		       T3_CTL_QP_TID, 7);
+	wqe->flags = cpu_to_be32(MODQP_WRITE_EC);
+	sge_cmd = (3ULL << 56) | FW_RI_SGEEC_START << 8 | 3;
+	wqe->sge_cmd = cpu_to_be64(sge_cmd);
+	wqe->ctx1 = cpu_to_be64(ctx1);
+	wqe->ctx0 = cpu_to_be64(ctx0);
+	PDBG("CtrlQP dma_addr 0x%llx workq %p size %d\n",
+	     (u64) rdev_p->ctrl_qp.dma_addr, rdev_p->ctrl_qp.workq,
+	     1 << T3_CTRL_QP_SIZE_LOG2);
+	skb->priority = CPL_PRIORITY_CONTROL;
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+static int cxio_hal_destroy_ctrl_qp(struct cxio_rdev *rdev_p)
+{
+	dma_free_coherent(&(rdev_p->rnic_info.pdev->dev),
+			  (1UL << T3_CTRL_QP_SIZE_LOG2)
+			  * sizeof(union t3_wr), rdev_p->ctrl_qp.workq,
+			  pci_unmap_addr(&rdev_p->ctrl_qp, mapping));
+	return cxio_hal_clear_qp_ctx(rdev_p, T3_CTRL_QP_ID);
+}
+
+/* Write len bytes of data into addr (32B aligned address).
+ * If data is NULL, clear len bytes of memory to zero.
+ * The caller acquires the sem before the call.
+ */
+static int cxio_hal_ctrl_qp_write_mem(struct cxio_rdev *rdev_p, u32 addr,
+				      u32 len, void *data, int completion)
+{
+	u32 i, nr_wqe, copy_len;
+	u8 *copy_data;
+	u8 wr_len, utx_len;	/* length in 8-byte flits */
+	enum t3_wr_flags flag;
+	__be64 *wqe;
+	u64 utx_cmd;
+	addr &= 0x7FFFFFF;
+	nr_wqe = len % 96 ? len / 96 + 1 : len / 96;	/* 96B max per WQE */
+	PDBG("%s wptr 0x%x rptr 0x%x len %d, nr_wqe %d data %p addr 0x%0x\n",
+	     __FUNCTION__, rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, len,
+	     nr_wqe, data, addr);
+	utx_len = 3;		/* in 32B unit */
+	for (i = 0; i < nr_wqe; i++) {
+		if (Q_FULL(rdev_p->ctrl_qp.rptr, rdev_p->ctrl_qp.wptr,
+		           T3_CTRL_QP_SIZE_LOG2)) {
+			PDBG("%s ctrl_qp full wptr 0x%0x rptr 0x%0x, "
+			     "wait for more space i %d\n", __FUNCTION__,
+			     rdev_p->ctrl_qp.wptr, rdev_p->ctrl_qp.rptr, i);
+			if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     !Q_FULL(rdev_p->ctrl_qp.rptr,
+						     rdev_p->ctrl_qp.wptr,
+						     T3_CTRL_QP_SIZE_LOG2))) {
+				PDBG("%s ctrl_qp workq interrupted\n",
+				     __FUNCTION__);
+				return -ERESTARTSYS;
+			}
+			PDBG("%s ctrl_qp wakeup, continue posting work request "
+			     "i %d\n", __FUNCTION__, i);
+		}
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+						(1 << T3_CTRL_QP_SIZE_LOG2)));
+		flag = 0;
+		if (i == (nr_wqe - 1)) {
+			/* last WQE */
+			flag = completion ? T3_COMPLETION_FLAG : 0;
+			if (len % 32)
+				utx_len = len / 32 + 1;
+			else
+				utx_len = len / 32;
+		}
+
+		/*
+		 * Force a CQE to return the credit to the workq in case
+		 * we posted more than half the max QP size of WRs
+		 */
+		if ((i != 0) &&
+		    (i % (((1 << T3_CTRL_QP_SIZE_LOG2)) >> 1) == 0)) {
+			flag = T3_COMPLETION_FLAG;
+			PDBG("%s force completion at i %d\n", __FUNCTION__, i);
+		}
+
+		/* build the utx mem command */
+		wqe += (sizeof(struct t3_bypass_wr) >> 3);
+		utx_cmd = (T3_UTX_MEM_WRITE << 28) | (addr + i * 3);
+		utx_cmd <<= 32;
+		utx_cmd |= (utx_len << 28) | ((utx_len << 2) + 1);
+		*wqe = cpu_to_be64(utx_cmd);
+		wqe++;
+		copy_data = (u8 *) data + i * 96;
+		copy_len = len > 96 ? 96 : len;
+
+		/* clear memory content if data is NULL */
+		if (data)
+			memcpy(wqe, copy_data, copy_len);
+		else
+			memset(wqe, 0, copy_len);
+		if (copy_len % 32)
+			memset(((u8 *) wqe) + copy_len, 0,
+			       32 - (copy_len % 32));
+		wr_len = ((sizeof(struct t3_bypass_wr)) >> 3) + 1 +
+			 (utx_len << 2);
+		wqe = (__be64 *)(rdev_p->ctrl_qp.workq + (rdev_p->ctrl_qp.wptr %
+			      (1 << T3_CTRL_QP_SIZE_LOG2)));
+
+		/* wptr in the WRID[31:0] */
+		((union t3_wrid *)(wqe+1))->id0.low = rdev_p->ctrl_qp.wptr;
+
+		/*
+		 * This must be the last write with a memory barrier
+		 * for the genbit
+		 */
+		build_fw_riwrh((struct fw_riwrh *) wqe, T3_WR_BP, flag,
+			       Q_GENBIT(rdev_p->ctrl_qp.wptr,
+					T3_CTRL_QP_SIZE_LOG2), T3_CTRL_QP_ID,
+			       wr_len);
+		if (flag == T3_COMPLETION_FLAG)
+			ring_doorbell(rdev_p->ctrl_qp.doorbell, T3_CTRL_QP_ID);
+		len -= 96;
+		rdev_p->ctrl_qp.wptr++;
+	}
+	return 0;
+}
+
+/* IN: stag key, pdid, perm, zbva, to, len, page_size, pbl, and pbl_size
+ * OUT: stag index, actual pbl_size, pbl_addr allocated.
+ * TBD: shared memory region support
+ */
+static int __cxio_tpt_op(struct cxio_rdev *rdev_p, u32 reset_tpt_entry,
+			 u32 *stag, u8 stag_state, u32 pdid,
+			 enum tpt_mem_type type, enum tpt_mem_perm perm,
+			 u32 zbva, u64 to, u32 len, u8 page_size, __be64 *pbl,
+			 u32 *pbl_size, u32 *pbl_addr)
+{
+	int err;
+	struct tpt_entry tpt;
+	u32 stag_idx;
+	u32 wptr;
+	int rereg = (*stag != T3_STAG_UNSET);
+
+	stag_state = stag_state > 0;
+	stag_idx = (*stag) >> 8;
+
+	if (!reset_tpt_entry && !rereg) {
+		stag_idx = cxio_hal_get_stag(rdev_p->rscp);
+		if (!stag_idx)
+			return -ENOMEM;
+		*stag = (stag_idx << 8) | ((*stag) & 0xFF);
+	}
+	PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n",
+	     __FUNCTION__, stag_state, type, pdid, stag_idx);
+	
+	if (reset_tpt_entry)
+		cxio_hal_pblpool_free(rdev_p, *pbl_addr, *pbl_size << 3);
+	else if (!rereg) {
+		*pbl_addr = cxio_hal_pblpool_alloc(rdev_p, *pbl_size << 3);
+		if (!*pbl_addr) {
+			return -ENOMEM;
+		}
+	}
+
+	down_interruptible(&rdev_p->ctrl_qp.sem);
+
+	/* write PBL first if any - update pbl only if pbl list exist */
+	if (pbl) {
+
+		PDBG("%s *pbl_addr 0x%x, pbl_base 0x%x, pbl_size %d\n",
+		     __FUNCTION__, *pbl_addr, rdev_p->rnic_info.pbl_base,
+		     *pbl_size);
+		err = cxio_hal_ctrl_qp_write_mem(rdev_p,
+				(*pbl_addr >> 5),
+				(*pbl_size << 3), pbl, 0);
+		if (err)
+			goto ret;
+	}
+
+	/* write TPT entry */
+	if (reset_tpt_entry)
+		memset(&tpt, 0, sizeof(tpt));
+	else {
+		tpt.valid_stag_pdid = cpu_to_be32(F_TPT_VALID |
+				V_TPT_STAG_KEY((*stag) & M_TPT_STAG_KEY) |
+				V_TPT_STAG_STATE(stag_state) |
+				V_TPT_STAG_TYPE(type) | V_TPT_PDID(pdid));
+		BUG_ON(page_size >= 28);
+		tpt.flags_pagesize_qpid = cpu_to_be32(V_TPT_PERM(perm) |
+			    	F_TPT_MW_BIND_ENABLE |
+				V_TPT_ADDR_TYPE((zbva ? TPT_ZBTO : TPT_VATO)) |
+				V_TPT_PAGE_SIZE(page_size));
+		tpt.rsvd_pbl_addr = reset_tpt_entry ? 0 :
+				    cpu_to_be32(V_TPT_PBL_ADDR(PBL_OFF(rdev_p, *pbl_addr)>>3));
+		tpt.len = cpu_to_be32(len);
+		tpt.va_hi = cpu_to_be32((u32) (to >> 32));
+		tpt.va_low_or_fbo = cpu_to_be32((u32) (to & 0xFFFFFFFFULL));
+		tpt.rsvd_bind_cnt_or_pstag = 0;
+		tpt.rsvd_pbl_size = reset_tpt_entry ? 0 :
+				  cpu_to_be32(V_TPT_PBL_SIZE((*pbl_size) >> 2));
+	}
+	err = cxio_hal_ctrl_qp_write_mem(rdev_p,
+				       stag_idx +
+				       (rdev_p->rnic_info.tpt_base >> 5),
+				       sizeof(tpt), &tpt, 1);
+
+	/* release the stag index to free pool */
+	if (reset_tpt_entry)
+		cxio_hal_put_stag(rdev_p->rscp, stag_idx);
+ret:	
+	wptr = rdev_p->ctrl_qp.wptr;
+	up(&rdev_p->ctrl_qp.sem);
+	if (!err)
+		if (wait_event_interruptible(rdev_p->ctrl_qp.waitq,
+					     SEQ32_GE(rdev_p->ctrl_qp.rptr,
+						      wptr)))
+			return -ERESTARTSYS;
+	return err;
+}
+
+/* IN : stag key, pdid, pbl_size
+ * OUT: stag index, actual pbl_size, and pbl_addr allocated.
+ */
+int cxio_allocate_stag(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return (__cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_NON_SHARED_MR,
+			      perm, 0, 0ULL, 0, 0, NULL, pbl_size, pbl_addr));
+}
+
+int cxio_register_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev_p, u32 *stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 0, stag, 1, pdid, TPT_NON_SHARED_MR, perm,
+			     zbva, to, len, page_size, pbl, pbl_size, pbl_addr);
+}
+
+int cxio_dereg_mem(struct cxio_rdev *rdev_p, u32 stag, u32 pbl_size,
+		   u32 pbl_addr)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     &pbl_size, &pbl_addr);
+}
+
+int cxio_allocate_window(struct cxio_rdev *rdev_p, u32 * stag, u32 pdid)
+{
+	u32 pbl_size = 0;
+	*stag = T3_STAG_UNSET;
+	return __cxio_tpt_op(rdev_p, 0, stag, 0, pdid, TPT_MW, 0, 0, 0ULL, 0, 0,
+			     NULL, &pbl_size, NULL);
+}
+
+int cxio_deallocate_window(struct cxio_rdev *rdev_p, u32 stag)
+{
+	return __cxio_tpt_op(rdev_p, 1, &stag, 0, 0, 0, 0, 0, 0ULL, 0, 0, NULL,
+			     NULL, NULL);
+}
+
+int cxio_rdma_init(struct cxio_rdev *rdev_p, struct t3_rdma_init_attr *attr)
+{
+	struct t3_rdma_init_wr *wqe;
+	struct sk_buff *skb = alloc_skb(sizeof(*wqe), GFP_ATOMIC);
+	if (!skb)
+		return -ENOMEM;
+	PDBG("%s rdev_p %p\n", __FUNCTION__, rdev_p);
+	wqe = (struct t3_rdma_init_wr *) __skb_put(skb, sizeof(*wqe));
+	wqe->wrh.op_seop_flags = cpu_to_be32(V_FW_RIWR_OP(T3_WR_INIT));
+	wqe->wrh.gen_tid_len = cpu_to_be32(V_FW_RIWR_TID(attr->tid) |
+					   V_FW_RIWR_LEN(sizeof(*wqe) >> 3));
+	wqe->wrid.id1 = 0;
+	wqe->qpid = cpu_to_be32(attr->qpid);
+	wqe->pdid = cpu_to_be32(attr->pdid);
+	wqe->scqid = cpu_to_be32(attr->scqid);
+	wqe->rcqid = cpu_to_be32(attr->rcqid);
+	wqe->rq_addr = cpu_to_be32(attr->rq_addr - rdev_p->rnic_info.rqt_base);
+	wqe->rq_size = cpu_to_be32(attr->rq_size);
+	wqe->mpaattrs = attr->mpaattrs;
+	wqe->qpcaps = attr->qpcaps;
+	wqe->ulpdu_size = cpu_to_be16(attr->tcp_emss);
+	wqe->flags = cpu_to_be32(attr->flags);
+	wqe->ord = cpu_to_be32(attr->ord);
+	wqe->ird = cpu_to_be32(attr->ird);
+	wqe->qp_dma_addr = cpu_to_be64(attr->qp_dma_addr);
+	wqe->qp_dma_size = cpu_to_be32(attr->qp_dma_size);
+	wqe->rsvd = 0;
+	skb->priority = 0;	/* 0=>ToeQ; 1=>CtrlQ */
+	return (cxgb3_ofld_send(rdev_p->t3cdev_p, skb));
+}
+
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = ev_cb;
+}
+
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb)
+{
+	cxio_ev_cb = NULL;
+}
+
+static int cxio_hal_ev_handler(struct t3cdev *t3cdev_p, struct sk_buff *skb)
+{
+	static int cnt;
+	struct cxio_rdev *rdev_p = NULL;
+	struct respQ_msg_t *rsp_msg = (struct respQ_msg_t *) skb->data;
+	PDBG("%d: %s cq_id 0x%x cq_ptr 0x%x genbit %0x overflow %0x an %0x"
+	     " se %0x notify %0x cqbranch %0x creditth %0x\n",
+	     cnt, __FUNCTION__, RSPQ_CQID(rsp_msg), RSPQ_CQPTR(rsp_msg),
+	     RSPQ_GENBIT(rsp_msg), RSPQ_OVERFLOW(rsp_msg), RSPQ_AN(rsp_msg),
+	     RSPQ_SE(rsp_msg), RSPQ_NOTIFY(rsp_msg), RSPQ_CQBRANCH(rsp_msg),
+	     RSPQ_CREDIT_THRESH(rsp_msg));
+	PDBG("CQE: QPID 0x%0x genbit %0x type 0x%0x status 0x%0x opcode %d "
+	     "len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n",
+	     CQE_QPID(rsp_msg->cqe), CQE_GENBIT(rsp_msg->cqe),
+	     CQE_TYPE(rsp_msg->cqe), CQE_STATUS(rsp_msg->cqe),
+	     CQE_OPCODE(rsp_msg->cqe), CQE_LEN(rsp_msg->cqe),
+	     CQE_WRID_HI(rsp_msg->cqe), CQE_WRID_LOW(rsp_msg->cqe));
+	rdev_p = (struct cxio_rdev *)t3cdev_p->ulp;
+	if (!rdev_p) {
+		PDBG("%s called by t3cdev %p with null ulp\n", __FUNCTION__,
+		     t3cdev_p);
+		return 0;
+	}
+	if (CQE_QPID(rsp_msg->cqe) == T3_CTRL_QP_ID) {
+		rdev_p->ctrl_qp.rptr = CQE_WRID_LOW(rsp_msg->cqe) + 1;
+		wake_up_interruptible(&rdev_p->ctrl_qp.waitq);
+		dev_kfree_skb_irq(skb);
+	} else if (CQE_QPID(rsp_msg->cqe) == 0xfff8)
+		dev_kfree_skb_irq(skb);
+	else if (cxio_ev_cb)
+		(*cxio_ev_cb) (rdev_p, skb);
+	else
+		dev_kfree_skb_irq(skb);
+	cnt++;
+	return 0;
+}
+
+/* Caller takes care of locking if needed */
+int cxio_rdev_open(struct cxio_rdev *rdev_p)
+{
+	struct net_device *netdev_p = NULL;
+	int err = 0;
+	if (strlen(rdev_p->dev_name)) {
+		if (cxio_hal_find_rdev_by_name(rdev_p->dev_name)) {
+			return -EBUSY;
+		}
+		netdev_p = dev_get_by_name(rdev_p->dev_name);
+		if (!netdev_p) {
+			return -EINVAL;
+		}
+		dev_put(netdev_p);
+	} else if (rdev_p->t3cdev_p) {
+		if (cxio_hal_find_rdev_by_t3cdev(rdev_p->t3cdev_p)) {
+			return -EBUSY;
+		}
+		netdev_p = rdev_p->t3cdev_p->lldev;
+		strncpy(rdev_p->dev_name, rdev_p->t3cdev_p->name,
+			T3_MAX_DEV_NAME_LEN);
+	} else {
+		PDBG("%s t3cdev_p or dev_name must be set\n", __FUNCTION__);
+		return -EINVAL;
+	}
+
+	if (cxio_hal_add_rdev(rdev_p))
+		return -ENOMEM;
+
+	PDBG("%s opening rnic dev %s\n", __FUNCTION__, rdev_p->dev_name);
+	memset(&rdev_p->ctrl_qp, 0, sizeof(rdev_p->ctrl_qp));
+	if (!rdev_p->t3cdev_p)
+		rdev_p->t3cdev_p = T3CDEV(netdev_p);
+	rdev_p->t3cdev_p->ulp = (void *) rdev_p;
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, RDMA_GET_PARAMS,
+					 &(rdev_p->rnic_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+	err = rdev_p->t3cdev_p->ctl(rdev_p->t3cdev_p, GET_PORTS,
+				    &(rdev_p->port_info));
+	if (err) {
+		printk(KERN_ERR "%s t3cdev_p(%p)->ctl returned error %d.\n",
+		     __FUNCTION__, rdev_p->t3cdev_p, err);
+		goto err1;
+	}
+
+	/*
+	 * qpshift is the number of bits to shift the qpid left in order
+	 * to get the correct address of the doorbell for that qp.
+	 */
+	cxio_init_ucontext(rdev_p, &rdev_p->uctx);
+	rdev_p->qpshift = PAGE_SHIFT -
+			  ilog2(65536 >>
+			            ilog2(rdev_p->rnic_info.udbell_len >>
+					      PAGE_SHIFT));
+	rdev_p->qpnr = rdev_p->rnic_info.udbell_len >> PAGE_SHIFT;
+	rdev_p->qpmask = (65536 >> ilog2(rdev_p->qpnr)) - 1;
+	PDBG("%s rnic %s info: tpt_base 0x%0x tpt_top 0x%0x num stags %d "
+	     "pbl_base 0x%0x pbl_top 0x%0x rqt_base 0x%0x, rqt_top 0x%0x\n",
+	     __FUNCTION__, rdev_p->dev_name, rdev_p->rnic_info.tpt_base,
+  	     rdev_p->rnic_info.tpt_top, cxio_num_stags(rdev_p),
+  	     rdev_p->rnic_info.pbl_base,
+  	     rdev_p->rnic_info.pbl_top, rdev_p->rnic_info.rqt_base,
+  	     rdev_p->rnic_info.rqt_top);
+	PDBG("udbell_len 0x%0x udbell_physbase 0x%lx kdb_addr %p qpshift %lu "
+	     "qpnr %d qpmask 0x%x\n",
+	     rdev_p->rnic_info.udbell_len,
+	     rdev_p->rnic_info.udbell_physbase, rdev_p->rnic_info.kdb_addr,
+	     rdev_p->qpshift, rdev_p->qpnr, rdev_p->qpmask);
+
+	err = cxio_hal_init_ctrl_qp(rdev_p);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing ctrl_qp.\n",
+		       __FUNCTION__, err);
+		goto err1;
+	}
+ 	err = cxio_hal_init_resource(rdev_p, cxio_num_stags(rdev_p), 0,
+				     0, T3_MAX_NUM_QP, T3_MAX_NUM_CQ,
+				     T3_MAX_NUM_PD);
+	if (err) {
+		printk(KERN_ERR "%s error %d initializing hal resources.\n",
+		       __FUNCTION__, err);
+		goto err2;
+	}
+ 	err = cxio_hal_pblpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing pbl mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err3;
+ 	}
+ 	err = cxio_hal_rqtpool_create(rdev_p);
+ 	if (err) {
+ 		printk(KERN_ERR "%s error %d initializing rqt mem pool.\n",
+ 		       __FUNCTION__, err);
+ 		goto err4;
+ 	}
+  	return 0;
+err4:
+ 	cxio_hal_pblpool_destroy(rdev_p);
+err3:
+ 	cxio_hal_destroy_resource(rdev_p->rscp);
+err2:
+	cxio_hal_destroy_ctrl_qp(rdev_p);
+err1:
+	cxio_hal_delete_rdev(rdev_p);
+	return err;
+}
+
+void cxio_rdev_close(struct cxio_rdev *rdev_p)
+{
+	if (rdev_p) {
+		cxio_hal_pblpool_destroy(rdev_p);
+		cxio_hal_rqtpool_destroy(rdev_p);
+		cxio_hal_delete_rdev(rdev_p);
+		rdev_p->t3cdev_p->ulp = NULL;
+		cxio_hal_destroy_ctrl_qp(rdev_p);
+		cxio_hal_destroy_resource(rdev_p->rscp);
+	}
+}
+
+int __init cxio_hal_init(void)
+{
+	if (cxio_hal_init_rhdl_resource(T3_MAX_NUM_RI))
+		return -ENOMEM;
+	memset(rdev_tbl, 0, T3_MAX_NUM_RNIC * sizeof(void *));
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, cxio_hal_ev_handler);
+	return 0;
+}
+
+void __exit cxio_hal_exit(void)
+{
+	int i;
+	t3_register_cpl_handler(CPL_ASYNC_NOTIF, NULL);
+	for (i = 0; i < T3_MAX_NUM_RNIC; i++)
+		cxio_rdev_close(rdev_tbl[i]);
+	cxio_hal_destroy_rhdl_resource();
+}
+
+static inline void flush_completed_wrs(struct t3_wq *wq, struct t3_cq *cq)
+{
+	struct t3_swsq *sqp;
+	__u32 ptr = wq->sq_rptr;
+	int count = Q_COUNT(wq->sq_rptr, wq->sq_wptr);
+	
+	sqp = wq->sq + Q_PTR2IDX(ptr, wq->sq_size_log2);
+	while (count--)
+		if (!sqp->signaled) {
+			ptr++;
+			sqp = wq->sq + Q_PTR2IDX(ptr,  wq->sq_size_log2);
+		} else if (sqp->complete) {
+
+			/*
+			 * Insert this completed cqe into the swcq.
+			 */
+			PDBG("%s moving cqe into swcq sq idx %ld cq idx %ld\n",
+			     __FUNCTION__, Q_PTR2IDX(ptr,  wq->sq_size_log2),
+			     Q_PTR2IDX(cq->sw_wptr, cq->size_log2));
+			sqp->cqe.header |= htonl(V_CQE_SWCQE(1));
+			*(cq->sw_queue + Q_PTR2IDX(cq->sw_wptr, cq->size_log2))
+				= sqp->cqe;
+			cq->sw_wptr++;
+			sqp->signaled = 0;
+			break;
+		} else
+			break;
+}
+
+static inline void create_read_req_cqe(struct t3_wq *wq,
+				       struct t3_cqe *hw_cqe,
+				       struct t3_cqe *read_cqe)
+{
+	read_cqe->u.scqe.wrid_hi = wq->oldest_read->sq_wptr;
+	read_cqe->len = wq->oldest_read->read_len;
+	read_cqe->header = htonl(V_CQE_QPID(CQE_QPID(*hw_cqe)) |
+				 V_CQE_SWCQE(SW_CQE(*hw_cqe)) |
+				 V_CQE_OPCODE(T3_READ_REQ) |
+				 V_CQE_TYPE(1));
+}
+
+/*
+ * Point wq->oldest_read at the next read WR in the SWSQ, or NULL if none.
+ */
+static inline void advance_oldest_read(struct t3_wq *wq)
+{
+
+	u32 rptr = wq->oldest_read - wq->sq + 1;
+	u32 wptr = Q_PTR2IDX(wq->sq_wptr, wq->sq_size_log2);
+
+	while (Q_PTR2IDX(rptr, wq->sq_size_log2) != wptr) {
+		wq->oldest_read = wq->sq + Q_PTR2IDX(rptr, wq->sq_size_log2);
+
+		if (wq->oldest_read->opcode == T3_READ_REQ)
+			return;
+		rptr++;
+	}
+	wq->oldest_read = NULL;
+}
+
+/*
+ * cxio_poll_cq
+ *
+ * Caller must:
+ *     check the validity of the first CQE,
+ *     supply the wq associated with the qpid.
+ *
+ * credit: cq credit to return to sge.
+ * cqe_flushed: 1 iff the CQE is flushed.
+ * cqe: copy of the polled CQE.
+ *
+ * return value:
+ *     0       CQE returned,
+ *    -1       CQE skipped, try again.
+ */
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit)
+{
+	int ret = 0;
+	struct t3_cqe *hw_cqe, read_cqe;
+
+	*cqe_flushed = 0;
+	*credit = 0;
+	hw_cqe = cxio_next_cqe(cq);
+
+	PDBG("%s CQE OOO %d qpid 0x%0x genbit %d type %d status 0x%0x"
+	     " opcode 0x%0x len 0x%0x wrid_hi_stag 0x%x wrid_low_msn 0x%x\n",
+	     __FUNCTION__, CQE_OOO(*hw_cqe), CQE_QPID(*hw_cqe),
+	     CQE_GENBIT(*hw_cqe), CQE_TYPE(*hw_cqe), CQE_STATUS(*hw_cqe),
+	     CQE_OPCODE(*hw_cqe), CQE_LEN(*hw_cqe), CQE_WRID_HI(*hw_cqe),
+	     CQE_WRID_LOW(*hw_cqe));
+
+	/*
+	 * skip cqe's not affiliated with a QP.
+	 */
+	if (wq == NULL) {
+		ret = -1;
+		goto skip_cqe;
+	}
+
+	/*
+	 * Gotta tweak READ completions:
+	 * 	1) the cqe doesn't contain the sq_wptr from the wr.
+	 *	2) opcode not reflected from the wr.
+	 *	3) read_len not reflected from the wr.
+	 *	4) cq_type is RQ_TYPE not SQ_TYPE.
+	 */
+	if (RQ_TYPE(*hw_cqe) && (CQE_OPCODE(*hw_cqe) == T3_READ_RESP)) {
+		
+		/*
+	 	 * Don't write to the HWCQ, so create a new read req CQE
+		 * in local memory.
+		 */
+		create_read_req_cqe(wq, hw_cqe, &read_cqe);
+		hw_cqe = &read_cqe;
+		advance_oldest_read(wq);
+	}
+
+	/*
+ 	 * T3A: Discard TERMINATE CQEs.
+	 */
+	if (CQE_OPCODE(*hw_cqe) == T3_TERMINATE) {
+		ret = -1;
+		wq->error = 1;
+		goto skip_cqe;
+	}
+
+	if (CQE_STATUS(*hw_cqe) || wq->error) {
+		*cqe_flushed = wq->error;
+		wq->error = 1;
+	
+		/*
+		 * T3A inserts errors into the CQE.  We cannot return
+	 	 * these as work completions.
+	 	 */
+		/* incoming write failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_RDMA_WRITE)
+		     && RQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		/* incoming read request failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_READ_RESP) && SQ_TYPE(*hw_cqe)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+
+		/* incoming SEND with no receive posted failures */
+		if ((CQE_OPCODE(*hw_cqe) == T3_SEND) && RQ_TYPE(*hw_cqe) &&
+		    Q_EMPTY(wq->rq_rptr, wq->rq_wptr)) {
+			ret = -1;
+			goto skip_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/*
+	 * RECV completion.
+	 */
+	if (RQ_TYPE(*hw_cqe)) {
+
+		/*
+		 * HW only validates 4 bits of MSN.  So we must validate that
+		 * the MSN in the SEND is the next expected MSN.  If it's not,
+		 * then we complete this with TPT_ERR_MSN and mark the wq in
+		 * error.
+		 */
+		if (unlikely((CQE_WRID_MSN(*hw_cqe) != (wq->rq_rptr + 1)))) {
+			wq->error = 1;
+			hw_cqe->header |= htonl(V_CQE_STATUS(TPT_ERR_MSN));
+			goto proc_cqe;
+		}
+		goto proc_cqe;
+	}
+
+	/*
+	 * If we get here it's a send completion.
+	 *
+	 * Handle out of order completion. These get stuffed
+	 * in the SW SQ. Then the SW SQ is walked to move any
+	 * now in-order completions into the SW CQ.  This handles
+	 * 2 cases:
+	 * 	1) reaping unsignaled WRs when the first subsequent
+	 *	   signaled WR is completed.
+	 *	2) out of order read completions.
+	 */
+	if (!SW_CQE(*hw_cqe) && (CQE_WRID_SQ_WPTR(*hw_cqe) != wq->sq_rptr)) {
+		struct t3_swsq *sqp;
+
+		PDBG("%s out of order completion going in swsq at idx %ld\n",
+		     __FUNCTION__,
+		     Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2));
+		sqp = wq->sq +
+		      Q_PTR2IDX(CQE_WRID_SQ_WPTR(*hw_cqe), wq->sq_size_log2);
+		sqp->cqe = *hw_cqe;
+		sqp->complete = 1;
+		ret = -1;
+		goto flush_wq;
+	}
+	
+proc_cqe:
+	*cqe = *hw_cqe;
+
+	/*
+	 * Reap the associated WR(s) that are freed up with this
+	 * completion.
+	 */
+	if (SQ_TYPE(*hw_cqe)) {
+		wq->sq_rptr = CQE_WRID_SQ_WPTR(*hw_cqe);
+		PDBG("%s completing sq idx %ld\n", __FUNCTION__,
+		     Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2));
+		*cookie = (wq->sq +
+			   Q_PTR2IDX(wq->sq_rptr, wq->sq_size_log2))->wr_id;
+		wq->sq_rptr++;
+	} else {
+		PDBG("%s completing rq idx %ld\n", __FUNCTION__,
+		     Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		*cookie = *(wq->rq + Q_PTR2IDX(wq->rq_rptr, wq->rq_size_log2));
+		wq->rq_rptr++;
+	}
+
+flush_wq:
+	/*
+	 * Flush any completed cqes that are now in-order.
+	 */
+	flush_completed_wrs(wq, cq);
+
+skip_cqe:
+	if (SW_CQE(*hw_cqe)) {
+		PDBG("%s cq %p cqid 0x%x skip sw cqe sw_rptr 0x%x\n",
+		     __FUNCTION__, cq, cq->cqid, cq->sw_rptr);
+		++cq->sw_rptr;
+	} else {
+		PDBG("%s cq %p cqid 0x%x skip hw cqe rptr 0x%x\n",
+		     __FUNCTION__, cq, cq->cqid, cq->rptr);
+		++cq->rptr;
+
+		/*
+		 * T3A: compute credits.
+		 */
+		if (((cq->rptr - cq->wptr) > (1 << (cq->size_log2 - 1)))
+		    || ((cq->rptr - cq->wptr) >= 128)) {
+			*credit = cq->rptr - cq->wptr;
+			cq->wptr = cq->rptr;
+		}
+	}
+	return ret;
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_hal.h b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
new file mode 100644
index 0000000..e5e702d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_hal.h
@@ -0,0 +1,201 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef  __CXIO_HAL_H__
+#define  __CXIO_HAL_H__
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#include "t3_cpl.h"
+#include "t3cdev.h"
+#include "cxgb3_ctl_defs.h"
+#include "cxio_wr.h"
+
+#define T3_CTRL_QP_ID    FW_RI_SGEEC_START
+#define T3_CTL_QP_TID	 FW_RI_TID_START
+#define T3_CTRL_QP_SIZE_LOG2  8
+#define T3_CTRL_CQ_ID    0
+
+/* TBD */
+#define T3_MAX_NUM_RNIC  8
+#define T3_MAX_NUM_RI (1<<15)
+#define T3_MAX_NUM_QP (1<<15)
+#define T3_MAX_NUM_CQ (1<<15)
+#define T3_MAX_NUM_PD (1<<15)
+#define T3_MAX_PBL_SIZE 256
+#define T3_MAX_RQ_SIZE 1024
+#define T3_MAX_NUM_STAG (1<<15)
+
+#define T3_STAG_UNSET 0xffffffff
+
+#define T3_MAX_DEV_NAME_LEN 32
+
+struct cxio_hal_ctrl_qp {
+	u32 wptr;
+	u32 rptr;
+	struct semaphore sem;	/* for the wptr, can sleep */
+	wait_queue_head_t waitq;	/* wait for RspQ/CQE msg */
+	union t3_wr *workq;	/* the work request queue */
+	dma_addr_t dma_addr;	/* pci bus address of the workq */
+	DECLARE_PCI_UNMAP_ADDR(mapping)
+	void __iomem *doorbell;
+};
+
+struct cxio_hal_resource {
+	struct kfifo *tpt_fifo;
+	spinlock_t tpt_fifo_lock;
+	struct kfifo *qpid_fifo;
+	spinlock_t qpid_fifo_lock;
+	struct kfifo *cqid_fifo;
+	spinlock_t cqid_fifo_lock;
+	struct kfifo *pdid_fifo;
+	spinlock_t pdid_fifo_lock;
+};
+
+struct cxio_qpid_list {
+	struct list_head entry;
+	u32 qpid;
+};
+
+struct cxio_ucontext {
+	struct list_head qpids;
+	struct mutex lock;
+};
+
+struct cxio_rdev {
+	char dev_name[T3_MAX_DEV_NAME_LEN];
+	struct t3cdev *t3cdev_p;
+	struct rdma_info rnic_info;
+	struct adap_ports port_info;
+	struct cxio_hal_resource *rscp;
+	struct cxio_hal_ctrl_qp ctrl_qp;
+	void *ulp;
+	unsigned long qpshift;
+	u32 qpnr;
+	u32 qpmask;
+	struct cxio_ucontext uctx;
+	struct gen_pool *pbl_pool;
+	struct gen_pool *rqt_pool;
+};
+
+static inline int cxio_num_stags(struct cxio_rdev *rdev_p)
+{
+	return min((int)T3_MAX_NUM_STAG, (int)((rdev_p->rnic_info.tpt_top - rdev_p->rnic_info.tpt_base) >> 5));
+}
+
+typedef void (*cxio_hal_ev_callback_func_t) (struct cxio_rdev * rdev_p,
+					     struct sk_buff * skb);
+
+#define RSPQ_CQID(rsp) (be32_to_cpu(rsp->cq_ptrid) & 0xffff)
+#define RSPQ_CQPTR(rsp) ((be32_to_cpu(rsp->cq_ptrid) >> 16) & 0xffff)
+#define RSPQ_GENBIT(rsp) ((be32_to_cpu(rsp->flags) >> 16) & 1)
+#define RSPQ_OVERFLOW(rsp) ((be32_to_cpu(rsp->flags) >> 17) & 1)
+#define RSPQ_AN(rsp) ((be32_to_cpu(rsp->flags) >> 18) & 1)
+#define RSPQ_SE(rsp) ((be32_to_cpu(rsp->flags) >> 19) & 1)
+#define RSPQ_NOTIFY(rsp) ((be32_to_cpu(rsp->flags) >> 20) & 1)
+#define RSPQ_CQBRANCH(rsp) ((be32_to_cpu(rsp->flags) >> 21) & 1)
+#define RSPQ_CREDIT_THRESH(rsp) ((be32_to_cpu(rsp->flags) >> 22) & 1)
+
+struct respQ_msg_t {
+	__be32 flags;		/* flit 0 */
+	__be32 cq_ptrid;
+	__be64 rsvd;		/* flit 1 */
+	struct t3_cqe cqe;	/* flits 2-3 */
+};
+
+enum t3_cq_opcode {
+	CQ_ARM_AN = 0x2,
+	CQ_ARM_SE = 0x6,
+	CQ_FORCE_AN = 0x3,
+	CQ_CREDIT_UPDATE = 0x7
+};
+
+int cxio_rdev_open(struct cxio_rdev *rdev);
+void cxio_rdev_close(struct cxio_rdev *rdev);
+int cxio_hal_cq_op(struct cxio_rdev *rdev, struct t3_cq *cq,
+	 	   enum t3_cq_opcode op, u32 credit);
+int cxio_hal_clear_qp_ctx(struct cxio_rdev *rdev, u32 qpid);
+int cxio_create_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_destroy_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+int cxio_resize_cq(struct cxio_rdev *rdev, struct t3_cq *cq);
+void cxio_release_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+void cxio_init_ucontext(struct cxio_rdev *rdev, struct cxio_ucontext *uctx);
+int cxio_create_qp(struct cxio_rdev *rdev, u32 kernel_domain, struct t3_wq *wq,
+		   struct cxio_ucontext *uctx);
+int cxio_destroy_qp(struct cxio_rdev *rdev, struct t3_wq *wq,
+		    struct cxio_ucontext *uctx);
+int cxio_peek_cq(struct t3_wq *wr, struct t3_cq *cq, int opcode);
+int cxio_allocate_stag(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+		       enum tpt_mem_perm perm, u32 * pbl_size, u32 * pbl_addr);
+int cxio_register_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_reregister_phys_mem(struct cxio_rdev *rdev, u32 * stag, u32 pdid,
+			   enum tpt_mem_perm perm, u32 zbva, u64 to, u32 len,
+			   u8 page_size, __be64 *pbl, u32 *pbl_size,
+			   u32 *pbl_addr);
+int cxio_dereg_mem(struct cxio_rdev *rdev, u32 stag, u32 pbl_size,
+		   u32 pbl_addr);
+int cxio_allocate_window(struct cxio_rdev *rdev, u32 * stag, u32 pdid);
+int cxio_deallocate_window(struct cxio_rdev *rdev, u32 stag);
+int cxio_rdma_init(struct cxio_rdev *rdev, struct t3_rdma_init_attr *attr);
+void cxio_register_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+void cxio_unregister_ev_cb(cxio_hal_ev_callback_func_t ev_cb);
+u32 cxio_hal_get_rhdl(void);
+void cxio_hal_put_rhdl(u32 rhdl);
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp);
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid);
+int __init cxio_hal_init(void);
+void __exit cxio_hal_exit(void);
+void cxio_flush_rq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_flush_sq(struct t3_wq *wq, struct t3_cq *cq, int count);
+void cxio_count_rcqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_count_scqes(struct t3_cq *cq, struct t3_wq *wq, int *count);
+void cxio_flush_hw_cq(struct t3_cq *cq);
+int cxio_poll_cq(struct t3_wq *wq, struct t3_cq *cq, struct t3_cqe *cqe,
+		     u8 *cqe_flushed, u64 *cookie, u32 *credit);
+
+#define MOD "iw_cxgb3: "
+#define PDBG(fmt, args...) pr_debug(MOD fmt, ## args)
+
+#ifdef DEBUG
+void cxio_dump_tpt(struct cxio_rdev *rev, u32 stag);
+void cxio_dump_pbl(struct cxio_rdev *rev, u32 pbl_addr, uint len, u8 shift);
+void cxio_dump_wqe(union t3_wr *wqe);
+void cxio_dump_wce(struct t3_cqe *wce);
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents);
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid);
+#endif
+
+#endif
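
For readers following the credit logic at the tail of cxio_poll_cq(),
here is a minimal user-space sketch of the batching rule: credits are
handed back only once more than half the CQ has been consumed, or once
128 entries are pending. The struct cq_model type and cq_consume_one()
are hypothetical stand-ins invented for this note; only the threshold
test itself is taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few struct t3_cq fields used here. */
struct cq_model {
	uint32_t rptr;		/* consumer index, free-running */
	uint32_t wptr;		/* index up to which credits were returned */
	uint32_t size_log2;	/* log2 of the CQ depth */
};

/*
 * Consume one hardware CQE and return the number of credits to hand
 * back, or 0 if the batching threshold has not been reached yet.
 */
static uint32_t cq_consume_one(struct cq_model *cq)
{
	uint32_t pending, credit = 0;

	assert(cq->size_log2 >= 1 && cq->size_log2 < 32);
	++cq->rptr;
	pending = cq->rptr - cq->wptr;	/* unsigned math survives wraparound */
	if (pending > (1u << (cq->size_log2 - 1)) || pending >= 128) {
		credit = pending;
		cq->wptr = cq->rptr;
	}
	return credit;
}
```

With an 8-entry CQ (size_log2 = 3) the fifth consumed entry returns five
credits at once; with a 1024-entry CQ the 128-entry cap triggers first.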


From swise at opengridcomputing.com  Wed Dec 20 11:23:26 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:23:26 -0600
Subject: [openib-general] [PATCH v5 11/13] iw_cxgb3 Core Resource Allocation
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192326.19316.22402.stgit@dell3.ogc.int>


Core functions to carve up adapter memory (PBL and RQT pools) and to
allocate stag, qp, cq, and pd IDs.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_resource.c |  331 ++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/core/cxio_resource.h |   70 +++++
 2 files changed, 401 insertions(+), 0 deletions(-)
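
As an illustration of the ID allocation scheme in cxio_resource.c, the
following is a rough user-space sketch of a FIFO of free IDs in which 0
is reserved to mean "nothing available". The struct id_pool type and the
id_pool_* helpers are hypothetical; they stand in for the kernel kfifo
used by the patch and leave out both locking and the randomized fill
order that the patch applies to stags.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for a kfifo of u32 IDs. */
struct id_pool {
	uint32_t *slot;	/* ring storage */
	uint32_t size;	/* capacity in entries, power of 2 */
	uint32_t in;	/* free-running producer index */
	uint32_t out;	/* free-running consumer index */
};

static int id_pool_put(struct id_pool *p, uint32_t id)
{
	if (p->in - p->out == p->size)
		return -1;	/* full */
	p->slot[p->in++ & (p->size - 1)] = id;
	return 0;
}

/* Returns 0 when the pool is empty, so 0 can never be a valid ID. */
static uint32_t id_pool_get(struct id_pool *p)
{
	if (p->in == p->out)
		return 0;
	return p->slot[p->out++ & (p->size - 1)];
}

/* Fill the pool with skip_low .. nr - skip_high - 1, in order. */
static int id_pool_init(struct id_pool *p, uint32_t nr,
			uint32_t skip_low, uint32_t skip_high)
{
	uint32_t i;

	assert(nr && (nr & (nr - 1)) == 0);	/* nr must be a power of 2 */
	assert(skip_low >= 1);			/* keep 0 reserved as "none" */
	assert(nr > skip_low + skip_high);
	p->slot = malloc(nr * sizeof(*p->slot));
	if (!p->slot)
		return -1;
	p->size = nr;
	p->in = p->out = 0;
	for (i = skip_low; i < nr - skip_high; i++)
		id_pool_put(p, i);
	return 0;
}

static void id_pool_destroy(struct id_pool *p)
{
	free(p->slot);
	p->slot = NULL;
}
```

Callers treat a return of 0 from id_pool_get() as exhaustion, which is
why every pool in the patch is created with skip_low = 1.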

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.c b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
new file mode 100644
index 0000000..d1d8722
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.c
@@ -0,0 +1,331 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+/* Crude resource management */
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include "cxio_resource.h"
+#include "cxio_hal.h"
+
+static struct kfifo *rhdl_fifo;
+static spinlock_t rhdl_fifo_lock;
+
+#define RANDOM_SIZE 16
+
+static int __cxio_init_resource_fifo(struct kfifo **fifo,
+				   spinlock_t *fifo_lock,
+				   u32 nr, u32 skip_low,
+				   u32 skip_high,
+				   int random)
+{
+	u32 i, j, entry = 0, idx;
+	u32 random_bytes;
+	u32 rarray[16];
+	spin_lock_init(fifo_lock);
+
+	*fifo = kfifo_alloc(nr * sizeof(u32), GFP_KERNEL, fifo_lock);
+	if (IS_ERR(*fifo))
+		return -ENOMEM;
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		__kfifo_put(*fifo, (unsigned char *) &entry, sizeof(u32));
+	if (random) {
+		j = 0;
+		random_bytes = random32();
+		for (i = 0; i < RANDOM_SIZE; i++)
+			rarray[i] = i + skip_low;
+		for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
+			if (j >= RANDOM_SIZE) {
+				j = 0;
+				random_bytes = random32();
+			}
+			idx = (random_bytes >> (j * 2)) & 0xF;
+			__kfifo_put(*fifo,
+				(unsigned char *) &rarray[idx],
+				sizeof(u32));
+			rarray[idx] = i;
+			j++;
+		}
+		for (i = 0; i < RANDOM_SIZE; i++)
+			__kfifo_put(*fifo,
+				(unsigned char *) &rarray[i],
+				sizeof(u32));
+	} else
+		for (i = skip_low; i < nr - skip_high; i++)
+			__kfifo_put(*fifo, (unsigned char *) &i, sizeof(u32));
+
+	for (i = 0; i < skip_low + skip_high; i++)
+		kfifo_get(*fifo, (unsigned char *) &entry, sizeof(u32));
+	return 0;
+}
+
+static int cxio_init_resource_fifo(struct kfifo **fifo, spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low,
+					  skip_high, 0));
+}
+
+static int cxio_init_resource_fifo_random(struct kfifo **fifo,
+				   spinlock_t * fifo_lock,
+				   u32 nr, u32 skip_low, u32 skip_high)
+{
+
+	return (__cxio_init_resource_fifo(fifo, fifo_lock, nr, skip_low,
+					  skip_high, 1));
+}
+
+static int cxio_init_qpid_fifo(struct cxio_rdev *rdev_p)
+{
+	u32 i;
+
+	spin_lock_init(&rdev_p->rscp->qpid_fifo_lock);
+
+	rdev_p->rscp->qpid_fifo = kfifo_alloc(T3_MAX_NUM_QP * sizeof(u32),
+					      GFP_KERNEL,
+					      &rdev_p->rscp->qpid_fifo_lock);
+	if (IS_ERR(rdev_p->rscp->qpid_fifo))
+		return -ENOMEM;
+
+	for (i = 16; i < T3_MAX_NUM_QP; i++)
+		if (!(i & rdev_p->qpmask))
+			__kfifo_put(rdev_p->rscp->qpid_fifo,
+				    (unsigned char *) &i, sizeof(u32));
+	return 0;
+}
+
+int cxio_hal_init_rhdl_resource(u32 nr_rhdl)
+{
+	return cxio_init_resource_fifo(&rhdl_fifo, &rhdl_fifo_lock, nr_rhdl, 1,
+				       0);
+}
+
+void cxio_hal_destroy_rhdl_resource(void)
+{
+	kfifo_free(rhdl_fifo);
+}
+
+/* nr_* must be power of 2 */
+int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+			   u32 nr_tpt, u32 nr_pbl,
+			   u32 nr_rqt, u32 nr_qpid, u32 nr_cqid, u32 nr_pdid)
+{
+	int err = 0;
+	struct cxio_hal_resource *rscp;
+
+	rscp = kmalloc(sizeof(*rscp), GFP_KERNEL);
+	if (!rscp)
+		return -ENOMEM;
+	rdev_p->rscp = rscp;
+	err = cxio_init_resource_fifo_random(&rscp->tpt_fifo,
+				      &rscp->tpt_fifo_lock,
+				      nr_tpt, 1, 0);
+	if (err)
+		goto tpt_err;
+	err = cxio_init_qpid_fifo(rdev_p);
+	if (err)
+		goto qpid_err;
+	err = cxio_init_resource_fifo(&rscp->cqid_fifo, &rscp->cqid_fifo_lock,
+				      nr_cqid, 1, 0);
+	if (err)
+		goto cqid_err;
+	err = cxio_init_resource_fifo(&rscp->pdid_fifo, &rscp->pdid_fifo_lock,
+				      nr_pdid, 1, 0);
+	if (err)
+		goto pdid_err;
+	return 0;
+pdid_err:
+	kfifo_free(rscp->cqid_fifo);
+cqid_err:
+	kfifo_free(rscp->qpid_fifo);
+qpid_err:
+	kfifo_free(rscp->tpt_fifo);
+tpt_err:
+	return -ENOMEM;
+}
+
+/*
+ * returns 0 if no resource available
+ */
+static inline u32 cxio_hal_get_resource(struct kfifo *fifo)
+{
+	u32 entry;
+	if (kfifo_get(fifo, (unsigned char *) &entry, sizeof(u32)))
+		return entry;
+	else
+		return 0;	/* fifo empty */
+}
+
+static inline void cxio_hal_put_resource(struct kfifo *fifo, u32 entry)
+{
+	BUG_ON(kfifo_put(fifo, (unsigned char *) &entry, sizeof(u32)) == 0);
+}
+
+u32 cxio_hal_get_rhdl(void)
+{
+	return cxio_hal_get_resource(rhdl_fifo);
+}
+
+void cxio_hal_put_rhdl(u32 rhdl)
+{
+	cxio_hal_put_resource(rhdl_fifo, rhdl);
+}
+
+u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->tpt_fifo);
+}
+
+void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag)
+{
+	cxio_hal_put_resource(rscp->tpt_fifo, stag);
+}
+
+u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp)
+{
+	u32 qpid = cxio_hal_get_resource(rscp->qpid_fifo);
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	return qpid;
+}
+
+void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid)
+{
+	PDBG("%s qpid 0x%x\n", __FUNCTION__, qpid);
+	cxio_hal_put_resource(rscp->qpid_fifo, qpid);
+}
+
+u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->cqid_fifo);
+}
+
+void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid)
+{
+	cxio_hal_put_resource(rscp->cqid_fifo, cqid);
+}
+
+u32 cxio_hal_get_pdid(struct cxio_hal_resource *rscp)
+{
+	return cxio_hal_get_resource(rscp->pdid_fifo);
+}
+
+void cxio_hal_put_pdid(struct cxio_hal_resource *rscp, u32 pdid)
+{
+	cxio_hal_put_resource(rscp->pdid_fifo, pdid);
+}
+
+void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp)
+{
+	kfifo_free(rscp->tpt_fifo);
+	kfifo_free(rscp->cqid_fifo);
+	kfifo_free(rscp->qpid_fifo);
+	kfifo_free(rscp->pdid_fifo);
+	kfree(rscp);
+}
+
+/*
+ * PBL Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_PBL_SHIFT 8			/* 256B == min PBL size (32 entries) */
+#define PBL_CHUNK (2*1024*1024)
+
+u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->pbl_pool, size);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size);
+	return (u32)addr;
+}
+
+void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size);
+	gen_pool_free(rdev_p->pbl_pool, (unsigned long)addr, size);
+}
+
+int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->pbl_pool = gen_pool_create(MIN_PBL_SHIFT, -1);
+	if (rdev_p->pbl_pool)
+		for (i = rdev_p->rnic_info.pbl_base;
+		     i <= rdev_p->rnic_info.pbl_top - PBL_CHUNK + 1;
+		     i += PBL_CHUNK)
+			gen_pool_add(rdev_p->pbl_pool, i, PBL_CHUNK, -1);
+	return rdev_p->pbl_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->pbl_pool);
+}
+
+/*
+ * RQT Memory Manager.  Uses Linux generic allocator.
+ */
+
+#define MIN_RQT_SHIFT 10	/* 1KB == min RQT size (16 entries) */
+#define RQT_CHUNK (2*1024*1024)
+
+u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size)
+{
+	unsigned long addr = gen_pool_alloc(rdev_p->rqt_pool, size << 6);
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, (u32)addr, size << 6);
+	return (u32)addr;
+}
+
+void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size)
+{
+	PDBG("%s addr 0x%x size %d\n", __FUNCTION__, addr, size << 6);
+	gen_pool_free(rdev_p->rqt_pool, (unsigned long)addr, size << 6);
+}
+
+int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p)
+{
+	unsigned long i;
+	rdev_p->rqt_pool = gen_pool_create(MIN_RQT_SHIFT, -1);
+	if (rdev_p->rqt_pool)
+		for (i = rdev_p->rnic_info.rqt_base;
+		     i <= rdev_p->rnic_info.rqt_top - RQT_CHUNK + 1;
+		     i += RQT_CHUNK)
+			gen_pool_add(rdev_p->rqt_pool, i, RQT_CHUNK, -1);
+	return rdev_p->rqt_pool ? 0 : -ENOMEM;
+}
+
+void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p)
+{
+	gen_pool_destroy(rdev_p->rqt_pool);
+}
diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_resource.h b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
new file mode 100644
index 0000000..a6bbe83
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_resource.h
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifndef __CXIO_RESOURCE_H__
+#define __CXIO_RESOURCE_H__
+
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/kfifo.h>
+#include <linux/spinlock.h>
+#include <linux/errno.h>
+#include <linux/genalloc.h>
+#include "cxio_hal.h"
+
+extern int cxio_hal_init_rhdl_resource(u32 nr_rhdl);
+extern void cxio_hal_destroy_rhdl_resource(void);
+extern int cxio_hal_init_resource(struct cxio_rdev *rdev_p,
+				  u32 nr_tpt, u32 nr_pbl,
+				  u32 nr_rqt, u32 nr_qpid, u32 nr_cqid,
+				  u32 nr_pdid);
+extern u32 cxio_hal_get_stag(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_stag(struct cxio_hal_resource *rscp, u32 stag);
+extern u32 cxio_hal_get_qpid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_qpid(struct cxio_hal_resource *rscp, u32 qpid);
+extern u32 cxio_hal_get_cqid(struct cxio_hal_resource *rscp);
+extern void cxio_hal_put_cqid(struct cxio_hal_resource *rscp, u32 cqid);
+extern void cxio_hal_destroy_resource(struct cxio_hal_resource *rscp);
+
+#define PBL_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.pbl_base )
+extern int cxio_hal_pblpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_pblpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_pblpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_pblpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+
+#define RQT_OFF(rdev_p, a) ( (a) - (rdev_p)->rnic_info.rqt_base )
+extern int cxio_hal_rqtpool_create(struct cxio_rdev *rdev_p);
+extern void cxio_hal_rqtpool_destroy(struct cxio_rdev *rdev_p);
+extern u32 cxio_hal_rqtpool_alloc(struct cxio_rdev *rdev_p, int size);
+extern void cxio_hal_rqtpool_free(struct cxio_rdev *rdev_p, u32 addr, int size);
+#endif
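
A note on the pool setup loops in cxio_hal_pblpool_create() and
cxio_hal_rqtpool_create(): they add whole 2MB chunks to the generic
allocator and leave out any trailing remainder. The sketch below counts
the chunks such a loop would add for an inclusive [base, top] range.
count_chunks() and rqt_entries_to_bytes() are hypothetical helpers
written for this note, and the small-range guard is an assumption of the
sketch, not something the patch checks.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_BYTES (2u * 1024 * 1024)	/* same value as PBL_CHUNK/RQT_CHUNK */

/*
 * Count the whole chunks that fit in the inclusive range [base, top].
 * This equals the number of iterations of
 *     for (i = base; i <= top - CHUNK + 1; i += CHUNK)
 * but is written without a subtraction that could underflow.
 */
static uint64_t count_chunks(uint64_t base, uint64_t top)
{
	/* Guard for ranges shorter than one chunk (not in the patch). */
	if (top < base || top - base < CHUNK_BYTES - 1)
		return 0;
	return (top - base - (CHUNK_BYTES - 1)) / CHUNK_BYTES + 1;
}

/* The RQT allocator takes a size in entries and scales it to bytes. */
static uint32_t rqt_entries_to_bytes(uint32_t entries)
{
	return entries << 6;	/* 64 bytes per entry, as in size << 6 */
}
```

Sixteen RQT entries come to 1KB, which matches the minimum allocation
implied by MIN_RQT_SHIFT.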


From swise at opengridcomputing.com  Wed Dec 20 11:23:56 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:23:56 -0600
Subject: [openib-general] [PATCH  v5 12/13] iw_cxgb3 Core Debug functions
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192356.19316.82880.stgit@dell3.ogc.int>


Debug code to dump various data structures: TPT, PBL, RQT, and TCB
entries read back from adapter memory, plus WQEs and CQEs held in
host memory.

Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/hw/cxgb3/core/cxio_dbg.c |  205 +++++++++++++++++++++++++++
 1 files changed, 205 insertions(+), 0 deletions(-)
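
For anyone checking the sizing in cxio_dump_pbl(), this small sketch
reproduces its round-up arithmetic: the number of page-list entries
needed to cover len bytes, at one 64-bit entry per page. It assumes the
shift argument is relative to a 4KB base page, which is how the patch's
"shift += 12" reads. pbl_entries() and pbl_bytes() are hypothetical
names used only here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of page-list entries needed to cover len bytes, where the
 * page size is assumed to be 4KB << page_shift.
 */
static uint32_t pbl_entries(uint32_t len, uint8_t page_shift)
{
	unsigned int shift = page_shift + 12u;

	assert(shift < 64);
	/* Round up to a whole number of pages using 64-bit math. */
	return (uint32_t)((len + (1ULL << shift) - 1) >> shift);
}

/* Bytes to read back from the adapter: one 64-bit entry per page. */
static uint32_t pbl_bytes(uint32_t len, uint8_t page_shift)
{
	return pbl_entries(len, page_shift) * (uint32_t)sizeof(uint64_t);
}
```

A region of 4097 bytes with 4KB pages therefore needs two entries, or
16 bytes of PBL.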

diff --git a/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
new file mode 100644
index 0000000..dfaa704
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/core/cxio_dbg.c
@@ -0,0 +1,205 @@
+/*
+ * Copyright (c) 2006 Chelsio, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#ifdef DEBUG
+#include <linux/types.h>
+#include "common.h"
+#include "cxgb3_ioctl.h"
+#include "cxio_hal.h"
+#include "cxio_wr.h"
+
+void cxio_dump_tpt(struct cxio_rdev *rdev, u32 stag)
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size = 32;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = (stag>>8) * 32 + rdev->rnic_info.tpt_base;
+	m->len = size;
+	PDBG("%s TPT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("TPT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_pbl(struct cxio_rdev *rdev, u32 pbl_addr, uint len, u8 shift)
+{
+	struct ch_mem_range *m;
+	u64 *data;
+	int rc;
+	int size, npages;
+
+	shift += 12;
+	npages = (len + (1ULL << shift) - 1) >> shift;
+	size = npages * sizeof(u64);
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = pbl_addr;
+	m->len = size;
+	PDBG("%s PBL addr 0x%x len %d depth %d\n",
+		__FUNCTION__, m->addr, m->len, npages);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("PBL %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_wqe(union t3_wr *wqe)
+{
+	__be64 *data = (__be64 *)wqe;
+	uint size = (uint)(be64_to_cpu(*data) & 0xff);
+
+	if (size == 0)
+		size = 8;
+	while (size > 0) {
+		PDBG("WQE %p: %016llx\n", data, be64_to_cpu(*data));
+		size--;
+		data++;
+	}
+}
+
+void cxio_dump_wce(struct t3_cqe *wce)
+{
+	__be64 *data = (__be64 *)wce;
+	int size = sizeof(*wce);
+
+	while (size > 0) {
+		PDBG("WCE %p: %016llx\n", data, be64_to_cpu(*data));
+		size -= 8;
+		data++;
+	}
+}
+
+void cxio_dump_rqt(struct cxio_rdev *rdev, u32 hwtid, int nents)
+{
+	struct ch_mem_range *m;
+	int size = nents * 64;
+	u64 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_PMRX;
+	m->addr = ((hwtid)<<10) + rdev->rnic_info.rqt_base;
+	m->len = size;
+	PDBG("%s RQT addr 0x%x len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u64 *)m->buf;
+	while (size > 0) {
+		PDBG("RQT %08x: %016llx\n", m->addr, (u64)*data);
+		size -= 8;
+		data++;
+		m->addr += 8;
+	}
+	kfree(m);
+}
+
+void cxio_dump_tcb(struct cxio_rdev *rdev, u32 hwtid)
+{
+	struct ch_mem_range *m;
+	int size = TCB_SIZE;
+	u32 *data;
+	int rc;
+
+	m = kmalloc(sizeof(*m) + size, GFP_ATOMIC);
+	if (!m) {
+		PDBG("%s couldn't allocate memory.\n", __FUNCTION__);
+		return;
+	}
+	m->mem_id = MEM_CM;
+	m->addr = hwtid * size;
+	m->len = size;
+	PDBG("%s TCB %d len %d\n", __FUNCTION__, m->addr, m->len);
+	rc = rdev->t3cdev_p->ctl(rdev->t3cdev_p, RDMA_GET_MEM, m);
+	if (rc) {
+		PDBG("%s toectl returned error %d\n", __FUNCTION__, rc);
+		kfree(m);
+		return;
+	}
+
+	data = (u32 *)m->buf;
+	while (size > 0) {
+		printk("%2u: %08x %08x %08x %08x %08x %08x %08x %08x\n",
+			m->addr,
+			*(data+2), *(data+3), *(data),*(data+1),
+			*(data+6), *(data+7), *(data+4), *(data+5));
+		size -= 32;
+		data += 8;
+		m->addr += 32;
+	}
+	kfree(m);
+}
+#endif


From swise at opengridcomputing.com  Wed Dec 20 11:24:26 2006
From: swise at opengridcomputing.com (Steve Wise)
Date: Wed, 20 Dec 2006 13:24:26 -0600
Subject: [openib-general] [PATCH  v5 13/13] iw_cxgb3 Kconfig/Makefile
In-Reply-To: <20061220191754.19316.4914.stgit@dell3.ogc.int>
References: <20061220191754.19316.4914.stgit@dell3.ogc.int>
Message-ID: <20061220192426.19316.34290.stgit@dell3.ogc.int>


Signed-off-by: Steve Wise <swise at opengridcomputing.com>
---

 drivers/infiniband/Kconfig           |    1 +
 drivers/infiniband/Makefile          |    1 +
 drivers/infiniband/hw/cxgb3/Kconfig  |   27 +++++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb3/Makefile |   12 ++++++++++++
 4 files changed, 41 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 59b3932..06453ab 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,7 @@ source "drivers/infiniband/hw/mthca/Kcon
 source "drivers/infiniband/hw/ipath/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/amso1100/Kconfig"
+source "drivers/infiniband/hw/cxgb3/Kconfig"
 
 source "drivers/infiniband/ulp/ipoib/Kconfig"
 
diff --git a/drivers/infiniband/Makefile b/drivers/infiniband/Makefile
index 570b30a..69bdd55 100644
--- a/drivers/infiniband/Makefile
+++ b/drivers/infiniband/Makefile
@@ -3,6 +3,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)		+= hw/mt
 obj-$(CONFIG_INFINIBAND_IPATH)		+= hw/ipath/
 obj-$(CONFIG_INFINIBAND_EHCA)		+= hw/ehca/
 obj-$(CONFIG_INFINIBAND_AMSO1100)	+= hw/amso1100/
+obj-$(CONFIG_INFINIBAND_CXGB3)		+= hw/cxgb3/
 obj-$(CONFIG_INFINIBAND_IPOIB)		+= ulp/ipoib/
 obj-$(CONFIG_INFINIBAND_SRP)		+= ulp/srp/
 obj-$(CONFIG_INFINIBAND_ISER)		+= ulp/iser/
diff --git a/drivers/infiniband/hw/cxgb3/Kconfig b/drivers/infiniband/hw/cxgb3/Kconfig
new file mode 100644
index 0000000..d3db264
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Kconfig
@@ -0,0 +1,27 @@
+config INFINIBAND_CXGB3
+	tristate "Chelsio RDMA Driver"
+	depends on CHELSIO_T3 && INFINIBAND
+	select GENERIC_ALLOCATOR
+	---help---
+	  This is an iWARP/RDMA driver for the Chelsio T3 1GbE and
+	  10GbE adapters.
+
+	  For general information about Chelsio and our products, visit
+	  our website at <http://www.chelsio.com>.
+
+	  For customer support, please visit our customer support page at
+	  <http://www.chelsio.com/support.htm>.
+
+	  Please send feedback to <linux-bugs at chelsio.com>.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called iw_cxgb3.
+
+config INFINIBAND_CXGB3_DEBUG
+	bool "Verbose debugging output"
+	depends on INFINIBAND_CXGB3
+	default n
+	---help---
+	  This option causes the Chelsio RDMA driver to produce copious
+	  amounts of debug messages.  Select this if you are developing
+	  the driver or trying to diagnose a problem.
diff --git a/drivers/infiniband/hw/cxgb3/Makefile b/drivers/infiniband/hw/cxgb3/Makefile
new file mode 100644
index 0000000..7a89f6d
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb3/Makefile
@@ -0,0 +1,12 @@
+EXTRA_CFLAGS += -I$(TOPDIR)/drivers/net/cxgb3 \
+		-I$(TOPDIR)/drivers/infiniband/hw/cxgb3/core 
+
+obj-$(CONFIG_INFINIBAND_CXGB3) += iw_cxgb3.o
+
+iw_cxgb3-y :=  iwch_cm.o iwch_ev.o iwch_cq.o iwch_qp.o iwch_mem.o \
+	       iwch_provider.o iwch.o core/cxio_hal.o core/cxio_resource.o
+
+ifdef CONFIG_INFINIBAND_CXGB3_DEBUG
+EXTRA_CFLAGS += -DDEBUG -g 
+iw_cxgb3-y += core/cxio_dbg.o
+endif


From vuhuong at mellanox.com  Wed Dec 20 13:23:25 2006
From: vuhuong at mellanox.com (Vu Pham)
Date: Wed, 20 Dec 2006 13:23:25 -0800
Subject: [openib-general] opensm
In-Reply-To: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>
References: <01B9E81EECACE94DBBD0A556E768FB8A01159E51@NAMAIL2.ad.lsil.com>
Message-ID: <4589A9CD.7070502@mellanox.com>

Hi Ashish,

> Hi,
> Please see the information below
> 
> This is what I did:
> /etc/init.d/openibd start
> /etc/init.d/opensmd  start
> modprobe ib_srp
> 
> Issued the command /usr/local/ofed/sbin/ibsrpdm -c    to get the
> information about target and used them in 
> 

By default, without the -d option, ibsrpdm uses
/dev/infiniband/umad0, which corresponds to port 1 of mthca0.

> echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

Using srp-mthca0-1 here is correct. However, in your previous
email, where you reported *I am seeing the error "Got failed
path rec status -110" on Linux console*, the command was:

echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target

You used port 2 of mthca0 there (i.e. srp-mthca0-2), which is why
the path record query failed.
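
For reference, the -110 in that console message is a negated errno
value, and on Linux 110 is ETIMEDOUT. A minimal sketch of decoding such
a status in user space follows. The helper names are made up for
illustration and are not part of srptools or the kernel.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: kernel code conventionally reports failures as
 * -errno, so negate the status before asking libc for the message. */
const char *status_to_text(int status)
{
	return strerror(status < 0 ? -status : status);
}

/* Hypothetical helper: returns 1 when the status is the timeout seen
 * in the "Got failed path rec status -110" message (Linux numbering). */
int status_is_timeout(int status)
{
	return status == -ETIMEDOUT;
}
```

Calling status_to_text(-110) on a Linux host returns the libc message
for a timed-out operation; the exact wording depends on the C library.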

Please retry:
0. Make sure port 1 of the host HCA is connected to the target (since
you connect them directly). Port 2 works as well, but then you have
to use umad1 and srp-mthca0-2 for steps 1 and 2 below.
1. ibsrpdm -c -d /dev/infiniband/umad0
2. echo the target string that ibsrpdm discovered to
/sys/class/infiniband_srp/srp-mthca0-1/add_target

-vu
> 
> Yes, earlier I had a SilverStorm switch which was running an SM, but now
> I have taken that out and am connecting the target and host directly.
> 
> I have only one port connected between the host and the target. 
> The link is not stable because I keep stopping and restarting
> things: this did not seem to be working, and I did not know what the
> issue was until I looked at the console log, which showed "Got failed
> path rec status -110". After seeing that I searched on Google and found
> https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html
> which suggests it is a bug on 64-bit machines.
> BTW, my Linux server is 64-bit.
> When I hooked up a 32-bit server running OFED-1.1, my target was
> discovered with the same procedure.
> 
> So the whole question is: what is the fix for the "Got failed path
> rec status -110" issue on a 64-bit machine?
> 
> Thanks
> Ashish
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, December 19, 2006 10:35 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: RE: [openib-general] opensm
> 
> On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
>> Hi,
>> Please look towards the end of the attached file.
> 
> What options are you starting opensm with ? What is the command line ?
> 
> Also, it looks like (at least at one point) you have another SM on the
> subnet. What is the make (vendor) for your switch ?
> 
> I see many SM port is DOWN. What is going on with this port ? Why is the
> physical link not LinkUp and stable ? That is the main issue and is
> likely why the SubnGet of NodeInfo is not being responded to.
> 
> -- Hal
> 
>> Thanks
>> Ashish
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:halr at voltaire.com] 
>> Sent: Tuesday, December 19, 2006 5:06 PM
>> To: Batwara, Ashish
>> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
>> Subject: Re: [openib-general] opensm
>>
>> Ashish,
>>
>> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
>>> Hi,
>>>
>>> Here is the info that you have asked. I am seeing the Subnet manager
>>> is up now having the port active. But server is not able to discover
>>> the target. I am seeing the error "Got failed path rec status -110"
>>> on Linux console. 
>> That means the request for an SA PathRecord from the initiator to the
>> target failed (-110 is ETIMEDOUT). Are you sure the target is up
>> (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
>>
>> -- Hal
>>
>>> Below are the output of different commands. I am using following to
>>> discover the target:
>>>
>>>  
>>>
>>> /etc/init.d/opensmd start
>>>
>>> /etc/init.d/openibd start
>>>
>>> modprobe ib_srp
>>>
>>> echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
>>>
>>> [root at p49 ~]# ibv_devinfo
>>> hca_id: mthca0
>>>         fw_ver:                         5.1.400
>>>         node_guid:                      0002:c902:0022:cce0
>>>         sys_image_guid:                 0002:c902:0022:cce3
>>>         vendor_id:                      0x02c9
>>>         vendor_part_id:                 25218
>>>         hw_ver:                         0xA0
>>>         board_id:                       MT_0370130002
>>>         phys_port_cnt:                  2
>>>                 port:   1
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>                 port:   2
>>>                         state:                  PORT_ACTIVE (4)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             2048 (4)
>>>                         sm_lid:                 1
>>>                         port_lid:               1
>>>                         port_lmc:               0x00
>>> hca_id: mthca1
>>>         fw_ver:                         5.1.400
>>>         node_guid:                      0002:c902:0022:cd2c
>>>         sys_image_guid:                 0002:c902:0022:cd2f
>>>         vendor_id:                      0x02c9
>>>         vendor_part_id:                 25218
>>>         hw_ver:                         0xA0
>>>         board_id:                       MT_0370130002
>>>         phys_port_cnt:                  2
>>>                 port:   1
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>                 port:   2
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>> [root at p49 ~]# uname -a
>>> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
>>> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> [root at p49 ~]# cat /etc/infiniband/info
>>> #!/bin/bash
>>>
>>> echo prefix=/usr/local/ofed
>>> echo Kernel=2.6.9-42.0.3.ELsmp
>>> echo
>>> echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
>>> --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
>>> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
>>> --with-libsdp --with-openib-diags --with-srptools --with-mstflint
>>> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
>>> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
>>> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
>>> echo
>>>
>>> OFED Version: OFED-1.1
>>
>>
>>> Thanks
>>>
>>> Ashish
>>>
>>> -----Original Message-----
>>> From: Eitan Zahavi [mailto:eitan at mellanox.co.il] 
>>> Sent: Tuesday, December 19, 2006 5:18 AM
>>> To: Batwara, Ashish
>>> Cc: ishai at mellanox.co.il; openib-general at openib.org
>>> Subject: Re: [openib-general] opensm
>>>
>>>  
>>>
>>> Hi Ashish,
>>>
>>>  
>>>
>>> SRP people say they have no such error message.
>>>
>>> OpenSM does. So I take it back.
>>>
>>>  
>>>
>>> Ashish,
>>>
>>> Please provide more info:
>>>
>>>  
>>>
>>> 1. ibv_devinfo
>>>
>>> 2. Version of code you are using
>>>
>>> 3. Command line you use for starting opensm
>>>
>>> 4. /var/log/osm.log
>>>
>>>  
>>>
>>> Thanks and sorry for the confusion.
>>>
>>>  
>>>
>>> EZ
>>>
>>>  
>>>
>>> Eitan Zahavi wrote:
>>>
>>>> This is not an OpenSM issue.
>>>> Forwarded to the SRP people.
>>>> EZ
>>>> Batwara, Ashish wrote:
>>>>   
>>>>> Hi,
>>>>> I am trying to run opensm on Linux server. It has two HCAs (4-ports) and
>>>>> connected to IB Switch. ibnodes command displays the information about
>>>>> the Switch ports and HCA ports.
>>>>> When I start opensm, I see in /var/log/messages "Starting srp_daemon"
>>>>> for all the 4 ports and immediately after I see "failed srp_daemon" for
>>>>> all the ports and the displays "SM Port is down".
>>>>> I tried several times and even rebooted the server few times but no
>>>>> luck.
>>>>> Does anybody know what this problem is?
>>>>> Thanks
>>>>> Ashish
>>>>> _______________________________________________
>>>>> openib-general mailing list
>>>>> openib-general at openib.org
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>


From eeb at bartonsoftware.com  Wed Dec 20 14:22:13 2006
From: eeb at bartonsoftware.com (Eric Barton)
Date: Wed, 20 Dec 2006 22:22:13 GMT
Subject: [openib-general] IB_CM_REJ_INVALID_SERVICE_ID
Message-ID: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com>


Can an rdma_connect() be rejected with IB_CM_REJ_INVALID_SERVICE_ID for any
reason other than the peer not listening with the correct service number?

I've had the following bug report...

> We are testing 1.6b5 for a InfiniBand cluster with RHEL 4. We use the 
> binaries provides by CFS and use OFED 1.1 as the IB stack.
> 
> At several times some of the clients hang during fs mount or when an OST 
> is added (see log).
> Error:
> LustreError: 1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected()) 10.0.90.8 at o2ib 
> rejected: reason 8, size 148
> 
> from OFED:
> enum ib_cm_rej_reason {
>        IB_CM_REJ_INVALID_SERVICE_ID            = 8,
> 
> Once an IPoIB ping is started to the corresponding OST the client 
> continues. Afterwards it is quite stable.

...which seems to be saying that just doing an IPoIB ping to the server was
enough to make rdma_connect() work OK.
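
To make the "reason 8" in that log line easier to read, here is a
minimal sketch of turning a CM reject code into text. Only the value 8
comes from the report above; the enum and function names are
hypothetical stand-ins, not the OFED definitions, and all other codes
are deliberately left undecoded.

```c
/* Hypothetical stand-in for the one reject reason quoted in the report. */
enum rej_reason_sketch {
	REJ_SKETCH_INVALID_SERVICE_ID = 8,
};

/* Map a reject reason to a short description. Anything other than the
 * code quoted in the report is reported as unknown here. */
const char *rej_reason_text(int reason)
{
	switch (reason) {
	case REJ_SKETCH_INVALID_SERVICE_ID:
		return "invalid service ID";
	default:
		return "unknown reason";
	}
}
```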

-- 

                Cheers,
                        Eric


From halr at voltaire.com  Wed Dec 20 14:59:52 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 20 Dec 2006 17:59:52 -0500
Subject: [openib-general] [query]requirement of 'process_mad' in the HCA
 driver
In-Reply-To: <309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com>
References: <2875.47466.qm@web8317.mail.in.yahoo.com>
	<1166104604.28709.126501.camel@hal.voltaire.com>
	<309a667c0612180017g44d9be7dn9cb00dffaa081dd3@mail.gmail.com>
Message-ID: <1166655590.4519.70241.camel@hal.voltaire.com>

On Mon, 2006-12-18 at 03:17, Devesh Sharma wrote: 
> Along similar lines, I am confused about MAD agent creation:
> there is a function in mad.c, ib_agent_port_open(), which creates
> _send_only_ SMAs for GSI and SMI per port.
> 
> There is a function in mthca_mad.c, mthca_create_agents(), which
> _again_ creates two send-only MAD agents for SMI and GSI.
> 
> Why is this driver-specific agent creation required?

Those agents handle the locally generated traps for the mthca (to be
sent up to the SM).

-- Hal

> On 14 Dec 2006 08:57:11 -0500, Hal Rosenstock <halr at voltaire.com> wrote:
> > On Wed, 2006-12-13 at 22:49, keshetti mahesh wrote:
> > > thanks for your reply,
> > >
> > > > The driver is needed to obtain the information for the IB node to
> > > > fill in the MADs for response to the SMA query. It may also issue
> > > > some traps. Similarly for PMA as well.
> > >
> > > Do you mean that the HCA driver is needed to pass the HCA-related
> > > information (like GID, GUID, port_info etc.) to the SMA so that it
> > > can reply to query (or GET) MADs?
> >
> > Yes.
> >
> > >  Isn't SMA capable of doing the same by using "query_(gid, pkey,
> > > port)" verbs.
> >
> > One reason I can think of is that not all the needed information is
> > available via verbs. I think there are some others as well.
> >
> > > And final  questions  if it is really required to implement
> > > 'process_mad' in HCA driver then why it is not specified in the IB
> > > specifications.
> >
> > IB spec is architecture not implementation.
> >
> > > Whose duty is this (replying to query MADs) according to the IB
> > > specs? (It is the SMA's duty, right?)
> >
> > Depends on the MAD but if you are referring to the SMA queries, then yes
> > it is the SMA's responsibility.
> >
> > > I have observed that process_mad is not implemented in the IBM's eHCA
> > > driver. what is the case with it?
> >
> > With eHCA, QP0 is not exposed to the host (at least currently) and the
> > SMA is totally implemented in firmware.
> >
> > > PS: I am considering only SMA in the host s/w here.
> >
> > This is a design choice.
> >
> > -- Hal
> >
> > > regards,
> > > K.Mahesh.
> > >
> > >
> > >
> > >
> > > Hal Rosenstock <halr at voltaire.com> wrote:
> > >         On Wed, 2006-12-13 at 01:55, keshetti mahesh wrote:
> > >         > Hello all,
> > >         >
> > >         > I want to know from you whether it is necessary to implement
> > >         > process_mad for an HCA.
> > >         >
> > >         > After looking into the implementations of process_mad in the
> > >         > ipath and mthca drivers, I have found that they are used to
> > >         > reply to MADs with port_info, gid_info, sm_info etc.
> > >         >
> > >         > But isn't that handled by the SMA in the host?
> > >
> > >         The SMA can either be in the host or in firmware (as is
> > >         typical with the Mellanox silicon).
> > >
> > >         > I am a little bit confused now.
> > >         > Please just tell me whether it is required to implement
> > >         > process_mad for (say) a new HCA driver.
> > >
> > >         It is. For an example of a host (software) SMA, see
> > >         drivers/infiniband/hw/ipath/ipath_mad.c
> > >
> > >         > If it is required, why?
> > >
> > >         The driver is needed to obtain the information for the IB node
> > >         to fill in the MADs for response to the SMA query. It may also
> > >         issue some traps. Similarly for PMA as well.
> > >
> > >         -- Hal
> > >
> > >         > Please CC your replies to me.
> > >         >
> > >         > regards,
> > >         > K.Mahesh.


From dotanb at dev.mellanox.co.il  Wed Dec 20 22:30:57 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 21 Dec 2006 08:30:57 +0200
Subject: [openib-general] RDMA to shared memory causing corruption
In-Reply-To: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com>
References: <2cfcf21e0612200846t41231b45qec26d6f9f9a01a8@mail.gmail.com>
Message-ID: <458A2A21.10709@dev.mellanox.co.il>

Hi Steven.

Steven Wooding wrote:
> Hi,
>  
> I need some advice on a problem I've got RDMAing some data into a 
> shared memory segment.
>  
> Everything works great until I try to transfer a message of 294Kbytes 
> or larger in size. There is some management info in the top end of the 
> share memory segment (we're using Boost shm library). This management 
> area gets corrupted after the RDMA transfer has occurred.
>  
> I've tried various things to try and debug this. Allocating more 
> memory than I need from the shared memory segment for the landing 
> buffer. Making whole shared memory segment larger, and making the 
> management area smaller. But always I'm hit by this 294K limit. I 
> don't know whether it's a problem with Boost shmem or with RDMA 
> writing to memory areas that it shouldn't.
What exactly is the failure that you are seeing?
A failure in memory registration? A completion with error?

Which driver are you using?

thanks
Dotan


From aviram at dev.mellanox.co.il  Thu Dec 21 04:36:58 2006
From: aviram at dev.mellanox.co.il (Aviram Gutman)
Date: Thu, 21 Dec 2006 14:36:58 +0200
Subject: [openib-general] iSER target
In-Reply-To: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>
References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>
Message-ID: <458A7FEA.7070707@dev.mellanox.co.il>

Are you planning to have the iSER target over verbs or kDAPL? Isn't the 
kDAPL development halted?

Aviram

Dan Bar Dov wrote:
> The iser target code in the gen2 branch is functional
> over kdapl. It requires an iscsi target code above it,
> however such an iscsi code is not open.
>
> It was opened as a precursor for an open-source iscsi/iser-target
> project. That project is still in its early stages, and the plan is
> to add iser-target support, loosely based on the open-iser-target 
> code, to the stgt project.
>
> Due to the above, there is no readme/installation guide.
>
> Dan
>
>   
>> -----Original Message-----
>> From: openib-general-bounces at openib.org 
>> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
>> Sent: Wednesday, December 20, 2006 4:03 AM
>> To: openib-general at openib.org
>> Subject: [openib-general] iSER target
>>
>> Hi,
>>
>>     I would like to confirm if the iSER target code in the gen2 branch
>> is functional. If yes, is there a readme/installation guide 
>> available...
>>
>> Thanks a lot!
>>
>> Vishal


From halr at voltaire.com  Thu Dec 21 06:08:11 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 09:08:11 -0500
Subject: [openib-general] OpenSM/osm_ucast_mgr.c: In
 osm_ucast_mgr_set_fwd_table, always reset port state change when set
Message-ID: <1166710089.4519.112824.camel@hal.voltaire.com>

OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset
port state change when set

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
index f663d2d..f546c5f 100644
--- a/osm/opensm/osm_ucast_mgr.c
+++ b/osm/opensm/osm_ucast_mgr.c
@@ -922,7 +922,7 @@ osm_ucast_mgr_set_fwd_table(
   else
     life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
 
-  if (life_state != si.life_state)
+  if ( (life_state != si.life_state) || ib_switch_info_get_state_change( &si ) )
   {
     set_swinfo_require = TRUE;
     si.life_state = life_state;
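
For readers unfamiliar with the field, the check changed above can be
sketched in plain C as follows. This is an illustration only, not the
OpenSM code: the function names are hypothetical, and the position of
the port state change flag (bit 2, mask 0x04) is an assumption based on
the SwitchInfo layout, where the upper five bits of the byte carry the
lifetime value (hence the "& 0xf8" in the patch).

```c
#include <stdint.h>

#define SKETCH_LIFETIME_MASK 0xf8u /* upper five bits, as in the patch */
#define SKETCH_STATE_CHANGE  0x04u /* assumption: port state change flag */

/* Pack a packet lifetime value into the upper five bits of life_state. */
uint8_t sketch_encode_life_state(uint8_t packet_life_time)
{
	return (uint8_t)((packet_life_time << 3) & SKETCH_LIFETIME_MASK);
}

/* Report whether the (assumed) port state change flag is set. */
int sketch_state_change_is_set(uint8_t life_state)
{
	return (life_state & SKETCH_STATE_CHANGE) != 0;
}

/* Mirror of the patched condition: a SwitchInfo Set is required when the
 * wanted value differs from the current one, or when the state change
 * flag is set in the current value. */
int sketch_set_swinfo_required(uint8_t wanted, uint8_t current)
{
	return wanted != current || sketch_state_change_is_set(current);
}
```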


From monis at voltaire.com  Thu Dec 21 06:43:10 2006
From: monis at voltaire.com (Moni Shoua)
Date: Thu, 21 Dec 2006 16:43:10 +0200
Subject: [openib-general] [PATCH v3] IB_mthca HCA profile module
	parameters
In-Reply-To: <adaodq4ig88.fsf@cisco.com>
References: <457BF221.8080701@voltaire.com> <adaodq4ig88.fsf@cisco.com>
Message-ID: <458A9D7E.9080801@voltaire.com>

Roland Dreier wrote:
> OK, the patch below is what I ended up committing.  I am really not
> pleased with the patch you sent and expected me to include -- there
> are really obvious simple-to-fix things that it's just ridiculous for
> you to be sending, eg:
> 
>  > +MODULE_PARM_DESC(num_mpt, 
> 
> trailing whitespace -- please check that your patch applies with 'git
> apply --check --whitespace=error-all'
> 
>  > +		"maximum number of memory protection pable entries per HCA");
> 
> umm, 'pable'??
> 
> and plenty of other things...
> 
> For some reason I felt guilty about letting this patch hang for so
> long, and so I fixed it up, but after doing it this time, I'm not
> going to spend my time like that again.  I have plenty of work to do
> without cleaning up other people's messes...
> 
>     IB/mthca: Add HCA profile module parameters
>     
>     Add module parameters that enable setting some of the HCA
>     profile values, such as the number of QPs, CQs, etc.
>     
>     Signed-off-by: Leonid Arsh <leonida at voltaire.com>
>     Signed-off-by: Moni Shoua <monis at voltaire.com>
>     Signed-off-by: Roland Dreier <rolandd at cisco.com>
> 
> diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
> index 0491ec7..711c1b8 100644
> --- a/drivers/infiniband/hw/mthca/mthca_main.c
> +++ b/drivers/infiniband/hw/mthca/mthca_main.c
> @@ -82,22 +82,59 @@ MODULE_PARM_DESC(tune_pci, "increase PCI burst from the default set by BIOS if n
>  
>  struct mutex mthca_device_mutex;
>  
> +#define MTHCA_DEFAULT_NUM_QP            (1 << 16)
> +#define MTHCA_DEFAULT_RDB_PER_QP        (1 << 2)
> +#define MTHCA_DEFAULT_NUM_CQ            (1 << 16)
> +#define MTHCA_DEFAULT_NUM_MCG           (1 << 13)
> +#define MTHCA_DEFAULT_NUM_MPT           (1 << 17)
> +#define MTHCA_DEFAULT_NUM_MTT           (1 << 20)
> +#define MTHCA_DEFAULT_NUM_UDAV          (1 << 15)
> +#define MTHCA_DEFAULT_NUM_RESERVED_MTTS (1 << 18)
> +#define MTHCA_DEFAULT_NUM_UARC_SIZE     (1 << 18)
> +
> +static struct mthca_profile hca_profile = {
> +	.num_qp             = MTHCA_DEFAULT_NUM_QP,
> +	.rdb_per_qp         = MTHCA_DEFAULT_RDB_PER_QP,
> +	.num_cq             = MTHCA_DEFAULT_NUM_CQ,
> +	.num_mcg            = MTHCA_DEFAULT_NUM_MCG,
> +	.num_mpt            = MTHCA_DEFAULT_NUM_MPT,
> +	.num_mtt            = MTHCA_DEFAULT_NUM_MTT,
> +	.num_udav           = MTHCA_DEFAULT_NUM_UDAV,          /* Tavor only */
> +	.fmr_reserved_mtts  = MTHCA_DEFAULT_NUM_RESERVED_MTTS, /* Tavor only */
> +	.uarc_size          = MTHCA_DEFAULT_NUM_UARC_SIZE,     /* Arbel only */
> +};
> +
> +module_param_named(num_qp, hca_profile.num_qp, int, 0444);
> +MODULE_PARM_DESC(num_qp, "maximum number of QPs per HCA");
> +
> +module_param_named(rdb_per_qp, hca_profile.rdb_per_qp, int, 0444);
> +MODULE_PARM_DESC(rdb_per_qp, "number of RDB buffers per QP");
> +
> +module_param_named(num_cq, hca_profile.num_cq, int, 0444);
> +MODULE_PARM_DESC(num_cq, "maximum number of CQs per HCA");
> +
> +module_param_named(num_mcg, hca_profile.num_mcg, int, 0444);
> +MODULE_PARM_DESC(num_mcg, "maximum number of multicast groups per HCA");
> +
> +module_param_named(num_mpt, hca_profile.num_mpt, int, 0444);
> +MODULE_PARM_DESC(num_mpt,
> +		"maximum number of memory protection table entries per HCA");
> +
> +module_param_named(num_mtt, hca_profile.num_mtt, int, 0444);
> +MODULE_PARM_DESC(num_mtt,
> +		 "maximum number of memory translation table segments per HCA");
> +
> +module_param_named(num_udav, hca_profile.num_udav, int, 0444);
> +MODULE_PARM_DESC(num_udav, "maximum number of UD address vectors per HCA");
> +
> +module_param_named(fmr_reserved_mtts, hca_profile.fmr_reserved_mtts, int, 0444);
> +MODULE_PARM_DESC(fmr_reserved_mtts,
> +		 "number of memory translation table segments reserved for FMR");
> +
>  static const char mthca_version[] __devinitdata =
>  	DRV_NAME ": Mellanox InfiniBand HCA driver v"
>  	DRV_VERSION " (" DRV_RELDATE ")\n";
>  
> -static struct mthca_profile default_profile = {
> -	.num_qp		   = 1 << 16,
> -	.rdb_per_qp	   = 4,
> -	.num_cq		   = 1 << 16,
> -	.num_mcg	   = 1 << 13,
> -	.num_mpt	   = 1 << 17,
> -	.num_mtt	   = 1 << 20,
> -	.num_udav	   = 1 << 15,	/* Tavor only */
> -	.fmr_reserved_mtts = 1 << 18,	/* Tavor only */
> -	.uarc_size	   = 1 << 18,	/* Arbel only */
> -};
> -
>  static int mthca_tune_pci(struct mthca_dev *mdev)
>  {
>  	int cap;
> @@ -303,7 +340,7 @@ static int mthca_init_tavor(struct mthca_dev *mdev)
>  		goto err_disable;
>  	}
>  
> -	profile = default_profile;
> +	profile = hca_profile;
>  	profile.num_uar   = dev_lim.uar_size / PAGE_SIZE;
>  	profile.uarc_size = 0;
>  	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
> @@ -621,7 +658,7 @@ static int mthca_init_arbel(struct mthca_dev *mdev)
>  		goto err_stop_fw;
>  	}
>  
> -	profile = default_profile;
> +	profile = hca_profile;
>  	profile.num_uar  = dev_lim.uar_size / PAGE_SIZE;
>  	profile.num_udav = 0;
>  	if (mdev->mthca_flags & MTHCA_FLAG_SRQ)
> @@ -1278,11 +1315,57 @@ static struct pci_driver mthca_driver = {
>  	.remove		= __devexit_p(mthca_remove_one)
>  };
>  
> +static void __init __mthca_check_profile_val(const char *name, int *pval,
> +					     int pval_default)
> +{
> +	/* value must be positive and power of 2 */
> +	int old_pval = *pval;
> +
> +	if (old_pval <= 0)
> +		*pval = pval_default;
> +	else
> +		*pval = roundup_pow_of_two(old_pval);
> +
> +	if (old_pval != *pval) {
> +		printk(KERN_WARNING PFX "Invalid value %d for %s in module parameter.\n",
> +		       old_pval, name);
> +		printk(KERN_WARNING PFX "Corrected %s to %d.\n", name, *pval);
> +	}
> +}
> +
> +#define mthca_check_profile_val(name, default)				\
> +	__mthca_check_profile_val(#name, &hca_profile.name, default)
> +
> +static void __init mthca_validate_profile(void)
> +{
> +	mthca_check_profile_val(num_qp,            MTHCA_DEFAULT_NUM_QP);
> +	mthca_check_profile_val(rdb_per_qp,        MTHCA_DEFAULT_RDB_PER_QP);
> +	mthca_check_profile_val(num_cq,            MTHCA_DEFAULT_NUM_CQ);
> +	mthca_check_profile_val(num_mcg, 	   MTHCA_DEFAULT_NUM_MCG);
> +	mthca_check_profile_val(num_mpt, 	   MTHCA_DEFAULT_NUM_MPT);
> +	mthca_check_profile_val(num_mtt, 	   MTHCA_DEFAULT_NUM_MTT);
> +	mthca_check_profile_val(num_udav,          MTHCA_DEFAULT_NUM_UDAV);
> +	mthca_check_profile_val(fmr_reserved_mtts, MTHCA_DEFAULT_NUM_RESERVED_MTTS);
> +
> +	if (hca_profile.fmr_reserved_mtts >= hca_profile.num_mtt) {
> +		printk(KERN_WARNING PFX "Invalid fmr_reserved_mtts module parameter %d.\n",
> +		       hca_profile.fmr_reserved_mtts);
> +		printk(KERN_WARNING PFX "(Must be smaller than num_mtt %d)\n",
> +		       hca_profile.num_mtt);
> +		hca_profile.fmr_reserved_mtts = hca_profile.num_mtt / 2;
> +		printk(KERN_WARNING PFX "Corrected fmr_reserved_mtts to %d.\n",
> +		       hca_profile.fmr_reserved_mtts);
> +	}
> +}
> +
>  static int __init mthca_init(void)
>  {
>  	int ret;
>  
>  	mutex_init(&mthca_device_mutex);
> +
> +	mthca_validate_profile();
> +
>  	ret = mthca_catas_init();
>  	if (ret)
>  		return ret;
> 

OK.
Roland, 
Thanks for your help. 
I accept the criticism and I hope to submit better patches next time.
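
For anyone following along, the validation that the quoted patch applies
to each profile value can be sketched in user space like this. It is an
illustration, not the driver code: the function names are hypothetical,
and the loop is a simple stand-in for the kernel's roundup_pow_of_two()
that assumes values no larger than 1 << 30.

```c
/* User-space stand-in for roundup_pow_of_two(); assumes 0 < v <= 1 << 30
 * so that the shift cannot overflow a 32-bit int. */
static int sketch_roundup_pow2(int v)
{
	int p = 1;

	while (p < v)
		p <<= 1;
	return p;
}

/* Same rule as the quoted __mthca_check_profile_val(): non-positive
 * values fall back to the default, everything else is rounded up to
 * the next power of two. */
int sketch_check_profile_val(int val, int val_default)
{
	if (val <= 0)
		return val_default;
	return sketch_roundup_pow2(val);
}

/* Same rule as the quoted fmr_reserved_mtts check: the reserved count
 * must be smaller than num_mtt, otherwise use half of num_mtt. */
int sketch_check_fmr_reserved(int fmr_reserved_mtts, int num_mtt)
{
	if (fmr_reserved_mtts >= num_mtt)
		return num_mtt / 2;
	return fmr_reserved_mtts;
}
```

So, under these rules, a request for 1000 QPs would be corrected to
1024, and a request of 0 would fall back to the default.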

 
From eitan at mellanox.co.il  Thu Dec 21 06:59:45 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 16:59:45 +0200
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <20061220165624.GL31149@sgi.com>
References: <20061220165624.GL31149@sgi.com>
Message-ID: <458AA161.5090708@mellanox.co.il>

Hi Chris,

Sorry for my late response on this:

The simulator is a standalone "server" that clients connect to 
through a TCP/IP socket.

An OpenSM that is not built with the "sim" vendor (using --with-osmv=sim 
--with-sim=<prefix used for ibmgtsim install>)
will not try to connect to the simulator but will go to the real IB 
network instead.

So you need a second "simulator" install of OpenSM.
You can simply clone the GIT tree and
./autogen.sh
./configure --with-osmv=sim --with-sim=<prefix used for ibmgtsim install> --prefix=<somewhere>
make
make install

RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o <somewhere>/bin/opensm

Actually, OsmTest is a test that currently fails (due to the latest changes in 
InformInfo),
but any other *.check.tcl/*.sim.tcl pair should work.

 Eitan


Chris Elmquist wrote:
> Folks,
>
> I am trying to build and run IBMgtsim so that I can explore some different
> topologies and system sizes.  But I am having a lot of trouble getting
> OpenSM to work with the simulator.
>
> I pulled down Eitan's ibutils git tree (to get the simulator) and
> am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> I suspect I have a problem with OpenSM not being built correctly to use
> the simulator.
>
> Does anyone have a recipe on how to build and install all of these pieces
> (ie, openib, openSM and ibmgtsim) so that they will work together?
>
> I have been just trying to run one of the tests provided with the
> simulator like this:
>
> % cd ~/ibutils/ibmgtsim/tests
> % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm
>
> but we get this sort of output:
>
> -I- Using random seed:43204
> -I- Simulation directory is: /tmp/ibmgtsim.29716
> -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top
> o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log
> -I- Simulator Ready
> -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726
> 5 
> -I- Connected to the simulator control server
> -I- Defined 51 guids
> -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a
>  2}
> -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
> -I- Waiting for OpenSM subnet up ...
> -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: 
> ERR 5422: Unable to find requested CA guid 0x2c90000000009
> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5
> 424: Unable to Open Port 0x2c90000000009
> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: 
> ERR 3118: Vendor specific bind failed
> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10:
>  SM MAD Controller bind failed (IB_ERROR)
> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind
> : ERR 1A11: No previous bind
> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>
> Thank you.
>
> Chris
> SGI Network Engineering
>   


From erezz at voltaire.com  Thu Dec 21 07:07:58 2006
From: erezz at voltaire.com (Erez Zilber)
Date: Thu, 21 Dec 2006 17:07:58 +0200
Subject: [openib-general] iSER target
In-Reply-To: <458A7FEA.7070707@dev.mellanox.co.il>
References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>
	<458A7FEA.7070707@dev.mellanox.co.il>
Message-ID: <458AA34E.60206@voltaire.com>

No. We plan to run the iSER target over gen2 verbs.

-- 
Erez Zilber | 972-9-971-7689
Software Engineer, Storage Team
Voltaire – The Grid Backbone
www.voltaire.com <http://www.voltaire.com/>


Aviram Gutman wrote:
> Are you planning to have the iSER target over verbs or kDAPL? Isn't the 
> kDAPL development halted?
>
> Aviram
>
> Dan Bar Dov wrote:
>   
>> The iser target code in the gen2 branch is functional
>> over kdapl. It requires an iscsi target code above it,
>> however such an iscsi code is not open.
>>
>> It was opened as a precursor for an open-source iscsi/iser-target
>> project. That project is still in its early stages, and the plan is
>> to add iser-target support, loosly based on the open-iser-target 
>> code, to the stgt project.
>>
>> Due to the above, there is no readme/installation guide.
>>
>> Dan
>>
>>   
>>     
>>> -----Original Message-----
>>> From: openib-general-bounces at openib.org 
>>> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
>>> Sent: Wednesday, December 20, 2006 4:03 AM
>>> To: openib-general at openib.org
>>> Subject: [openib-general] iSER target
>>>
>>> Hi,
>>>
>>>     I would like to confirm if the iSER target code in the gen2 branch
>>> is functional. If yes, is there a readme/installation guide 
>>> available...
>>>
>>> Thanks a lot!
>>>
>>> Vishal
>>>
>>>
>>> _______________________________________________
>>> openib-general mailing list
>>> openib-general at openib.org
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>> To unsubscribe, please visit 
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>>
>>>     
>>>       


From halr at voltaire.com  Thu Dec 21 07:11:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 10:11:20 -0500
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <458AA161.5090708@mellanox.co.il>
References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il>
Message-ID: <1166713879.4519.115782.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-21 at 09:59, Eitan Zahavi wrote:
> Hi Chris,
> 
> Sorry for my late response on this:
> 
> The simulator is a standalone "server" where clients connect to it 
> through a TCP/IP socket.
> 
> OpenSM which is not built with "sim" vendor (using --with-osmv=sim 
> --with-sim=<prefix used for ibmgtsim install>)
> will not try to connect to the simulator but will go to the real IB 
> network instead.
> 
> So you need a second "simulator" install of OpenSM.
> You can simply clone the GIT tree and
> ./autogen.sh
> ./configure --with-osmv=sim --with-sim=<prefix used for ibmgtsim 
> install> --prefix=<somewhere>
> make
> make install
> 
> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o 
> <somewhere>/bin/opensm

You might want to put this info up on the wiki.

-- Hal

> Actually OsmTest is a test that currently fail (due to last changes in 
> InformInfo),
> but any other *.check.tcl/*.sim.tcl pair should work.
> 
>  Eitan
> 
> 
> Chris Elmquist wrote:
> > Folks,
> >
> > I am trying to build and run IBMgtsim so that I can explore some different
> > topologies and system sizes.  But I am having a lot of trouble getting
> > OpenSM to work with the simulator.
> >
> > I pulled down Eitan's ibutils git tree (to get the simulator) and
> > am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> > I suspect I have a problem with OpenSM not being built correctly to use
> > the simulator.
> >
> > Does anyone have a recipe on how to build and install all of these pieces
> > (ie, openib, openSM and ibmgtsim) so that they will work together?
> >
> > I have been just trying to run one of the tests provided with the
> > simulator like this:
> >
> > % cd ~/ibutils/ibmgtsim/tests
> > % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm
> >
> > but we get this sort of output:
> >
> > -I- Using random seed:43204
> > -I- Simulation directory is: /tmp/ibmgtsim.29716
> > -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top
> > o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log
> > -I- Simulator Ready
> > -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726
> > 5 
> > -I- Connected to the simulator control server
> > -I- Defined 51 guids
> > -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a
> >  2}
> > -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
> > -I- Waiting for OpenSM subnet up ...
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: 
> > ERR 5422: Unable to find requested CA guid 0x2c90000000009
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5
> > 424: Unable to Open Port 0x2c90000000009
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: 
> > ERR 3118: Vendor specific bind failed
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10:
> >  SM MAD Controller bind failed (IB_ERROR)
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind
> > : ERR 1A11: No previous bind
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >
> > Thank you.
> >
> > Chris
> > SGI Network Engineering
> >   
> 
> 


From halr at voltaire.com  Thu Dec 21 07:29:13 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 10:29:13 -0500
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <458AA161.5090708@mellanox.co.il>
References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il>
Message-ID: <1166714952.4519.116610.camel@hal.voltaire.com>

On Thu, 2006-12-21 at 09:59, Eitan Zahavi wrote:
> Hi Chris,
> 
> Sorry for my late response on this:
> 
> The simulator is a standalone "server" where clients connect to it 
> through a TCP/IP socket.
> 
> OpenSM which is not built with "sim" vendor (using --with-osmv=sim 
> --with-sim=<prefix used for ibmgtsim install>)
> will not try to connect to the simulator but will go to the real IB 
> network instead.
> 
> So you need a second "simulator" install of OpenSM.
> You can simply clone the GIT tree and
> ./autogen.sh
> ./configure --with-osmv=sim --with-sim=<prefix used for ibmgtsim 
> install> --prefix=<somewhere>
> make
> make install
> 
> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o 
> <somewhere>/bin/opensm
> 
> Actually OsmTest is a test that currently fail (due to last changes in 
> InformInfo),

This could easily be worked around by commenting out those tests in
osmtest.c.

-- Hal

> but any other *.check.tcl/*.sim.tcl pair should work.
> 
>  Eitan
> 
> 
> Chris Elmquist wrote:
> > Folks,
> >
> > I am trying to build and run IBMgtsim so that I can explore some different
> > topologies and system sizes.  But I am having a lot of trouble getting
> > OpenSM to work with the simulator.
> >
> > I pulled down Eitan's ibutils git tree (to get the simulator) and
> > am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> > I suspect I have a problem with OpenSM not being built correctly to use
> > the simulator.
> >
> > Does anyone have a recipe on how to build and install all of these pieces
> > (ie, openib, openSM and ibmgtsim) so that they will work together?
> >
> > I have been just trying to run one of the tests provided with the
> > simulator like this:
> >
> > % cd ~/ibutils/ibmgtsim/tests
> > % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o /usr/local/bin/opensm
> >
> > but we get this sort of output:
> >
> > -I- Using random seed:43204
> > -I- Simulation directory is: /tmp/ibmgtsim.29716
> > -I- Calling IBMgtSim -s 43204 -V 0xA3 -t /root/ibutils/ibmgtsim/tests/IS1-16.top
> > o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l /tmp/ibmgtsim.29716/sim.log
> > -I- Simulator Ready
> > -I- Connecting to the simulator control server:pcplod.americas.sgi.com port:3726
> > 5 
> > -I- Connected to the simulator control server
> > -I- Defined 51 guids
> > -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} {0x0002c9000000000a
> >  2}
> > -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
> > -I- Waiting for OpenSM subnet up ...
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> osm_vendor_open_port: 
> > ERR 5422: Unable to find requested CA guid 0x2c90000000009
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: ERR 5
> > 424: Unable to Open Port 0x2c90000000009
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> osm_sm_mad_ctrl_bind: 
> > ERR 3118: Vendor specific bind failed
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 2E10:
> >  SM MAD Controller bind failed (IB_ERROR)
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> > -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> osm_sa_mad_ctrl_unbind
> > : ERR 1A11: No previous bind
> > -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >
> > Thank you.
> >
> > Chris
> > SGI Network Engineering
> >   
> 
> 


From chrise at sgi.com  Thu Dec 21 09:02:19 2006
From: chrise at sgi.com (Chris Elmquist)
Date: Thu, 21 Dec 2006 11:02:19 -0600
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <458AA161.5090708@mellanox.co.il>
References: <20061220165624.GL31149@sgi.com> <458AA161.5090708@mellanox.co.il>
Message-ID: <20061221170219.GH19625@sgi.com>

Hi Guys...

Thank you very much for the recipe.  We actually succeeded in getting
it to go just after posting to the list, but these instructions will now
confirm whether we did it the right way or not.

Are there any guidelines for how big a network the simulator can
deal with?  Maybe something that relates it to available memory on the
platform it is running on, or to other resource issues?  We threw one model
at it already which tipped it over, but we are certainly not sure we are
using it the right way yet.

Thanks again.  We hope to be active participants in this space going
forward and as soon as we know what we are doing, we'll feed it back to
the group.

Chris 

On Thursday (12/21/2006 at 04:59PM +0200), Eitan Zahavi wrote:
> Hi Chris,
> 
> Sorry for my late response on this:
> 
> The simulator is a standalone "server" where clients connect to it 
> through a TCP/IP socket.
> 
> OpenSM which is not built with "sim" vendor (using --with-osmv=sim 
> --with-sim=<prefix used for ibmgtsim install>)
> will not try to connect to the simulator but will go to the real IB 
> network instead.
> 
> So you need a second "simulator" install of OpenSM.
> You can simply clone the GIT tree and
> ./autogen.sh
> ./configure --with-osmv=sim --with-sim=<prefix used for ibmgtsim 
> install> --prefix=<somewhere>
> make
> make install
> 
> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o 
> <somewhere>/bin/opensm
> 
> Actually OsmTest is a test that currently fail (due to last changes in 
> InformInfo),
> but any other *.check.tcl/*.sim.tcl pair should work.
> 
> Eitan
> 
> 
> Chris Elmquist wrote:
> >Folks,
> >
> >I am trying to build and run IBMgtsim so that I can explore some different
> >topologies and system sizes.  But I am having a lot of trouble getting
> >OpenSM to work with the simulator.
> >
> >I pulled down Eitan's ibutils git tree (to get the simulator) and
> >am otherwise using the OFED 1.1 tarball for the rest of the stuff.
> >I suspect I have a problem with OpenSM not being built correctly to use
> >the simulator.
> >
> >Does anyone have a recipe on how to build and install all of these pieces
> >(ie, openib, openSM and ibmgtsim) so that they will work together?
> >
> >I have been just trying to run one of the tests provided with the
> >simulator like this:
> >
> >% cd ~/ibutils/ibmgtsim/tests
> >% RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o 
> >/usr/local/bin/opensm
> >
> >but we get this sort of output:
> >
> >-I- Using random seed:43204
> >-I- Simulation directory is: /tmp/ibmgtsim.29716
> >-I- Calling IBMgtSim -s 43204 -V 0xA3 -t 
> >/root/ibutils/ibmgtsim/tests/IS1-16.top
> >o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l 
> >/tmp/ibmgtsim.29716/sim.log
> >-I- Simulator Ready
> >-I- Connecting to the simulator control server:pcplod.americas.sgi.com 
> >port:3726
> >5 
> >-I- Connected to the simulator control server
> >-I- Defined 51 guids
> >-I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} 
> >{0x0002c9000000000a
> > 2}
> >-I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
> >-I- Waiting for OpenSM subnet up ...
> >-I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> 
> >osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 
> >0x2c90000000009
> >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >-I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: 
> >ERR 5
> >424: Unable to Open Port 0x2c90000000009
> >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >-I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> 
> >osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
> >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >-I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 
> >2E10:
> > SM MAD Controller bind failed (IB_ERROR)
> >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >-I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> 
> >osm_sa_mad_ctrl_unbind
> >: ERR 1A11: No previous bind
> >-I- New 1 events of /tmp/ibmgtsim.29716/osm.log
> >
> >Thank you.
> >
> >Chris
> >SGI Network Engineering
> >  

-- 
Chris Elmquist          mailto:chrise at sgi.com      (651)683-3093
                        Silicon Graphics, Inc.     Eagan, MN


From eitan at mellanox.co.il  Thu Dec 21 11:09:24 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:09:24 +0200
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <20061221170219.GH19625@sgi.com>
References: <20061220165624.GL31149@sgi.com>
	<458AA161.5090708@mellanox.co.il> <20061221170219.GH19625@sgi.com>
Message-ID: <458ADBE4.708@mellanox.co.il>

Chris Elmquist wrote:
> Hi Guys...
>
> Thank you very much for the recipe.  We actually had a success getting
> it to go just after posting to the list but these instructions will now
> confirm whether we did it the right way or not.
>
> Are there any guidelines for how big of a network the simulator can
> deal with?  Maybe something that relates it to available memory on the
> platform it is running or other resource issues?  We threw one model
> at it already which tipped it over but we are certainly not sure we are
> using it the right way yet.
>   
I was able to simulate 10K nodes in the past.
What I did to get there was to use two machines: one for the simulator 
and one for the SM.
I also used 64-bit (x86_64) machines to avoid the ~3GB data limit of 
32-bit processes.
> Thanks again.  We hope to be activate participants in this space going
> forward and as soon as we know what we are doing, we'll feed it back to
> the group.
>
> Chris 
>
> On Thursday (12/21/2006 at 04:59PM +0200), Eitan Zahavi wrote:
>   
>> Hi Chris,
>>
>> Sorry for my late response on this:
>>
>> The simulator is a standalone "server" where clients connect to it 
>> through a TCP/IP socket.
>>
>> OpenSM which is not built with "sim" vendor (using --with-osmv=sim 
>> --with-sim=<prefix used for ibmgtsim install>)
>> will not try to connect to the simulator but will go to the real IB 
>> network instead.
>>
>> So you need a second "simulator" install of OpenSM.
>> You can simply clone the GIT tree and
>> ./autogen.sh
>> ./configure --with-osmv=sim --with-sim=<prefix used for ibmgtsim 
>> install> --prefix=<somewhere>
>> make
>> make install
>>
>> RunSimTest -f OsmTest.sim.tcl -c OsmTest.check.tcl -t IS1-16.topo -o 
>> <somewhere>/bin/opensm
>>
>> Actually OsmTest is a test that currently fail (due to last changes in 
>> InformInfo),
>> but any other *.check.tcl/*.sim.tcl pair should work.
>>
>> Eitan
>>
>>
>> Chris Elmquist wrote:
>>     
>>> Folks,
>>>
>>> I am trying to build and run IBMgtsim so that I can explore some different
>>> topologies and system sizes.  But I am having a lot of trouble getting
>>> OpenSM to work with the simulator.
>>>
>>> I pulled down Eitan's ibutils git tree (to get the simulator) and
>>> am otherwise using the OFED 1.1 tarball for the rest of the stuff.
>>> I suspect I have a problem with OpenSM not being built correctly to use
>>> the simulator.
>>>
>>> Does anyone have a recipe on how to build and install all of these pieces
>>> (ie, openib, openSM and ibmgtsim) so that they will work together?
>>>
>>> I have been just trying to run one of the tests provided with the
>>> simulator like this:
>>>
>>> % cd ~/ibutils/ibmgtsim/tests
>>> % RunSimTest -c OsmTest.check.tcl -f OsmTest.sim.tcl -t IS1-16.topo -o 
>>> /usr/local/bin/opensm
>>>
>>> but we get this sort of output:
>>>
>>> -I- Using random seed:43204
>>> -I- Simulation directory is: /tmp/ibmgtsim.29716
>>> -I- Calling IBMgtSim -s 43204 -V 0xA3 -t 
>>> /root/ibutils/ibmgtsim/tests/IS1-16.top
>>> o -f /root/ibutils/ibmgtsim/tests/OsmTest.sim.tcl -l 
>>> /tmp/ibmgtsim.29716/sim.log
>>> -I- Simulator Ready
>>> -I- Connecting to the simulator control server:pcplod.americas.sgi.com 
>>> port:3726
>>> 5 
>>> -I- Connected to the simulator control server
>>> -I- Defined 51 guids
>>> -I- Node H-1 data: 0x0002c90000000008 {0x0002c90000000009 1} 
>>> {0x0002c9000000000a
>>> 2}
>>> -I- Starting: /usr/local/bin/opensm -g 0x0002c90000000009  ...
>>> -I- Waiting for OpenSM subnet up ...
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470415 [5100E100] -> 
>>> osm_vendor_open_port: ERR 5422: Unable to find requested CA guid 
>>> 0x2c90000000009
>>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470419 [5100E100] -> osm_vendor_bind: 
>>> ERR 5
>>> 424: Unable to Open Port 0x2c90000000009
>>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470422 [5100E100] -> 
>>> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
>>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470427 [5100E100] -> osm_sm_bind: ERR 
>>> 2E10:
>>> SM MAD Controller bind failed (IB_ERROR)
>>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>>> -I- OpenSM Event:ERR Dec 20 10:53:09 470434 [5100E100] -> 
>>> osm_sa_mad_ctrl_unbind
>>> : ERR 1A11: No previous bind
>>> -I- New 1 events of /tmp/ibmgtsim.29716/osm.log
>>>
>>> Thank you.
>>>
>>> Chris
>>> SGI Network Engineering
>>>  
>>>       
>
>   


From eitan at mellanox.co.il  Thu Dec 21 11:10:32 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:10:32 +0200
Subject: [openib-general] OpenSM/osm_ucast_mgr.c: In
 osm_ucast_mgr_set_fwd_table, always reset port state change when set
In-Reply-To: <1166710089.4519.112824.camel@hal.voltaire.com>
References: <1166710089.4519.112824.camel@hal.voltaire.com>
Message-ID: <458ADC28.80305@mellanox.co.il>

Good catch.
Hal Rosenstock wrote:
> OpenSM/osm_ucast_mgr.c: In osm_ucast_mgr_set_fwd_table, always reset
> port state change when set
>
> Signed-off-by: Hal Rosenstock <halr at voltaire.com>
>
> diff --git a/osm/opensm/osm_ucast_mgr.c b/osm/opensm/osm_ucast_mgr.c
> index f663d2d..f546c5f 100644
> --- a/osm/opensm/osm_ucast_mgr.c
> +++ b/osm/opensm/osm_ucast_mgr.c
> @@ -922,7 +922,7 @@ osm_ucast_mgr_set_fwd_table(
>    else
>      life_state = (p_mgr->p_subn->opt.packet_life_time <<3 ) & 0xf8;
>  
> -  if (life_state != si.life_state)
> +  if ( (life_state != si.life_state) || ib_switch_info_get_state_change( &si ) )
>    {
>      set_swinfo_require = TRUE;
>      si.life_state = life_state;
>
>
>
>
>   


From eitan at mellanox.co.il  Thu Dec 21 11:14:11 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:14:11 +0200
Subject: [openib-general] [PATCH] osm: fix simulator vendor not initializing
 complete mad address
Message-ID: <458ADD03.4020909@mellanox.co.il>

Hi Hal,

This fix resolves the issue I have seen in the osmtest InformInfo flow.
I am still not sure it is correct for the SA InformInfo receiver to 
compare sender addresses by simply comparing the entire osm_mad_addr 
structure. But anyway, at least the simulator now behaves like the rest 
of the stacks.

The fix makes sure we initialize the complete MAD address structure before 
copying the relevant data.

Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
---
 osm/libvendor/osm_vendor_mlx_sim.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/osm/libvendor/osm_vendor_mlx_sim.c b/osm/libvendor/osm_vendor_mlx_sim.c
index 4692df0..d3e6eeb 100644
--- a/osm/libvendor/osm_vendor_mlx_sim.c
+++ b/osm/libvendor/osm_vendor_mlx_sim.c
@@ -381,6 +381,7 @@ __osmv_ibms_mad_addr_to_osm_addr(
   IN uint8_t is_smi,
   OUT osm_mad_addr_t *p_osm_addr)
 {
+  memset(p_osm_addr, 0, sizeof(osm_mad_addr_t));
   p_osm_addr->dest_lid = cl_hton16(p_ibms_addr->slid);
   p_osm_addr->static_rate = 0;
   p_osm_addr->path_bits = 0;
-- 
1.4.4.1.GIT
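
Why the one-line memset matters can be shown with a small standalone sketch. It is illustrative only: demo_addr_t is a hypothetical four-field record, not the real osm_mad_addr_t layout, and demo_fill() and demo_same_sender() are made-up helpers standing in for an address converter and for a receiver that compares whole structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical address record; NOT the real osm_mad_addr_t layout. */
typedef struct demo_addr {
	uint16_t dest_lid;
	uint8_t  path_bits;
	uint8_t  static_rate;
	uint32_t remote_qp;	/* a field the converter below never sets */
} demo_addr_t;

/* Fill only some of the fields, the way a converter might.
 * With zero_first set, the whole record is cleared before filling. */
void demo_fill(demo_addr_t *out, uint16_t lid, int zero_first)
{
	assert(out != NULL);
	if (zero_first)
		memset(out, 0, sizeof(*out));
	out->dest_lid = lid;
	out->path_bits = 0;
	out->static_rate = 0;
}

/* Whole-structure comparison: any stale byte makes two addresses differ. */
int demo_same_sender(const demo_addr_t *a, const demo_addr_t *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

If two records start out holding different leftover bytes and are filled for the same LID without zeroing, the comparison reports different senders; with zeroing first they compare equal.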


From eitan at mellanox.co.il  Thu Dec 21 11:16:59 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 21 Dec 2006 21:16:59 +0200
Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to return
 error when expected error does not happen
Message-ID: <458ADDAB.80301@mellanox.co.il>

Hi Hal,

I have found that on BAD InformInfo transactions, where osmtest 
expects an error from the SM, it fails to return an error to the 
calling procedure when the SM unexpectedly succeeds, which makes 
osmtest pass the test.

EZ
Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>

---
 osm/osmtest/osmtest.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index b1df333..e1c64ef 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* InformInfoRecord tests */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a BAD - Set Unsubscribe request\n"); 
   memset( &inform_info_opt, 0, sizeof( inform_info_opt ) );
   memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
-                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
+                                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
                                        &context );
   if ( status == IB_SUCCESS )
+  {
+    status = IB_ERROR;
     goto Exit;
+  }
   else
   {
     osm_log( &p_osmt->log, OSM_LOG_ERROR,
@@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_
              "IS EXPECTED ERROR ^^^^\n");
   }
 
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+              "osmtest_informinfo_request: InformInfoRecord "
+              "Sending a Good - Empty GetTable request\n"); 
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
-                       IB_MAD_METHOD_GETTABLE,
+                                                    IB_MAD_METHOD_GETTABLE,
                                        &inform_info_rec_opt, &context );
   if ( status != IB_SUCCESS )
     goto Exit;
 
   /* InformInfo tests */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a BAD - Empty Get request "
+           "(should fail with NO_RECORDS)\n"); 
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
                                        IB_MAD_METHOD_GET, &inform_info_opt,
                                        &context );
   if ( status == IB_SUCCESS )
+  {
+    status = IB_ERROR;
     goto Exit;
+  }
   else
   {
     osm_log( &p_osmt->log, OSM_LOG_ERROR,
@@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_
              "IS EXPECTED ERROR ^^^^\n");
   }
 
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a BAD - Set Unsubscribe request\n"); 
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
                                        IB_MAD_METHOD_SET, &inform_info_opt,
                                        &context );
   if ( status == IB_SUCCESS )
+  {
+    status = IB_ERROR;
     goto Exit;
+  }
   else
   {
     osm_log( &p_osmt->log, OSM_LOG_ERROR,
@@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_
   }
 
   /* Now subscribe */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - Set Subscribe request\n");
   inform_info_opt.subscribe = TRUE;
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
@@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Now unsubscribe (QPN needs to be 1 to work) */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - Set Unsubscribe request\n");
   inform_info_opt.subscribe = FALSE;
   inform_info_opt.qpn = 1;
   memset( &context, 0, sizeof( context ) );
@@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Now subscribe again */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - Set Subscribe request\n");
   inform_info_opt.subscribe = TRUE;
   inform_info_opt.qpn = 1;
   memset( &context, 0, sizeof( context ) );
@@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Subscribe over existing subscription */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - Set Subscribe (again) request\n");
   inform_info_opt.qpn = 0;
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
@@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_
 
   /* More InformInfoRecord tests */
   /* RID lookup (with currently invalid enum) */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - GetTable by GID\n");
   ib_gid_set_default( &inform_info_rec_opt.subscriber_gid,
                       p_osmt->local_port.port_guid );
   inform_info_rec_opt.subscriber_enum = 1;
@@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Enum lookup */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - GetTable (subsriber_enum == 0) request\n");
   inform_info_rec_opt.subscriber_enum = 0;
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, 
IB_MAD_ATTR_INFORM_INFO_RECORD,
@@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Get all InformInfoRecords */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - GetTable (ALL records) request\n");
   memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, 
IB_MAD_ATTR_INFORM_INFO_RECORD,
@@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_
     goto Exit;
 
   /* Cleanup subscriptions before further testing */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+           "osmtest_informinfo_request: InformInfoRecord "
+           "Sending a Good - Set (cleanup all of them) request\n");
   inform_info_opt.subscribe = FALSE;
   inform_info_opt.qpn = 1;
   memset( &context, 0, sizeof( context ) );
-- 
1.4.4.1.GIT


From chrise at sgi.com  Thu Dec 21 11:26:58 2006
From: chrise at sgi.com (Chris Elmquist)
Date: Thu, 21 Dec 2006 13:26:58 -0600
Subject: [openib-general] building and running IBMgtsim?
In-Reply-To: <458ADBE4.708@mellanox.co.il>
References: <20061220165624.GL31149@sgi.com>
	<458AA161.5090708@mellanox.co.il> <20061221170219.GH19625@sgi.com>
	<458ADBE4.708@mellanox.co.il>
Message-ID: <20061221192658.GJ19625@sgi.com>

On Thursday (12/21/2006 at 09:09PM +0200), Eitan Zahavi wrote:
> I was able to simulate 10K nodes in the past.
> What I did to get there was to use two machines: one for the simulator 
> and one for the SM.

OK.  Those are good datapoints.

> I also used 64bit (x86_64) machines to avoid the ~3GB data limit.

We've got that covered...

Thanks.

Chris

-- 
Chris Elmquist          mailto:chrise at sgi.com      (651)683-3093
                        Silicon Graphics, Inc.     Eagan, MN


From halr at voltaire.com  Thu Dec 21 11:39:41 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 14:39:41 -0500
Subject: [openib-general] [PATCH] osm: fix simulator vendor not
 initializing complete mad address
In-Reply-To: <458ADD03.4020909@mellanox.co.il>
References: <458ADD03.4020909@mellanox.co.il>
Message-ID: <1166729980.4519.128300.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-21 at 14:14, Eitan Zahavi wrote:
> Hi Hal,
> 
> This fix resolves the issue I have seen on osmtest InformInfo flow.
> I am still not sure it is correct to compare sender address in the SA 
> InformInfo receiver by simply comparing the entire osm_mad_addr structure. 

I'm not sure either. I will look more into this.

> But anyway, at least the simulator now behaves like the rest of the stacks.
> 
> The fix makes sure we init the complete mad address structure before 
> copying the relevant data.
> 
> Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>

Thanks. Applied.

-- Hal
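
A side note on why zeroing the whole address structure matters when the
receiver compares sender addresses byte for byte: any field the sender left
unset keeps whatever was in memory before, so two otherwise identical
addresses can compare as different. The sketch below only illustrates that
effect. demo_mad_addr_t and its fields are hypothetical stand-ins and are not
the real osm_mad_addr layout; the 0xAA/0x55 fill simply simulates stale memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical address record; NOT the real osm_mad_addr_t layout. */
typedef struct demo_mad_addr {
    uint16_t dest_lid;
    uint8_t  path_bits;     /* never set by the fillers below */
    uint8_t  static_rate;   /* never set by the fillers below */
    uint32_t remote_qp;
    uint32_t remote_qkey;   /* never set by the fillers below */
} demo_mad_addr_t;

/* Buggy pattern: copies only the "relevant" fields, leaves the rest as-is. */
static void fill_partial(demo_mad_addr_t *dst, uint16_t lid, uint32_t qp)
{
    assert(dst != NULL);
    dst->dest_lid = lid;
    dst->remote_qp = qp;
}

/* Fixed pattern: zero the complete structure, then copy the fields. */
static void fill_complete(demo_mad_addr_t *dst, uint16_t lid, uint32_t qp)
{
    assert(dst != NULL);
    memset(dst, 0, sizeof(*dst));
    dst->dest_lid = lid;
    dst->remote_qp = qp;
}

/* Whole-structure comparison, as discussed in the thread. */
static int addr_equal(const demo_mad_addr_t *a, const demo_mad_addr_t *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}

/*
 * Build the same logical address twice on top of different stale memory.
 * Returns 1 if the two results compare equal, 0 otherwise.
 */
static int demo_compare(int zero_first)
{
    demo_mad_addr_t a, b;

    memset(&a, 0xAA, sizeof(a));    /* simulated stale contents */
    memset(&b, 0x55, sizeof(b));

    if (zero_first) {
        fill_complete(&a, 7, 1);
        fill_complete(&b, 7, 1);
    } else {
        fill_partial(&a, 7, 1);
        fill_partial(&b, 7, 1);
    }
    return addr_equal(&a, &b);
}
```

Calling demo_compare(0) exercises the partial fill and demo_compare(1) the
zero-first fill; only the latter makes the byte-wise comparison reliable.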


From halr at voltaire.com  Thu Dec 21 11:40:57 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 14:40:57 -0500
Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to
 return error when expected error does not happen
In-Reply-To: <458ADDAB.80301@mellanox.co.il>
References: <458ADDAB.80301@mellanox.co.il>
Message-ID: <1166730056.4519.128359.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-21 at 14:16, Eitan Zahavi wrote:
> Hi Hal,
> 
> I have found that for BAD InformInfo transactions, where osmtest expects
> an error from the SM, osmtest does not return an error to the calling
> procedure when the SM unexpectedly succeeds, so the test passes anyway.
> 
> EZ
> Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
> 
> ---
>  osm/osmtest/osmtest.c |   50 
> +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 48 insertions(+), 2 deletions(-)
> 
> diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
> index b1df333..e1c64ef 100644
> --- a/osm/osmtest/osmtest.c
> +++ b/osm/osmtest/osmtest.c
> @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* InformInfoRecord tests */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a BAD - Set Unsubscribe request\n"); 
>    memset( &inform_info_opt, 0, sizeof( inform_info_opt ) );
>    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, 
> IB_MAD_ATTR_INFORM_INFO_RECORD,

This patch is line wrapped here (and maybe other places as well) :-(

-- Hal

> -                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
> +                                       IB_MAD_METHOD_SET, 
> &inform_info_rec_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }
>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_
>               "IS EXPECTED ERROR ^^^^\n");
>    }
>  
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +              "osmtest_informinfo_request: InformInfoRecord "
> +              "Sending a Good - Empty GetTable request\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, 
> IB_MAD_ATTR_INFORM_INFO_RECORD,
> -                       IB_MAD_METHOD_GETTABLE,
> +                                                    IB_MAD_METHOD_GETTABLE,
>                                         &inform_info_rec_opt, &context );
>    if ( status != IB_SUCCESS )
>      goto Exit;
>  
>    /* InformInfo tests */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a BAD - Empty Get request "
> +           "(should fail with NO_RECORDS)\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
>                                         IB_MAD_METHOD_GET, &inform_info_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }
>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_
>               "IS EXPECTED ERROR ^^^^\n");
>    }
>  
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a BAD - Set Unsubscribe request\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
>                                         IB_MAD_METHOD_SET, &inform_info_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }
>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_
>    }
>  
>    /* Now subscribe */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - Set Subscribe request\n");
>    inform_info_opt.subscribe = TRUE;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Now unsubscribe (QPN needs to be 1 to work) */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - Set Unsubscribe request\n");
>    inform_info_opt.subscribe = FALSE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );
> @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Now subscribe again */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - Set Subscribe request\n");
>    inform_info_opt.subscribe = TRUE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );
> @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Subscribe over existing subscription */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - Set Subscribe (again) request\n");
>    inform_info_opt.qpn = 0;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_
>  
>    /* More InformInfoRecord tests */
>    /* RID lookup (with currently invalid enum) */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable by GID\n");
>    ib_gid_set_default( &inform_info_rec_opt.subscriber_gid,
>                        p_osmt->local_port.port_guid );
>    inform_info_rec_opt.subscriber_enum = 1;
> @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Enum lookup */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable (subsriber_enum == 0) request\n");
>    inform_info_rec_opt.subscriber_enum = 0;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, 
> IB_MAD_ATTR_INFORM_INFO_RECORD,
> @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Get all InformInfoRecords */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable (ALL records) request\n");
>    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, 
> IB_MAD_ATTR_INFORM_INFO_RECORD,
> @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Cleanup subscriptions before further testing */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - Set (cleanup all of them) request\n");
>    inform_info_opt.subscribe = FALSE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );
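
The pattern this patch applies to each BAD request can be shown in isolation.
In a negative test, success from the callee is the failure, so the status has
to be overwritten before jumping to the exit label; otherwise the caller sees
a success code. This is a minimal sketch only: demo_status_t, DEMO_SUCCESS,
DEMO_ERROR and the function names are hypothetical and are not osmtest or
complib identifiers.

```c
#include <assert.h>

typedef enum { DEMO_SUCCESS = 0, DEMO_ERROR = 1 } demo_status_t;

/* Hypothetical request: returns whatever status the "SM" answered with. */
static demo_status_t send_bad_request(demo_status_t sm_answer)
{
    return sm_answer;
}

/* Before the fix: an unexpected success reaches Exit unchanged. */
static demo_status_t negative_test_old(demo_status_t sm_answer)
{
    demo_status_t status = send_bad_request(sm_answer);

    if (status == DEMO_SUCCESS)
        goto Exit;              /* bug: caller is told DEMO_SUCCESS */

    /* the error was expected, so this step of the test passed */
    status = DEMO_SUCCESS;
Exit:
    return status;
}

/* After the fix: an unexpected success is turned into an error. */
static demo_status_t negative_test_fixed(demo_status_t sm_answer)
{
    demo_status_t status = send_bad_request(sm_answer);

    if (status == DEMO_SUCCESS) {
        status = DEMO_ERROR;    /* the BAD request must not succeed */
        goto Exit;
    }

    /* the error was expected, so this step of the test passed */
    status = DEMO_SUCCESS;
Exit:
    return status;
}

/* Usage: the old flow cannot fail, the fixed flow reports the bad SM answer. */
static void demo_self_check(void)
{
    assert(negative_test_old(DEMO_SUCCESS) == DEMO_SUCCESS);
    assert(negative_test_old(DEMO_ERROR) == DEMO_SUCCESS);
    assert(negative_test_fixed(DEMO_SUCCESS) == DEMO_ERROR);
    assert(negative_test_fixed(DEMO_ERROR) == DEMO_SUCCESS);
}
```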


From danb at voltaire.com  Thu Dec 21 11:53:03 2006
From: danb at voltaire.com (Dan Bar Dov)
Date: Thu, 21 Dec 2006 21:53:03 +0200
Subject: [openib-general] iSER target
References: <3857BB049D83424D9DB82753D37CEA551F3559@taurus.voltaire.com>
	<458A7FEA.7070707@dev.mellanox.co.il>
Message-ID: <3857BB049D83424D9DB82753D37CEA5509F469@taurus.voltaire.com>

Verbs. RIP kdapl.

Dan


-----Original Message-----
From: Aviram Gutman [mailto:aviram at dev.mellanox.co.il]
Sent: Thu 12/21/2006 2:36 PM
To: Dan Bar Dov
Cc: vishal; openib-general at openib.org
Subject: Re: [openib-general] iSER target
 
Are you planning to have the iSER target over verbs or kDAPL? Isn't the 
kDAPL development halted?

Aviram

Dan Bar Dov wrote:
> The iser target code in the gen2 branch is functional
> over kdapl. It requires an iscsi target code above it,
> however such an iscsi code is not open.
>
> It was opened as a precursor for an open-source iscsi/iser-target
> project. That project is still in its early stages, and the plan is
> to add iser-target support, loosely based on the open-iser-target 
> code, to the stgt project.
>
> Due to the above, there is no readme/installation guide.
>
> Dan
>
>   
>> -----Original Message-----
>> From: openib-general-bounces at openib.org 
>> [mailto:openib-general-bounces at openib.org] On Behalf Of vishal
>> Sent: Wednesday, December 20, 2006 4:03 AM
>> To: openib-general at openib.org
>> Subject: [openib-general] iSER target
>>
>> Hi,
>>
>>     I would like to confirm if the iSER target code in the gen2 branch
>> is functional. If yes, is there a readme/installation guide 
>> available...
>>
>> Thanks a lot!
>>
>> Vishal
>>
>>
>> _______________________________________________
>> openib-general mailing list
>> openib-general at openib.org
>> http://openib.org/mailman/listinfo/openib-general
>>
>> To unsubscribe, please visit 
>> http://openib.org/mailman/listinfo/openib-general
>>
>>
>>     
>


From halr at voltaire.com  Thu Dec 21 12:42:51 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 15:42:51 -0500
Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to
 return error when expected error does not happen
In-Reply-To: <1166730056.4519.128359.camel@hal.voltaire.com>
References: <458ADDAB.80301@mellanox.co.il>
	<1166730056.4519.128359.camel@hal.voltaire.com>
Message-ID: <1166733770.4519.131252.camel@hal.voltaire.com>

Hi again Eitan,

On Thu, 2006-12-21 at 14:40, Hal Rosenstock wrote:
> Hi Eitan,
> 
> On Thu, 2006-12-21 at 14:16, Eitan Zahavi wrote:
> > Hi Hal,
> > 
> > I have found that for BAD InformInfo transactions, where osmtest expects
> > an error from the SM, osmtest does not return an error to the calling
> > procedure when the SM unexpectedly succeeds, so the test passes anyway.
> > 
> > EZ
> > Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
> > 
> > ---
> >  osm/osmtest/osmtest.c |   50 
> > +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 files changed, 48 insertions(+), 2 deletions(-)
> > 
> > diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
> > index b1df333..e1c64ef 100644
> > --- a/osm/osmtest/osmtest.c
> > +++ b/osm/osmtest/osmtest.c
> > @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* InformInfoRecord tests */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a BAD - Set Unsubscribe request\n"); 
> >    memset( &inform_info_opt, 0, sizeof( inform_info_opt ) );
> >    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, 
> > IB_MAD_ATTR_INFORM_INFO_RECORD,
> 
> This patch is line wrapped here (and maybe other places as well) :-(

Never mind. I nursed it through. Other comments to follow...

-- Hal

> -- Hal
> 
> > -                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
> > +                                       IB_MAD_METHOD_SET, 
> > &inform_info_rec_opt,
> >                                         &context );
> >    if ( status == IB_SUCCESS )
> > +  {
> > +    status = IB_ERROR;
> >      goto Exit;
> > +  }
> >    else
> >    {
> >      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> > @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_
> >               "IS EXPECTED ERROR ^^^^\n");
> >    }
> >  
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +              "osmtest_informinfo_request: InformInfoRecord "
> > +              "Sending a Good - Empty GetTable request\n"); 
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, 
> > IB_MAD_ATTR_INFORM_INFO_RECORD,
> > -                       IB_MAD_METHOD_GETTABLE,
> > +                                                    IB_MAD_METHOD_GETTABLE,
> >                                         &inform_info_rec_opt, &context );
> >    if ( status != IB_SUCCESS )
> >      goto Exit;
> >  
> >    /* InformInfo tests */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a BAD - Empty Get request "
> > +           "(should fail with NO_RECORDS)\n"); 
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> >                                         IB_MAD_METHOD_GET, &inform_info_opt,
> >                                         &context );
> >    if ( status == IB_SUCCESS )
> > +  {
> > +    status = IB_ERROR;
> >      goto Exit;
> > +  }
> >    else
> >    {
> >      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> > @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_
> >               "IS EXPECTED ERROR ^^^^\n");
> >    }
> >  
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a BAD - Set Unsubscribe request\n"); 
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> >                                         IB_MAD_METHOD_SET, &inform_info_opt,
> >                                         &context );
> >    if ( status == IB_SUCCESS )
> > +  {
> > +    status = IB_ERROR;
> >      goto Exit;
> > +  }
> >    else
> >    {
> >      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> > @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_
> >    }
> >  
> >    /* Now subscribe */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - Set Subscribe request\n");
> >    inform_info_opt.subscribe = TRUE;
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> > @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Now unsubscribe (QPN needs to be 1 to work) */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - Set Unsubscribe request\n");
> >    inform_info_opt.subscribe = FALSE;
> >    inform_info_opt.qpn = 1;
> >    memset( &context, 0, sizeof( context ) );
> > @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Now subscribe again */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - Set Subscribe request\n");
> >    inform_info_opt.subscribe = TRUE;
> >    inform_info_opt.qpn = 1;
> >    memset( &context, 0, sizeof( context ) );
> > @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Subscribe over existing subscription */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - Set Subscribe (again) request\n");
> >    inform_info_opt.qpn = 0;
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> > @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_
> >  
> >    /* More InformInfoRecord tests */
> >    /* RID lookup (with currently invalid enum) */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - GetTable by GID\n");
> >    ib_gid_set_default( &inform_info_rec_opt.subscriber_gid,
> >                        p_osmt->local_port.port_guid );
> >    inform_info_rec_opt.subscriber_enum = 1;
> > @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Enum lookup */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - GetTable (subsriber_enum == 0) request\n");
> >    inform_info_rec_opt.subscriber_enum = 0;
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, 
> > IB_MAD_ATTR_INFORM_INFO_RECORD,
> > @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Get all InformInfoRecords */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - GetTable (ALL records) request\n");
> >    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
> >    memset( &context, 0, sizeof( context ) );
> >    status = osmtest_informinfo_request( p_osmt, 
> > IB_MAD_ATTR_INFORM_INFO_RECORD,
> > @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_
> >      goto Exit;
> >  
> >    /* Cleanup subscriptions before further testing */
> > +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> > +           "osmtest_informinfo_request: InformInfoRecord "
> > +           "Sending a Good - Set (cleanup all of them) request\n");
> >    inform_info_opt.subscribe = FALSE;
> >    inform_info_opt.qpn = 1;
> >    memset( &context, 0, sizeof( context ) );
> 
> 


From Ashish.Batwara at lsi.com  Thu Dec 21 13:39:00 2006
From: Ashish.Batwara at lsi.com (Batwara, Ashish)
Date: Thu, 21 Dec 2006 14:39:00 -0700
Subject: [openib-general] opensm
Message-ID: <01B9E81EECACE94DBBD0A556E768FB8A0115A12F@NAMAIL2.ad.lsil.com>

Thanks Vu,
This seems to be working.

Thanks
Ashish

-----Original Message-----
From: Vu Pham [mailto:vuhuong at mellanox.com] 
Sent: Wednesday, December 20, 2006 3:23 PM
To: Batwara, Ashish
Cc: Hal Rosenstock; ishai at mellanox.co.il; openib-general at openib.org
Subject: Re: [openib-general] opensm

Hi Ashish,

> Hi,
> Please see the information below
> 
> This is what I did:
> /etc/init.d/openibd start
> /etc/init.d/opensmd  start
> modprobe ib_srp
> 
> Issued the command /usr/local/ofed/sbin/ibsrpdm -c    to get the
> information about target and used them in 
> 

By default, without the -d option, ibsrpdm uses 
/dev/infiniband/umad0, which corresponds to port 1 of mthca0.

> echo id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

Using srp-mthca0-1 here is correct; however, in your previous 
email, where you reported *I am seeing the error "Got failed 
path rec status -110" on Linux console*, you used this command:

echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target

You used port 2 of mthca0 here (i.e. srp-mthca0-2); therefore, 
you got the path record failure.

Please retry:
0. Make sure you connect port 1 of the host HCA to the target, since 
you connect them directly. (Port 2 works as well, but then you have 
to use umad1 and srp-mthca0-2 in steps 1 and 2 below.)
1. ibsrpdm -c -d /dev/infiniband/umad0
2. echo the target string that ibsrpdm discovers to 
/sys/class/infiniband_srp/srp-mthca0-1/add_target

-vu
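
The pairing described above (the umad device passed to ibsrpdm -d and the
srp-<hca>-<port> directory that receives the target string must refer to the
same port) can be sketched as a small C helper. The mapping of umad0 to port 1
and umad1 to port 2 is taken from this message for a single two-port mthca0
and is an assumption, not a general rule; demo_srp_paths is a hypothetical
name used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the umad device path and the SRP add_target path for one port.
 * Assumption (from the message above, first HCA only): port N is served
 * by /dev/infiniband/umad(N-1).
 * Returns 0 on success, -1 on bad arguments or if a buffer is too small.
 */
static int demo_srp_paths(const char *hca, int port,
                          char *umad, size_t umad_len,
                          char *add_target, size_t add_len)
{
    int n;

    /* zero-length buffers are a caller bug, not a runtime condition */
    assert(umad_len > 0 && add_len > 0);

    if (hca == NULL || umad == NULL || add_target == NULL || port < 1)
        return -1;

    n = snprintf(umad, umad_len, "/dev/infiniband/umad%d", port - 1);
    if (n < 0 || (size_t)n >= umad_len)
        return -1;

    n = snprintf(add_target, add_len,
                 "/sys/class/infiniband_srp/srp-%s-%d/add_target",
                 hca, port);
    if (n < 0 || (size_t)n >= add_len)
        return -1;

    return 0;
}
```

For port 1 of mthca0 the helper yields /dev/infiniband/umad0 and
/sys/class/infiniband_srp/srp-mthca0-1/add_target, the pair used in steps 1
and 2 above.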
> 
> Yes, earlier I had a SilverStorm switch that was running the SM, but I
> have now taken that out and am connecting the target and host directly.
> 
> I have only one port connected between the host and the target. 
> The link is not stable because I keep restarting and stopping things
> again and again, since this does not seem to be working. I did not know
> the cause until I looked at the console log, which showed
> "Got failed path rec status -110". After seeing that, I searched on
> Google and found
> "https://lists.scl.ameslab.gov/pipermail/sc05-ib/2005-November/000383.html";
> it seems to be a bug on 64-bit machines.
> BTW, my Linux server is 64-bit.
> When I hooked up a 32-bit server running OFED-1.1, my target was
> discovered with the same procedure.
> 
> So the whole question is: what is the fix for the "Got failed path
> rec status -110" issue on 64-bit machines?
> 
> Thanks
> Ashish
> 
> -----Original Message-----
> From: Hal Rosenstock [mailto:halr at voltaire.com] 
> Sent: Tuesday, December 19, 2006 10:35 PM
> To: Batwara, Ashish
> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
> Subject: RE: [openib-general] opensm
> 
> On Tue, 2006-12-19 at 18:22, Batwara, Ashish wrote:
>> Hi,
>> Please look towards the end of the attached file.
> 
> What options are you starting opensm with ? What is the command line ?
> 
> Also, it looks like (at least at one point) you have another SM on the
> subnet. What is the make (vendor) for your switch ?
> 
> I see many "SM port is DOWN" messages. What is going on with this port?
> Why is the physical link not LinkUp and stable? That is the main issue
> and is likely why the SubnGet of NodeInfo is not being responded to.
> 
> -- Hal
> 
>> Thanks
>> Ashish
>>
>> -----Original Message-----
>> From: Hal Rosenstock [mailto:halr at voltaire.com] 
>> Sent: Tuesday, December 19, 2006 5:06 PM
>> To: Batwara, Ashish
>> Cc: Eitan Zahavi; ishai at mellanox.co.il; openib-general at openib.org
>> Subject: Re: [openib-general] opensm
>>
>> Ashish,
>>
>> On Tue, 2006-12-19 at 17:43, Batwara, Ashish wrote:
>>> Hi,
>>>
>>> Here is the info that you have asked for. The subnet manager is up
>>> now, with the port active, but the server is not able to discover
>>> the target. I am seeing the error "Got failed path rec status -110"
>>> on the Linux console.
>> That means the request for an SA PathRecord from the initiator to the
>> target failed (-110 is ETIMEDOUT). Are you sure the target is up
>> (ACTIVE) on the subnet ? If it is, can you send the opensm log ?
>>
>> -- Hal
>>
>>> Below are the output of different commands. I am using following to
>>> discover the target:
>>>
>>>  
>>>
>>> /etc/init.d/opensmd start
>>>
>>> /etc/init.d/openibd start
>>>
>>> modprobe ib_srp
>>>
>>> echo id_ext=200300A0B811C847,ioc_guid=00a0b8020022cd27,dgid=fe800000000000000002c9020022cd26,pkey=ffff,service_id=200300a0b811c847 > /sys/class/infiniband_srp/srp-mthca0-2/add_target
>>>  
>>>
>>>  
>>>
>>> [root at p49 ~]# ibv_devinfo
>>> hca_id: mthca0
>>>         fw_ver:                         5.1.400
>>>         node_guid:                      0002:c902:0022:cce0
>>>         sys_image_guid:                 0002:c902:0022:cce3
>>>         vendor_id:                      0x02c9
>>>         vendor_part_id:                 25218
>>>         hw_ver:                         0xA0
>>>         board_id:                       MT_0370130002
>>>         phys_port_cnt:                  2
>>>                 port:   1
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>                 port:   2
>>>                         state:                  PORT_ACTIVE (4)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             2048 (4)
>>>                         sm_lid:                 1
>>>                         port_lid:               1
>>>                         port_lmc:               0x00
>>>
>>> hca_id: mthca1
>>>         fw_ver:                         5.1.400
>>>         node_guid:                      0002:c902:0022:cd2c
>>>         sys_image_guid:                 0002:c902:0022:cd2f
>>>         vendor_id:                      0x02c9
>>>         vendor_part_id:                 25218
>>>         hw_ver:                         0xA0
>>>         board_id:                       MT_0370130002
>>>         phys_port_cnt:                  2
>>>                 port:   1
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>                 port:   2
>>>                         state:                  PORT_DOWN (1)
>>>                         max_mtu:                2048 (4)
>>>                         active_mtu:             512 (2)
>>>                         sm_lid:                 0
>>>                         port_lid:               0
>>>                         port_lmc:               0x00
>>>
>>>  
>>>
>>>  
>>>
>>> [root at p49 ~]# uname -a
>>>
>>> Linux p49.ks.lsil.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:24:31
>>> EDT 2006 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>>  
>>>
>>> [root at p49 ~]# cat /etc/infiniband/info
>>>
>>> #!/bin/bash
>>>
>>>  
>>>
>>> echo prefix=/usr/local/ofed
>>>
>>> echo Kernel=2.6.9-42.0.3.ELsmp
>>>
>>> echo
>>>
>>> echo "Configure options: --with-dapl --with-ipoibtools --with-libibcm
>>> --with-libibcommon --with-libibmad --with-libibumad --with-libibverbs
>>> --with-libipathverbs --with-libmthca --with-opensm --with-librdmacm
>>> --with-libsdp --with-openib-diags --with-srptools --with-mstflint
>>> --with-perftest --with-tvflash --with-ipath_inf-mod --with-ipoib-mod
>>> --with-mthca-mod --with-sdp-mod --with-srp-mod --with-core-mod
>>> --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod"
>>>
>>> echo
>>>
>>>  
>>>
>>> OFED Version: OFED-1.1
>>
>>
>>> Thanks
>>>
>>> Ashish
>>>
>>> -----Original Message-----
>>> From: Eitan Zahavi [mailto:eitan at mellanox.co.il] 
>>> Sent: Tuesday, December 19, 2006 5:18 AM
>>> To: Batwara, Ashish
>>> Cc: ishai at mellanox.co.il; openib-general at openib.org
>>> Subject: Re: [openib-general] opensm
>>>
>>>  
>>>
>>> Hi Ashish,
>>>
>>>  
>>>
>>> SRP people say they have no such error message.
>>>
>>> OpenSM does. So I take it back.
>>>
>>>  
>>>
>>> Ashish,
>>>
>>> Please provide more info:
>>>
>>>  
>>>
>>> 1. ibv_devinfo
>>>
>>> 2. Version of code you are using
>>>
>>> 3. Command line you use for starting opensm
>>>
>>> 4. /var/log/osm.log
>>>
>>>  
>>>
>>> Thanks and sorry for the confusion.
>>>
>>>  
>>>
>>> EZ
>>>
>>>  
>>>
>>> Eitan Zahavi wrote:
>>>
>>>> This is not an OpenSM issue.
>>>> Forwarded to the SRP people.
>>>> EZ
>>>> Batwara, Ashish wrote:
>>>>   
>>>>> Hi,
>>>>> I am trying to run opensm on a Linux server. It has two HCAs (4 ports)
>>>>> and is connected to an IB switch. The ibnodes command displays the
>>>>> information about the switch ports and HCA ports.
>>>>> When I start opensm, I see "Starting srp_daemon" in /var/log/messages
>>>>> for all 4 ports, immediately followed by "failed srp_daemon" for all
>>>>> the ports, and then it displays "SM Port is down".
>>>>> I tried several times and even rebooted the server a few times, but
>>>>> no luck.
>>>>> Does anybody know what this problem is?
>>>>> Thanks
>>>>> Ashish
>>>>> _______________________________________________
>>>>> openib-general mailing list
>>>>> openib-general at openib.org
>>>>> http://openib.org/mailman/listinfo/openib-general
>>>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general


From halr at voltaire.com  Thu Dec 21 12:55:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 15:55:12 -0500
Subject: [openib-general] [PATCH] osm: fix osmtest InformInfo flow to
 return error when expected error does not happen
In-Reply-To: <458ADDAB.80301@mellanox.co.il>
References: <458ADDAB.80301@mellanox.co.il>
Message-ID: <1166734511.4519.131808.camel@hal.voltaire.com>

Hi Eitan,

On Thu, 2006-12-21 at 14:16, Eitan Zahavi wrote:
> Hi Hal,
> 
> I have found that on BAD InformInfo transactions when the osmtest 
> expects an error from the SM
> it misses returning an error to the calling procedure which will make 
> osmtest pass the test.
> 
> EZ
> Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
> 
> ---
>  osm/osmtest/osmtest.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 48 insertions(+), 2 deletions(-)

Thanks. Applied (after fixing up the whitespace and adapting to the
latest osmtest/osmtest.c). Some additional comments below:

> 
> diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
> index b1df333..e1c64ef 100644
> --- a/osm/osmtest/osmtest.c
> +++ b/osm/osmtest/osmtest.c
> @@ -5813,14 +5813,20 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* InformInfoRecord tests */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a BAD - Set Unsubscribe request\n"); 
>    memset( &inform_info_opt, 0, sizeof( inform_info_opt ) );
>    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
> -                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
> +                                       IB_MAD_METHOD_SET, &inform_info_rec_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }

Dang; missed that again...  Yevgeny spanked me on this before...

>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5828,20 +5834,30 @@ osmtest_validate_against_db( IN osmtest_
>               "IS EXPECTED ERROR ^^^^\n");
>    }
>  
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +              "osmtest_informinfo_request: InformInfoRecord "
> +              "Sending a Good - Empty GetTable request\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
> -                       IB_MAD_METHOD_GETTABLE,
> +                                                    IB_MAD_METHOD_GETTABLE,
>                                         &inform_info_rec_opt, &context );
>    if ( status != IB_SUCCESS )
>      goto Exit;
>  
>    /* InformInfo tests */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a BAD - Empty Get request "
> +           "(should fail with NO_RECORDS)\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
>                                         IB_MAD_METHOD_GET, &inform_info_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }
>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5849,12 +5865,18 @@ osmtest_validate_against_db( IN osmtest_
>               "IS EXPECTED ERROR ^^^^\n");
>    }
>  
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a BAD - Set Unsubscribe request\n"); 
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
>                                         IB_MAD_METHOD_SET, &inform_info_opt,
>                                         &context );
>    if ( status == IB_SUCCESS )
> +  {
> +    status = IB_ERROR;
>      goto Exit;
> +  }
>    else
>    {
>      osm_log( &p_osmt->log, OSM_LOG_ERROR,
> @@ -5863,6 +5885,9 @@ osmtest_validate_against_db( IN osmtest_
>    }
>  
>    /* Now subscribe */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a Good - Set Subscribe request\n");
>    inform_info_opt.subscribe = TRUE;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> @@ -5872,6 +5897,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Now unsubscribe (QPN needs to be 1 to work) */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a Good - Set Unsubscribe request\n");
>    inform_info_opt.subscribe = FALSE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );
> @@ -5882,6 +5910,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Now subscribe again */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a Good - Set Subscribe request\n");
>    inform_info_opt.subscribe = TRUE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );
> @@ -5892,6 +5923,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Subscribe over existing subscription */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a Good - Set Subscribe (again) request\n");
>    inform_info_opt.qpn = 0;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
> @@ -5902,6 +5936,9 @@ osmtest_validate_against_db( IN osmtest_
>  
>    /* More InformInfoRecord tests */
>    /* RID lookup (with currently invalid enum) */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable by GID\n");
>    ib_gid_set_default( &inform_info_rec_opt.subscriber_gid,
>                        p_osmt->local_port.port_guid );
>    inform_info_rec_opt.subscriber_enum = 1;
> @@ -5913,6 +5950,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Enum lookup */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable (subsriber_enum == 0) request\n");
                                          subscriber_enum
>    inform_info_rec_opt.subscriber_enum = 0;
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
> @@ -5922,6 +5962,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Get all InformInfoRecords */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
> +           "Sending a Good - GetTable (ALL records) request\n");
>    memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
>    memset( &context, 0, sizeof( context ) );
>    status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
> @@ -5931,6 +5974,9 @@ osmtest_validate_against_db( IN osmtest_
>      goto Exit;
>  
>    /* Cleanup subscriptions before further testing */
> +  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
> +           "osmtest_informinfo_request: InformInfoRecord "
                                           ^^^^^^^^^^
                                           InformInfo
> +           "Sending a Good - Set (cleanup all of them) request\n");
>    inform_info_opt.subscribe = FALSE;
>    inform_info_opt.qpn = 1;
>    memset( &context, 0, sizeof( context ) );

-- Hal
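
For readers following the thread, the fix above boils down to one rule for
negative tests: if a request that is supposed to be rejected comes back with
success, the test must report a failure rather than jump to Exit with a
success status. Below is a minimal sketch of that rule. It is not the osmtest
code: status_t, request_fn and run_negative_test are hypothetical names, and
unlike osmtest (which simply overwrites status with the next request) the
sketch maps the expected error to success to keep the example small.

```c
/* Hypothetical status codes standing in for ib_api_status_t values. */
typedef enum { ST_SUCCESS = 0, ST_ERROR = 1 } status_t;

/* Hypothetical request callback: returns ST_SUCCESS or an error code. */
typedef status_t (*request_fn)(void);

/*
 * Run a request that is expected to fail.
 * An unexpected success is converted into ST_ERROR so the caller cannot
 * mistake it for a passing test; an error from the request is the
 * expected outcome and is reported as ST_SUCCESS.
 */
static status_t run_negative_test(request_fn send_bad_request)
{
    status_t status = send_bad_request();

    if (status == ST_SUCCESS)
        return ST_ERROR;    /* the SM accepted a request it should reject */

    return ST_SUCCESS;      /* expected error observed */
}

/* Two stub requests used only to exercise the helper. */
static status_t always_ok(void)   { return ST_SUCCESS; }
static status_t always_fail(void) { return ST_ERROR; }
```

With these stubs, run_negative_test(always_ok) reports a failure and
run_negative_test(always_fail) reports a pass, which is the behaviour the
patch restores for the BAD InformInfo cases.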


From halr at voltaire.com  Thu Dec 21 12:59:42 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 21 Dec 2006 15:59:42 -0500
Subject: [openib-general] [PATCH]osmtest/osmtest.c: Add more
 InformInfo/InformInfoRecord tests
Message-ID: <1166734781.4519.132032.camel@hal.voltaire.com>

osmtest/osmtest.c: Add more InformInfo/InformInfoRecord tests

Signed-off-by: Hal Rosenstock <halr at voltaire.com>
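
As a reading aid for the osmtest_informinfo_request() hunk below: the trap
number is only placed in the query, and its component-mask bit only set, when
the caller supplied a non-zero value. The following is a rough sketch of that
idea under stated assumptions, not the osmtest code. The struct, the bit
position DEMO_COMPMASK_TRAPNUM and the helper names are all hypothetical, and
the big-endian store is written out by hand instead of using cl_hton16().

```c
#include <stdint.h>

/* Hypothetical component-mask bit; the real value comes from the IB headers. */
#define DEMO_COMPMASK_TRAPNUM (1ULL << 1)

/* Hypothetical, simplified query: trap number in wire (big-endian) order. */
typedef struct {
    uint8_t  trap_num[2];
    uint64_t comp_mask;
} demo_inform_query_t;

/* Store a 16-bit value in network byte order, independent of host endianness. */
static void demo_put_be16(uint8_t out[2], uint16_t v)
{
    out[0] = (uint8_t)(v >> 8);
    out[1] = (uint8_t)(v & 0xFF);
}

/*
 * Add the trap number to the query only when one was requested.
 * A value of 0 means "not specified", so neither the field nor the
 * component-mask bit is touched in that case.
 */
static void demo_set_trap(demo_inform_query_t *q, uint16_t trap_host_order)
{
    if (trap_host_order != 0) {
        demo_put_be16(q->trap_num, trap_host_order);
        q->comp_mask |= DEMO_COMPMASK_TRAPNUM;
    }
}
```

One consequence of treating 0 as "not specified" is that a trap number of 0
cannot be selected through this path.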

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index 355a6f9..6afa899 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -73,6 +73,7 @@ typedef struct _osmtest_inform_info
 {
   boolean_t subscribe;
   ib_net32_t qpn;
+  ib_net16_t trap;
 } osmtest_inform_info_t;
 
 typedef struct _osmtest_inform_info_rec
@@ -4890,6 +4891,11 @@ osmtest_informinfo_request(
       rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8;
       user.comp_mask |= IB_IIR_COMPMASK_QPN;
     }
+    if (p_inform_info_opt->trap)
+    {
+      rec.g_or_v.generic.trap_num = cl_hton16(p_inform_info_opt->trap);
+      user.comp_mask |= IB_IIR_COMPMASK_TRAPNUMB;
+    }
     user.p_attr = &rec;
   }
   user.method = method;
@@ -5973,12 +5979,63 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  /* Another subscription */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+	   "osmtest_informinfo_request: InformInfo "
+	   "Sending another Good - Set Subscribe (again) request\n");
+  inform_info_opt.qpn = 0;
+  inform_info_opt.trap = 0x1234;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET, &inform_info_opt,
+                                       &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Get all InformInfoRecords again */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+	   "osmtest_informinfo_request: InformInfoRecord "
+	   "Sending a Good - GetTable (ALL records) request\n");
+  memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE,
+                                       &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
   /* Cleanup subscriptions before further testing */
+  /* Does order of deletion matter ? Test this !!! */
   osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
 	   "osmtest_informinfo_request: InformInfo "
-	   "Sending a Good - Set (cleanup all of them) request\n");
+	   "Sending a Good - Set (cleanup) request\n");
+  inform_info_opt.subscribe = FALSE;
+  inform_info_opt.qpn = 1;
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
+                                       IB_MAD_METHOD_SET,
+                                       &inform_info_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  /* Get all InformInfoRecords again */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+	   "osmtest_informinfo_request: InformInfoRecord "
+	   "Sending a Good - GetTable (ALL records) request\n");
+  memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE,
+                                       &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+	   "osmtest_informinfo_request: InformInfo "
+	   "Sending a Good - Set (cleanup) request\n");
   inform_info_opt.subscribe = FALSE;
   inform_info_opt.qpn = 1;
+  inform_info_opt.trap = 0;
   memset( &context, 0, sizeof( context ) );
   status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO,
                                        IB_MAD_METHOD_SET,
@@ -5986,6 +6043,18 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  /* Get all InformInfoRecords a final time */
+  osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+	   "osmtest_informinfo_request: InformInfoRecord "
+	   "Sending a Good - GetTable (ALL records) request\n");
+  memset( &inform_info_rec_opt, 0, sizeof( inform_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_informinfo_request( p_osmt, IB_MAD_ATTR_INFORM_INFO_RECORD,
+                                       IB_MAD_METHOD_GETTABLE,
+                                       &inform_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
   if (lmc != 0)
   {
     test_lid = cl_ntoh16( p_osmt->local_port.lid + 1 );


From eitan at sw053.yok.mtl.com  Thu Dec 21 21:10:01 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Fri, 22 Dec 2006 07:10:01 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-22:normal completion
Message-ID: <200612220510.kBM5A1pj018761@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Thu_Dec_21_14:36:22_2006 c3fcbb MOD_FILES=1
ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
Total=396 Pass=395 Fail=1

Pass:
54 Stability IS1-16.topo
54 Pkey IS1-16.topo
54 OsmStress IS1-16.topo
54 Multicast IS1-16.topo
54 LidMgr IS1-16.topo
18 Stability IS3-loop.topo
18 Stability IS3-128.topo
18 Pkey IS3-128.topo
18 OsmStress IS3-128.topo
18 Multicast IS3-loop.topo
18 Multicast IS3-128.topo
17 LidMgr IS3-128.topo

Failures:
1 LidMgr IS3-128.topo


From halr at voltaire.com  Fri Dec 22 06:25:59 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Dec 2006 09:25:59 -0500
Subject: [openib-general] [PATCH 2/2] osmtest/osmtest.c: More SA
	SMInfoRecord tests
Message-ID: <1166797547.4519.181603.camel@hal.voltaire.com>

osmtest/osmtest.c: More SA SMInfoRecord tests

Signed-off-by: Hal Rosenstock <halr at voltaire.com>
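
A note on the pri_state handling in the diff below: priority and SM state
share one byte, with the priority in the upper four bits and the state in the
lower four. The sketch that follows shows that packing on its own. The helper
names are hypothetical and exist only for illustration; the masks and the
shift are the ones used in the patch.

```c
#include <stdint.h>

/* Pack a 4-bit priority (high nibble) and a 4-bit SM state (low nibble). */
static uint8_t demo_pack_pri_state(uint8_t priority, uint8_t sm_state)
{
    return (uint8_t)(((priority & 0x0F) << 4) | (sm_state & 0x0F));
}

/* Recover the priority from the high nibble. */
static uint8_t demo_get_priority(uint8_t pri_state)
{
    return (uint8_t)(pri_state >> 4);
}

/* Recover the SM state from the low nibble. */
static uint8_t demo_get_state(uint8_t pri_state)
{
    return (uint8_t)(pri_state & 0x0F);
}
```

Because each field is masked to four bits, out-of-range inputs are truncated
rather than spilling into the neighbouring field, which is why the test loops
in the patch stay within 1..15 for priority.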

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index 6afa899..0ccc06c 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -69,6 +69,14 @@
 #define POOL_MIN_ITEMS  64
 #define GUID_ARRAY_SIZE 64
 
+typedef struct _osmtest_sm_info_rec
+{
+  ib_net64_t sm_guid;
+  ib_net16_t lid;
+  uint8_t priority;
+  uint8_t sm_state;
+} osmtest_sm_info_rec_t;
+
 typedef struct _osmtest_inform_info
 {
   boolean_t subscribe;
@@ -4756,9 +4764,11 @@ osmtest_get_lft_rec_by_lid( IN osmtest_t
 
 /**********************************************************************
  **********************************************************************/
-ib_api_status_t
+static ib_api_status_t
 osmtest_sminfo_record_request(
         IN osmtest_t * const p_osmt,
+        IN uint8_t method,
+        IN void *p_options,
         IN OUT osmtest_req_context_t * const p_context )
 {
   ib_api_status_t status = IB_SUCCESS;
@@ -4766,6 +4776,7 @@ osmtest_sminfo_record_request(
   osmv_query_req_t req;
   ib_sminfo_record_t record;
   ib_mad_t *p_mad;
+  osmtest_sm_info_rec_t *p_sm_info_opt;
 
   OSM_LOG_ENTER( &p_osmt->log, osmtest_sminfo_record_request );
 
@@ -4783,6 +4794,29 @@ osmtest_sminfo_record_request(
   p_context->p_osmt = p_osmt;
   user.attr_id = IB_MAD_ATTR_SMINFO_RECORD;
   user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) );
+  p_sm_info_opt = p_options;
+  if (p_sm_info_opt->sm_guid != 0)
+  {
+    record.sm_info.guid = p_sm_info_opt->sm_guid;
+    user.comp_mask |= IB_SMIR_COMPMASK_GUID;
+  }
+  if (p_sm_info_opt->lid != 0)
+  {
+    record.lid = p_sm_info_opt->lid;
+    user.comp_mask |= IB_SMIR_COMPMASK_LID;
+  }
+  if (p_sm_info_opt->priority != 0)
+  {
+    record.sm_info.pri_state = (p_sm_info_opt->priority & 0x0F)<<4;
+    user.comp_mask |= IB_SMIR_COMPMASK_PRIORITY;
+  }
+  if (p_sm_info_opt->sm_state != 0)
+  {
+    record.sm_info.pri_state |= p_sm_info_opt->sm_state & 0x0F;
+    user.comp_mask |= IB_SMIR_COMPMASK_SMSTATE;
+  }
+
+  user.method = method;
   user.p_attr = &record;
 
   req.query_type = OSMV_QUERY_USER_DEFINED;
@@ -4808,9 +4842,12 @@ osmtest_sminfo_record_request(
 
   if( status != IB_SUCCESS )
   {
-    osm_log( &p_osmt->log, OSM_LOG_ERROR,
-             "osmtest_sminfo_record_request: ERR 008D: "
-             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    if (status != IB_INVALID_PARAMETER)
+    {
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_sminfo_record_request: ERR 008D: "
+               "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    }
     if( status == IB_REMOTE_ERROR )
     {
       p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw );
@@ -4831,7 +4868,7 @@ osmtest_sminfo_record_request(
 
 /**********************************************************************
  **********************************************************************/
-ib_api_status_t
+static ib_api_status_t
 osmtest_informinfo_request(
 	IN osmtest_t * const p_osmt,
 	IN ib_net16_t attr_id,
@@ -5553,6 +5590,7 @@ osmtest_validate_against_db( IN osmtest_
 {
   ib_api_status_t status = IB_SUCCESS;
   ib_gid_t portgid, mgid;
+  osmtest_sm_info_rec_t sm_info_rec_opt;
   osmtest_inform_info_t inform_info_opt;
   osmtest_inform_info_rec_t inform_info_rec_opt;
 #ifdef VENDOR_RMPP_SUPPORT
@@ -5563,6 +5601,7 @@ osmtest_validate_against_db( IN osmtest_
 #ifdef DUAL_SIDED_RMPP
   osmv_multipath_req_t request;
 #endif
+  int i; 
 #endif
 
   OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db );
@@ -5812,12 +5851,71 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
-  /* SMInfoRecord test */
+  /* SMInfoRecord tests */
+  memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_SET,
+					  &sm_info_rec_opt, &context );
+  if ( status == IB_SUCCESS )
+  {
+    status = IB_ERROR;
+    goto Exit;
+  }
+  else
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+	     "osmtest_sminfo_request: "
+	     "IS EXPECTED ERROR ^^^^\n");
+  }
+
+  memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE,
+					  &sm_info_rec_opt, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+  sm_info_rec_opt.lid = test_lid;	/* local LID */
   memset( &context, 0, sizeof( context ) );
-  status = osmtest_sminfo_record_request( p_osmt, &context );
+  status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE,
+					  &sm_info_rec_opt, &context );
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  if (portguid != 0)
+  {
+    memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+    sm_info_rec_opt.sm_guid = portguid;	/* local GUID */
+    memset( &context, 0, sizeof( context ) );
+    status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE,
+					    &sm_info_rec_opt, &context );
+    if ( status != IB_SUCCESS )
+      goto Exit;
+  }
+
+  for (i = 1; i < 16; i++)
+  {
+    memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+    sm_info_rec_opt.priority = i;
+    memset( &context, 0, sizeof( context ) );
+    status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE,
+					    &sm_info_rec_opt, &context );
+    if ( status != IB_SUCCESS )
+      goto Exit;
+  }
+
+  for (i = 1; i < 4; i++)
+  {
+    memset( &sm_info_rec_opt, 0, sizeof( sm_info_rec_opt ) );
+    sm_info_rec_opt.sm_state = i;
+    memset( &context, 0, sizeof( context ) );
+    status = osmtest_sminfo_record_request( p_osmt, IB_MAD_METHOD_GETTABLE,
+					    &sm_info_rec_opt, &context );
+    if ( status != IB_SUCCESS )
+      goto Exit;
+  }
+
   /* InformInfoRecord tests */
   osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
 	   "osmtest_informinfo_request: InformInfoRecord "


From halr at voltaire.com  Fri Dec 22 06:25:46 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Dec 2006 09:25:46 -0500
Subject: [openib-general] [PATCH 1/2] OpenSM: Better SA SMInfoRecord support
Message-ID: <1166797545.4519.181601.camel@hal.voltaire.com>

OpenSM: Better SA SMInfoRecord support

Signed-off-by: Hal Rosenstock <halr at voltaire.com>
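
For orientation, the core of __osm_sa_smir_by_comp_mask() in the diff below
is a filter: each component-mask bit set in the query enables one comparison,
and a record is returned only if every enabled comparison matches. The sketch
below illustrates that filter in isolation. It is an assumption-laden
simplification, not the OpenSM code: the struct, the DEMO_MASK_* bit
positions and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical component-mask bits; the real ones are IB_SMIR_COMPMASK_*. */
#define DEMO_MASK_GUID     (1ULL << 0)
#define DEMO_MASK_PRIORITY (1ULL << 1)
#define DEMO_MASK_STATE    (1ULL << 2)

/* Hypothetical, reduced SMInfo: GUID plus the packed priority/state byte. */
typedef struct {
    uint64_t guid;
    uint8_t  pri_state;   /* priority in high nibble, state in low nibble */
} demo_sminfo_t;

/*
 * Return true when the candidate matches the query on every field that
 * the component mask selects. Fields whose bit is clear are ignored,
 * so an empty mask matches every candidate.
 */
static bool demo_smir_matches(const demo_sminfo_t *candidate,
                              const demo_sminfo_t *query,
                              uint64_t comp_mask)
{
    if ((comp_mask & DEMO_MASK_GUID) && candidate->guid != query->guid)
        return false;

    if ((comp_mask & DEMO_MASK_PRIORITY) &&
        (candidate->pri_state >> 4) != (query->pri_state >> 4))
        return false;

    if ((comp_mask & DEMO_MASK_STATE) &&
        (candidate->pri_state & 0x0F) != (query->pri_state & 0x0F))
        return false;

    return true;
}
```

An empty mask therefore behaves like the "get all records" GetTable request
exercised by osmtest.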

diff --git a/osm/include/opensm/osm_sa_sminfo_record.h b/osm/include/opensm/osm_sa_sminfo_record.h
index 60bfe82..cafc09b 100644
--- a/osm/include/opensm/osm_sa_sminfo_record.h
+++ b/osm/include/opensm/osm_sa_sminfo_record.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004, 2005 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  *
@@ -85,7 +85,6 @@ BEGIN_C_DECLS
 *	Ranjit Pandit, Intel
 *
 *********/
-
 /****s* OpenSM: SM Info Receiver/osm_smir_rcv_t
 * NAME
 *	osm_smir_rcv_t
@@ -106,6 +105,7 @@ typedef struct _osm_smir
 	osm_mad_pool_t*				p_mad_pool;
 	osm_log_t*				p_log;
 	cl_plock_t*				p_lock;
+	cl_qlock_pool_t				pool;
 } osm_smir_rcv_t;
 /*
 * FIELDS
diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c
index 816e10d..440d773 100644
--- a/osm/opensm/osm_sa_class_port_info.c
+++ b/osm/opensm/osm_sa_class_port_info.c
@@ -197,7 +197,6 @@ __osm_cpi_rcv_respond(
      SwitchInfoRecord,
      RandomForwardingTableRecord,
      MulticastForwardingTableRecord,
-     SMInfoRecord (partial support),
      ServiceAssociationRecord
      other optional records supported "under the table"
 
diff --git a/osm/opensm/osm_sa_sminfo_record.c b/osm/opensm/osm_sa_sminfo_record.c
index 62467c1..7a82b84 100644
--- a/osm/opensm/osm_sa_sminfo_record.c
+++ b/osm/opensm/osm_sa_sminfo_record.c
@@ -33,7 +33,6 @@
  *
  */
 
-
 /*
  * Abstract:
  *    Implementation of osm_smir_rcv_t.
@@ -68,6 +67,25 @@
 #include <opensm/osm_msgdef.h>
 #include <opensm/osm_port.h>
 #include <opensm/osm_pkey.h>
+#include <opensm/osm_remote_sm.h>
+
+#define OSM_SMIR_RCV_POOL_MIN_SIZE     32
+#define OSM_SMIR_RCV_POOL_GROW_SIZE    32
+
+typedef  struct _osm_smir_item
+{
+  cl_pool_item_t           pool_item;
+  ib_sminfo_record_t       rec;
+} osm_smir_item_t;
+
+typedef  struct _osm_smir_search_ctxt
+{
+  const ib_sminfo_record_t* p_rcvd_rec;
+  ib_net64_t               comp_mask;
+  cl_qlist_t*              p_list;
+  osm_smir_rcv_t*          p_rcv;
+  const osm_physp_t*       p_req_physp;
+} osm_smir_search_ctxt_t;
 
 /**********************************************************************
  **********************************************************************/
@@ -76,6 +94,7 @@ osm_smir_rcv_construct(
   IN osm_smir_rcv_t* const p_rcv )
 {
   memset( p_rcv, 0, sizeof(*p_rcv) );
+  cl_qlock_pool_construct( &p_rcv->pool );
 }
 
 /**********************************************************************
@@ -87,7 +106,7 @@ osm_smir_rcv_destroy(
   CL_ASSERT( p_rcv );
 
   OSM_LOG_ENTER( p_rcv->p_log, osm_smir_rcv_destroy );
-
+  cl_qlock_pool_destroy( &p_rcv->pool );
   OSM_LOG_EXIT( p_rcv->p_log );
 }
 
@@ -116,26 +135,155 @@ osm_smir_rcv_init(
   p_rcv->p_stats = p_stats;
   p_rcv->p_mad_pool = p_mad_pool;
 
+  status = cl_qlock_pool_init( &p_rcv->pool,
+                               OSM_SMIR_RCV_POOL_MIN_SIZE,
+                               0,
+                               OSM_SMIR_RCV_POOL_GROW_SIZE,
+                               sizeof(osm_smir_item_t),
+                               NULL, NULL, NULL );
+
   OSM_LOG_EXIT( p_rcv->p_log );
   return( status );
 }
 
+static ib_api_status_t
+__osm_smir_rcv_new_smir(
+  IN osm_smir_rcv_t*         const p_rcv,
+  IN const osm_port_t*       const p_port,
+  IN cl_qlist_t*             const p_list,
+  IN ib_net64_t              const guid,
+  IN ib_net32_t              const act_count,
+  IN uint8_t                 const pri_state,
+  IN const osm_physp_t*      const p_req_physp )
+{
+  osm_smir_item_t*           p_rec_item;
+  ib_api_status_t            status = IB_SUCCESS;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_smir_rcv_new_smir );
+
+  p_rec_item = (osm_smir_item_t*)cl_qlock_pool_get( &p_rcv->pool );
+  if( p_rec_item == NULL )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_smir_rcv_new_smir: ERR 2801: "
+             "cl_qlock_pool_get failed\n" );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
+  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_smir_rcv_new_smir: "
+             "New SMInfo: GUID 0x%016" PRIx64 "\n",
+             cl_ntoh64( guid )
+             );
+  }
+
+  memset( &p_rec_item->rec, 0, sizeof(ib_sminfo_record_t) );
+
+  p_rec_item->rec.lid = osm_port_get_base_lid( p_port );
+  p_rec_item->rec.sm_info.guid = guid;
+  p_rec_item->rec.sm_info.act_count = act_count;
+  p_rec_item->rec.sm_info.pri_state = pri_state;
+
+  cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+  return( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_sa_smir_by_comp_mask(
+  IN osm_smir_rcv_t*        const p_rcv,
+  IN const osm_remote_sm_t* const p_rem_sm,
+  osm_smir_search_ctxt_t*   const p_ctxt )
+{
+  const ib_sminfo_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec;
+  const osm_physp_t*        const p_req_physp = p_ctxt->p_req_physp;
+  ib_net64_t                const comp_mask = p_ctxt->comp_mask;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_sa_smir_by_comp_mask );
+
+  if ( comp_mask & IB_SMIR_COMPMASK_GUID )
+  {
+    if ( p_rem_sm->smi.guid != p_rcvd_rec->sm_info.guid )
+      goto Exit;
+  }
+
+  if ( comp_mask & IB_SMIR_COMPMASK_PRIORITY )
+  {
+    if ( ib_sminfo_get_priority( &p_rem_sm->smi ) !=
+         ib_sminfo_get_priority( &p_rcvd_rec->sm_info ) )
+      goto Exit;
+  }
+
+  if ( comp_mask & IB_SMIR_COMPMASK_SMSTATE )
+  {
+    if ( ib_sminfo_get_state( &p_rem_sm->smi ) !=
+         ib_sminfo_get_state( &p_rcvd_rec->sm_info ) )
+      goto Exit;
+  }
+
+  /* Implement any other needed search cases */
+
+  __osm_smir_rcv_new_smir( p_rcv, p_rem_sm->p_port, p_ctxt->p_list,
+                           p_rem_sm->smi.guid,
+                           p_rem_sm->smi.act_count,
+                           p_rem_sm->smi.pri_state,
+                           p_req_physp );
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_sa_smir_by_comp_mask_cb(
+  IN cl_map_item_t*        const p_map_item,
+  IN void*                 context )
+{
+  const osm_remote_sm_t*   const p_rem_sm = (osm_remote_sm_t*)p_map_item;
+  osm_smir_search_ctxt_t*  const p_ctxt = (osm_smir_search_ctxt_t *)context;
+
+  __osm_sa_smir_by_comp_mask( p_ctxt->p_rcv, p_rem_sm, p_ctxt );
+}
+
 /**********************************************************************
  **********************************************************************/
 void
 osm_smir_rcv_process(
-  IN osm_smir_rcv_t*       const p_rcv,
-  IN const osm_madw_t*     const p_madw )
+  IN osm_smir_rcv_t*        const p_rcv,
+  IN const osm_madw_t*      const p_madw )
 {
-  const ib_sminfo_record_t*   p_sminfo_rec;
-  ib_sminfo_record_t*      p_resp_sminfo_rec;
-  const ib_sa_mad_t*       p_sa_mad;
-  ib_sa_mad_t*             p_resp_sa_mad;
-  osm_madw_t*              p_resp_madw;
-  ib_api_status_t          status;
-  osm_physp_t*             p_req_physp;
-  ib_net64_t               local_guid;
-  osm_port_t*              local_port;
+  const ib_sa_mad_t*        p_rcvd_mad;
+  const ib_sminfo_record_t* p_rcvd_rec;
+  const cl_qmap_t*          p_tbl;
+  const osm_port_t*         p_port = NULL;
+  const ib_sm_info_t*       p_smi;
+  cl_qlist_t                rec_list;
+  osm_madw_t*               p_resp_madw;
+  ib_sa_mad_t*              p_resp_sa_mad;
+  ib_sminfo_record_t*       p_resp_rec;
+  uint32_t                  num_rec, pre_trim_num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  uint32_t                  trim_num_rec;
+#endif
+  uint32_t                  i;
+  osm_smir_search_ctxt_t    context;
+  osm_smir_item_t*          p_rec_item;
+  ib_api_status_t           status = IB_SUCCESS;
+  ib_net64_t                comp_mask;
+  ib_net64_t                port_guid;
+  osm_physp_t*              p_req_physp;
+  osm_port_t*               local_port;
+  osm_remote_sm_t*          p_rem_sm;
+  cl_qmap_t*                p_sm_guid_tbl;
+  uint8_t                   pri_state;
 
   CL_ASSERT( p_rcv );
 
@@ -143,19 +291,20 @@ osm_smir_rcv_process(
 
   CL_ASSERT( p_madw );
 
-  p_sa_mad = osm_madw_get_sa_mad_ptr( p_madw );
-  p_sminfo_rec = (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( p_sa_mad );
+  p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw );
+  p_rcvd_rec = (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad );
+  comp_mask = p_rcvd_mad->comp_mask;
 
-  CL_ASSERT( p_sa_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD );
+  CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_SMINFO_RECORD );
 
   /* we only support SubnAdmGet and SubnAdmGetTable methods */
-  if ( (p_sa_mad->method != IB_MAD_METHOD_GET) &&
-       (p_sa_mad->method != IB_MAD_METHOD_GETTABLE) )
+  if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) &&
+       (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) )
   {
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
              "osm_smir_rcv_process: ERR 2804: "
              "Unsupported Method (%s)\n",
-             ib_get_sa_method_str( p_sa_mad->method ) );
+             ib_get_sa_method_str( p_rcvd_mad->method ) );
     osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR );
     goto Exit;
   }
@@ -173,72 +322,251 @@ osm_smir_rcv_process(
   }
 
   if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
-    osm_dump_sm_info_record( p_rcv->p_log, p_sminfo_rec, OSM_LOG_DEBUG );
+    osm_dump_sm_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG );
 
-  /* check the matching of pkeys with the local physp the SM is on. */
-  local_guid = p_rcv->p_subn->sm_port_guid;
-  local_port = (osm_port_t*)cl_qmap_get( &p_rcv->p_subn->port_guid_tbl, local_guid );
-  if (FALSE ==
-      osm_physp_share_pkey( p_rcv->p_log, p_req_physp,
-                            osm_port_get_default_phys_ptr( local_port ) ) )
+  p_tbl = &p_rcv->p_subn->sm_guid_tbl;
+  p_smi = &p_rcvd_rec->sm_info;
+
+  cl_qlist_init( &rec_list );
+
+  context.p_rcvd_rec = p_rcvd_rec;
+  context.p_list = &rec_list;
+  context.comp_mask = p_rcvd_mad->comp_mask;
+  context.p_rcv = p_rcv;
+  context.p_req_physp = p_req_physp;
+
+  cl_plock_acquire( p_rcv->p_lock );
+
+  /*
+    If the user specified a LID, it obviously narrows our
+    work load, since we don't have to search every port
+  */
+  if( comp_mask & IB_SMIR_COMPMASK_LID )
   {
-    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
-            "osm_smir_rcv_process: ERR 2805: "
-            "Cannot get SMInfo record due to pkey violation\n" );
+    status = osm_get_port_by_base_lid( p_rcv->p_subn, p_rcvd_rec->lid, &p_port );
+    if ( ( status != IB_SUCCESS ) || ( p_port == NULL ) )
+    {
+      status = IB_NOT_FOUND;
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "osm_smir_rcv_process: ERR 2806: "
+               "No port found with LID 0x%x\n",
+               cl_ntoh16(p_rcvd_rec->lid) );
+    }
+  }
+
+  if ( status == IB_SUCCESS )
+  {
+    /* Handle our own SM first */
+    local_port = osm_get_port_by_guid( p_rcv->p_subn, p_rcv->p_subn->sm_port_guid );
+    if ( !local_port )
+    {
+      cl_plock_release( p_rcv->p_lock );
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "osm_smir_rcv_process: ERR 2809: "
+               "No port found with GUID 0x%016" PRIx64 "\n",
+               cl_ntoh64(p_rcv->p_subn->sm_port_guid ) );
+      goto Exit;
+    }
+
+    if ( !p_port || local_port == p_port )
+    {
+      if (FALSE ==
+          osm_physp_share_pkey( p_rcv->p_log, p_req_physp,
+                                osm_port_get_default_phys_ptr( local_port ) ) )
+      {
+        cl_plock_release( p_rcv->p_lock );
+        osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                 "osm_smir_rcv_process: ERR 2805: "
+                 "Cannot get SMInfo record due to pkey violation\n" );
+        goto Exit;
+      }
+
+      /* Check that other search components specified match */
+      if ( comp_mask & IB_SMIR_COMPMASK_GUID )
+      {
+        if ( p_rcv->p_subn->sm_port_guid != p_smi->guid )
+          goto Remotes;
+      }
+      if ( comp_mask & IB_SMIR_COMPMASK_PRIORITY )
+      {
+        if ( p_rcv->p_subn->opt.sm_priority != ib_sminfo_get_priority( p_smi ) )
+          goto Remotes;
+      }
+      if ( comp_mask & IB_SMIR_COMPMASK_SMSTATE )
+      {
+        if ( p_rcv->p_subn->sm_state != ib_sminfo_get_state( p_smi ) )
+          goto Remotes;
+      }
+
+      /* Now, add local SMInfo to list */
+      pri_state = p_rcv->p_subn->sm_state & 0x0F;
+      pri_state |= (p_rcv->p_subn->opt.sm_priority & 0x0F) << 4;
+      __osm_smir_rcv_new_smir( p_rcv, local_port, context.p_list, 
+                               p_rcv->p_subn->sm_port_guid,
+                               cl_ntoh32( p_rcv->p_stats->qp0_mads_sent ),
+                               pri_state,
+                               p_req_physp );
+    }
+
+ Remotes:
+    if( p_port && p_port != local_port )
+    {
+      /* Find remote SM corresponding to p_port */
+      port_guid = osm_port_get_guid( p_port );
+      p_sm_guid_tbl = &p_rcv->p_subn->sm_guid_tbl;
+      p_rem_sm = (osm_remote_sm_t*)cl_qmap_get( p_sm_guid_tbl, port_guid );
+      if (p_rem_sm != (osm_remote_sm_t*)cl_qmap_end( p_sm_guid_tbl ) )
+        __osm_sa_smir_by_comp_mask( p_rcv, p_rem_sm, &context );
+      else
+      {
+        osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+                 "osm_smir_rcv_process: ERR 280A: "
+                 "No remote SM for GUID  0x%016" PRIx64 "\n",
+                 cl_ntoh64( port_guid ) );
+      }
+    }
+    else
+    {
+      /* Go over all other known (remote) SMs */
+      cl_qmap_apply_func( &p_rcv->p_subn->sm_guid_tbl,
+                          __osm_sa_smir_by_comp_mask_cb,
+                          &context );
+    }
+  }
+
+  cl_plock_release( p_rcv->p_lock );
+
+  num_rec = cl_qlist_count( &rec_list );
+
+  /*
+   * C15-0.1.30:
+   * If we do a SubnAdmGet and got more than one record it is an error !
+   */
+  if (p_rcvd_mad->method == IB_MAD_METHOD_GET)
+  {
+    if (num_rec == 0)
+    {
+      osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS );
+      goto Exit;
+    }
+    if (num_rec > 1)
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "osm_smir_rcv_process: ERR 2808: "
+               "Got more than one record for SubnAdmGet (%u)\n",
+               num_rec );
+      osm_sa_send_error( p_rcv->p_resp, p_madw,
+                         IB_SA_MAD_STATUS_TOO_MANY_RECORDS);
+
+      /* need to set the mem free ... */
+      p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list );
+      while( p_rec_item != (osm_smir_item_t*)cl_qlist_end( &rec_list ) )
+      {
+        cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+        p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list );
+      }
+
+      goto Exit;
+    }
+  }
+
+  pre_trim_num_rec = num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_sminfo_record_t);
+  if (trim_num_rec < num_rec)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
+             "osm_smir_rcv_process: "
+             "Number of records:%u trimmed to:%u to fit in one MAD\n",
+             num_rec, trim_num_rec );
+    num_rec = trim_num_rec;
+  }
+#endif
+
+  osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+           "osm_smir_rcv_process: "
+           "Returning %u records\n", num_rec );
+
+  if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0))
+  {
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_NO_RECORDS );
     goto Exit;
   }
 
-  /*  
-   *  Get a MAD to reply. Address of Mad is in the received mad_wrapper
+  /*
+   * Get a MAD to reply. Address of Mad is in the received mad_wrapper
    */
-  p_resp_madw = osm_mad_pool_get(p_rcv->p_mad_pool,
-                                 p_madw->h_bind,
-                                 sizeof(ib_sminfo_record_t) + IB_SA_MAD_HDR_SIZE,
-                                 &p_madw->mad_addr );
+  p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool,
+                                  p_madw->h_bind,
+                                  num_rec * sizeof(ib_sminfo_record_t) + IB_SA_MAD_HDR_SIZE,
+                                  &p_madw->mad_addr );
+
   if( !p_resp_madw )
   {
     osm_log(p_rcv->p_log, OSM_LOG_ERROR,
-            "osm_smir_rcv_process: ERR 2801: "
-            "Unable to acquire response MAD\n" );
+            "osm_smir_rcv_process: ERR 2807: "
+            "osm_mad_pool_get failed\n" );
+
+    for( i = 0; i < num_rec; i++ )
+    {
+      p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list );
+      cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    }
+
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_NO_RESOURCES );
+
     goto Exit;
   }
 
   p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw );
-  p_resp_sminfo_rec =
-    (ib_sminfo_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
 
-  p_resp_sminfo_rec->resv0 = 0;
-
-  /* HACK: This handling is incorrect. Records of known SMs
-     by our SM, and not just the details of our own SM
-     should be returned. */
-
-  cl_plock_acquire( p_rcv->p_lock );
+  /*
+    Copy the MAD header back into the response mad.
+    Set the 'R' bit and the payload length,
+    Then copy all records from the list into the response payload.
+  */
 
-  /* get our local sm_base_lid to send in the sminfo */
-  p_resp_sminfo_rec->lid = p_rcv->p_subn->sm_base_lid;
-  p_resp_sminfo_rec->sm_info.guid = p_rcv->p_subn->sm_port_guid;
-  p_resp_sminfo_rec->sm_info.sm_key = p_rcv->p_subn->opt.sm_key;
-  p_resp_sminfo_rec->sm_info.act_count =
-    cl_ntoh32(p_rcv->p_stats->qp0_mads_sent);
-  p_resp_sminfo_rec->sm_info.pri_state = p_rcv->p_subn->sm_state;
+  memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE );
+  p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK;
+  /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
+  p_resp_sa_mad->sm_key = 0;
+  /* Fill in the offset (paylen will be done by the rmpp SAR) */
+  p_resp_sa_mad->attr_offset =
+    ib_get_attr_offset( sizeof(ib_sminfo_record_t) );
 
-  cl_plock_release( p_rcv->p_lock );
+  p_resp_rec = (ib_sminfo_record_t*)
+    ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
 
-  /*  Copy the MAD header back into the response mad */
-  memcpy( p_resp_sa_mad, p_sa_mad, IB_SA_MAD_HDR_SIZE );
-  if( p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE )
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we support only one packet RMPP - so we will set the first and
+     last flags for gettable */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
   {
-    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE;
-    /* Fill in the offset (paylen will be done by the rmpp SAR) */
-    p_resp_sa_mad->attr_offset =
-      ib_get_attr_offset( sizeof(ib_sminfo_record_t) );
+    p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA;
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE;
   }
+#else
+  /* forcefully define the packet as RMPP one */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE;
+#endif
 
-  p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK;
+  for( i = 0; i < pre_trim_num_rec; i++ )
+  {
+    p_rec_item = (osm_smir_item_t*)cl_qlist_remove_head( &rec_list );
+    /* copy only if not trimmed */
+    if (i < num_rec)
+    {
+      *p_resp_rec = p_rec_item->rec;
+      p_resp_rec->sm_info.sm_key = 0;
+    }
+    cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    p_resp_rec++;
+  }
 
-  /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
-  p_resp_sa_mad->sm_key = 0;
+  CL_ASSERT( cl_is_qlist_empty( &rec_list ) );
 
   status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE );
   if( status != IB_SUCCESS )


From svenar at simula.no  Fri Dec 22 09:10:52 2006
From: svenar at simula.no (Sven-Arne Reinemo)
Date: Fri, 22 Dec 2006 18:10:52 +0100
Subject: [openib-general] SA redirect
Message-ID: <458C119C.6090302@simula.no>

Hi,

One quick question, is SA redirection supported by OpenSM? I did a
check, but could not find any information about this.

-- 
SAR
---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ----
 "There are only 10 kinds of people in this world; those who know
  binary and those who don't."
                                           -- Unknown


From svenar at simula.no  Fri Dec 22 09:07:37 2006
From: svenar at simula.no (Sven-Arne Reinemo)
Date: Fri, 22 Dec 2006 18:07:37 +0100
Subject: [openib-general] Bug in IBMgtSim?
Message-ID: <458C10D9.2080909@simula.no>

Hi,

There seems to be a bug in IBMgtSim where it forwards packets received
from the SM back onto the port where the SM is connected. OpenSM just
drops the packet, so it does not seem very problematic, but I am just
checking to see if anyone else sees this behaviour. Below is an example
from the log files.

Packet dropped by OpenSM:

Dec 15 15:21:34 992520 [B44E3BB0] -> Duplicate TID 0x6A7B00001234
received (not a response). Dropping the MAD.


Packets forwarded to the SM (the first one is the one that is dropped):

-I- Using random seed:96960
-I- Parsing topology definition:/simulator/scalability/test_top.topo
-I- Defined 3/3 systems/nodes
-I- Init fabric: fabric:1
-I- Started server: opensm.simula.no port:60493
-I- Ready to serve
-I- Connecting: sock9 127.0.1.1 48227
--------------------------------------------------------
  sl                       0x0
  pkey_index               0x0
  slid                     0x0
  dlid                     0xffff
  sqpn                     0x0
  dqpn                     0x0
--------------------------------------------------------
--------------------------------------------------------
  base_ver                 0x1
  mgmt_class               0x81
  class_ver                0x1
  method                   0x1
  status                   0x0
  class_spec               0x100
  trans_id                 0x00006a7b00001234
  attr_id                  0x11
  attr_mod                 0x0
--------------------------------------------------------
--------------------------------------------------------
  sl                       0x0
  pkey_index               0x0
  slid                     0xffff
  dlid                     0x0
  sqpn                     0x0
  dqpn                     0x0
--------------------------------------------------------
--------------------------------------------------------
  base_ver                 0x1
  mgmt_class               0x81
  class_ver                0x1
  method                   0x81
  status                   0x8000
  class_spec               0x0
  trans_id                 0x00006a7b00001234
  attr_id                  0x11
  attr_mod                 0x0
--------------------------------------------------------


-- 
SAR
---- GnuPG public key - http://home.ifi.uio.no/~svenar/gpg.asc ----
 "There are only 10 kinds of people in this world; those who know
  binary and those who don't."
                                           -- Unknown


From halr at voltaire.com  Fri Dec 22 11:49:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 22 Dec 2006 14:49:08 -0500
Subject: [openib-general] SA redirect
In-Reply-To: <458C119C.6090302@simula.no>
References: <458C119C.6090302@simula.no>
Message-ID: <1166816948.4519.197118.camel@hal.voltaire.com>

On Fri, 2006-12-22 at 12:10, Sven-Arne Reinemo wrote:
> Hi,
> 
> One quick question, is SA redirection supported by OpenSM? I did a
> check, but could not find any information about this.

It's not currently supported.

-- Hal


From jsquyres at cisco.com  Fri Dec 22 12:31:08 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri, 22 Dec 2006 15:31:08 -0500
Subject: [openib-general] DNS changes
Message-ID: <BC36CF32-A7D4-425E-A25C-BA463C5FACE8@cisco.com>

In order to move on to the next phase of the server transition, we  
have made some changes to the OFA DNS.  Most users should not notice  
the changes (if you encounter any problems, please let me know ASAP);  
the gist of it is that we have a few more names that are slowly  
creeping their way around the world:

ssh.openfabrics.org
svn.openfabrics.org
git.openfabrics.org
lists.openfabrics.org
wiki.openfabrics.org
bugs.openfabrics.org
www2.openfabrics.org

All of these point to the new server.  "www2" is only for testing  
purposes; it will eventually go away when "www" is switched to point  
to the new server.

Back-end services are not yet hooked up to these names; we'll do that  
in the not-distant future (probably in the new year at this point).

Also note that the name "staging.openfabrics.org" will eventually go  
away -- at some point after all the new names are in place and the  
dust has settled.  There will be adequate warning before this occurs  
(so that you can get new git checkouts, etc.), so consider this an  
early warning.

Happy holidays!

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From eitan at sw053.yok.mtl.com  Fri Dec 22 21:28:23 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Sat, 23 Dec 2006 07:28:23 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion
Message-ID: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1
ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
Total=396 Pass=393 Fail=3

Pass:
54 Stability IS1-16.topo
54 Pkey IS1-16.topo
54 Multicast IS1-16.topo
54 LidMgr IS1-16.topo
53 OsmStress IS1-16.topo
18 Stability IS3-loop.topo
18 Stability IS3-128.topo
18 OsmStress IS3-128.topo
18 Multicast IS3-loop.topo
18 LidMgr IS3-128.topo
17 Pkey IS3-128.topo
17 Multicast IS3-128.topo

Failures:
1 Pkey IS3-128.topo
1 OsmStress IS1-16.topo
1 Multicast IS3-128.topo


From eitan at mellanox.co.il  Sat Dec 23 00:52:24 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 23 Dec 2006 10:52:24 +0200
Subject: [openib-general] Bug in IBMgtSim?
In-Reply-To: <458C10D9.2080909@simula.no>
References: <458C10D9.2080909@simula.no>
Message-ID: <458CEE48.3020207@mellanox.co.il>

Hi Sven,

Yes, this is the behavior of the simulator.
Any MAD that leaves a node and targets that same node also appears
back at the sender's input.
This is also the behavior of the gen1 IB stack. As you pointed out, the
SM (osm vendor layer) is capable of dropping these duplicated MADs.

The "bug" stems from the fact that MADs, when injected into the simulator,
do not carry any information about the client which injected them. So when
they finally arrive at the destination, they are forwarded to all "MAD
processors" attached to that node (for the specific "management class").
So if the SM is registered as a "MAD Processor" for SMPs on node A and
sends an SMP to node A, the SMP will be received by the SM as well as by
any other "MAD Processor" attached to that node (e.g. the SMA on node A).
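
The delivery rule described above can be sketched in C roughly as follows.
This is only an illustration, not IBMgtSim or OpenSM code: the types
sim_mad_t, mad_processor_t and sim_node_t and all three functions are
hypothetical names, and the duplicate check is a guess at the kind of
filter the "Duplicate TID ... (not a response)" log line suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROCESSORS 8

/* Hypothetical, simplified MAD: only the fields this sketch needs. */
typedef struct {
    uint8_t  mgmt_class;  /* 0x81 in the quoted log */
    uint8_t  method;      /* bit 0x80 marks a response */
    uint64_t trans_id;
} sim_mad_t;

/* A client registered on a node for one management class. */
typedef struct {
    const char *name;
    uint8_t     mgmt_class;
    unsigned    received;   /* number of MADs delivered so far */
} mad_processor_t;

typedef struct {
    mad_processor_t procs[MAX_PROCESSORS];
    size_t          num_procs;
} sim_node_t;

/* Register a processor on the node and return its index. */
size_t sim_node_register(sim_node_t *node, const char *name,
                         uint8_t mgmt_class)
{
    assert(node->num_procs < MAX_PROCESSORS);
    mad_processor_t *p = &node->procs[node->num_procs];
    p->name = name;
    p->mgmt_class = mgmt_class;
    p->received = 0;
    return node->num_procs++;
}

/*
 * Deliver a MAD to every processor registered for its class.
 * The MAD carries no sender identity, so the sender cannot be
 * skipped. Returns the number of deliveries.
 */
unsigned sim_node_deliver(sim_node_t *node, const sim_mad_t *mad)
{
    unsigned delivered = 0;
    for (size_t i = 0; i < node->num_procs; i++) {
        if (node->procs[i].mgmt_class == mad->mgmt_class) {
            node->procs[i].received++;
            delivered++;
        }
    }
    return delivered;
}

/*
 * Receiver-side filter (assumed): a non-response MAD whose TID matches
 * a request we still have outstanding is our own request echoed back.
 */
int is_looped_back_request(const sim_mad_t *mad,
                           const uint64_t *pending_tids,
                           size_t num_pending)
{
    if (mad->method & 0x80)
        return 0;               /* responses are never dropped here */
    for (size_t i = 0; i < num_pending; i++)
        if (pending_tids[i] == mad->trans_id)
            return 1;
    return 0;
}
```

With an "SM" and an "SMA" both registered for class 0x81 on the same node,
sim_node_deliver() hands one request to both of them, and the SM side can
then use is_looped_back_request() to discard its own copy.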

Eitan

Sven-Arne Reinemo wrote:
> Hi,
>
> There seemed to be a bug in IBMgtSim where it forwards packets received
> from the SM back onto the port where the SM is connected. OpenSM just
> drops the packet so it does not seem very problematic, but I am just
> checking to see if anyone else see this behaviour. Below are an example
> from the log files.
>   
> Packet dropped by OpenSM:
>
> Dec 15 15:21:34 992520 [B44E3BB0] -> Duplicate TID 0x6A7B00001234
> received (not a response). Dropping the MAD.
>
>
> Packets forwarded to the SM (the first one is the one that is dropped):
>
> -I- Using random seed:96960
> -I- Parsing topology definition:/simulator/scalability/test_top.topo
> -I- Defined 3/3 systems/nodes
> -I- Init fabric: fabric:1
> -I- Started server: opensm.simula.no port:60493
> -I- Ready to serve
> -I- Connecting: sock9 127.0.1.1 48227
> --------------------------------------------------------
>   sl                       0x0
>   pkey_index               0x0
>   slid                     0x0
>   dlid                     0xffff
>   sqpn                     0x0
>   dqpn                     0x0
> --------------------------------------------------------
> --------------------------------------------------------
>   base_ver                 0x1
>   mgmt_class               0x81
>   class_ver                0x1
>   method                   0x1
>   status                   0x0
>   class_spec               0x100
>   trans_id                 0x00006a7b00001234
>   attr_id                  0x11
>   attr_mod                 0x0
> --------------------------------------------------------
> --------------------------------------------------------
>   sl                       0x0
>   pkey_index               0x0
>   slid                     0xffff
>   dlid                     0x0
>   sqpn                     0x0
>   dqpn                     0x0
> --------------------------------------------------------
> --------------------------------------------------------
>   base_ver                 0x1
>   mgmt_class               0x81
>   class_ver                0x1
>   method                   0x81
>   status                   0x8000
>   class_spec               0x0
>   trans_id                 0x00006a7b00001234
>   attr_id                  0x11
>   attr_mod                 0x0
> --------------------------------------------------------
>
>
>   


From eitan at mellanox.co.il  Sat Dec 23 01:30:09 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 23 Dec 2006 11:30:09 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-23:normal
 completion
In-Reply-To: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com>
References: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com>
Message-ID: <458CF721.9030909@mellanox.co.il>

Analysis of the 3 failures:
1. TEST BUG: OsmStress - somehow the simulation caused the local port to
   be turned down (this must be a bug in the random error injection, which
   should avoid that port). So the simulation ends with OpenSM still
   trying to connect to the network.
2. Multicast - The osm.mcfdbs is empty. Apparently no Joins were
   received by the SM. This will require further debugging.
3. PKey - the test fails on obtaining ALL path records for the 128-node
   case. OpenSM complains about a timeout during the RMPP transaction.
   I will add some more time to the transaction timeout for the simulation.

EZ
 
Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1
> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
> Total=396 Pass=393 Fail=3
>
> Pass:
> 54 Stability IS1-16.topo
> 54 Pkey IS1-16.topo
> 54 Multicast IS1-16.topo
> 54 LidMgr IS1-16.topo
> 53 OsmStress IS1-16.topo
> 18 Stability IS3-loop.topo
> 18 Stability IS3-128.topo
> 18 OsmStress IS3-128.topo
> 18 Multicast IS3-loop.topo
> 18 LidMgr IS3-128.topo
> 17 Pkey IS3-128.topo
> 17 Multicast IS3-128.topo
>
> Failures:
> 1 Pkey IS3-128.topo
> 1 OsmStress IS1-16.topo
> 1 Multicast IS3-128.topo
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From eitan at sw053.yok.mtl.com  Sat Dec 23 09:49:28 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Sat, 23 Dec 2006 19:49:28 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-23:normal completion
Message-ID: <200612231749.kBNHnSWU029121@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1
ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 
Total=81 Pass=80 Fail=1

Pass:
9 Stability IS1-16.topo
9 Pkey IS1-16.topo
9 OsmTest IS1-16.topo
9 OsmStress IS1-16.topo
9 Multicast IS1-16.topo
9 LidMgr IS1-16.topo
3 Stability IS3-loop.topo
3 Stability IS3-128.topo
3 Pkey IS3-128.topo
3 OsmTest IS3-loop.topo
3 OsmTest IS3-128.topo
3 OsmStress IS3-128.topo
3 Multicast IS3-loop.topo
3 LidMgr IS3-128.topo
2 Multicast IS3-128.topo

Failures:
1 Multicast IS3-128.topo


From halr at voltaire.com  Sat Dec 23 17:31:14 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 23 Dec 2006 20:31:14 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-23:normal
 completion
In-Reply-To: <458CF721.9030909@mellanox.co.il>
References: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com>
	<458CF721.9030909@mellanox.co.il>
Message-ID: <1166923872.4519.284465.camel@hal.voltaire.com>

On Sat, 2006-12-23 at 04:30, Eitan Zahavi wrote:
> Analysis of the 3 failures:
> 1. TEST BUG: osmStress - somehow the simulation caused the local port to 
> be turned down (must be a bug in the random error injection which
>    should avoid that port). So the simulation ends with OpenSM still 
> trying to connect to the network.
> 2. Multicast - The osm.mcfdbs is empty. Apparently no Joins where 
> received by the SM. This will require further debug.
> 3. PKey: the test fails on obtaining ALL path records for the 128 node 
> case. OpenSM complain about timeout during the RMPP transaction.

Why did this happen now ?

>    I will add some more time to the transaction timeout for the simulation.
> 
> EZ
>  
> Eitan Zahavi wrote:
> > OSM Simulation Regression Summary
> > OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1
> > ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
> > Total=396 Pass=393 Fail=3
> >
> > Pass:
> > 54 Stability IS1-16.topo
> > 54 Pkey IS1-16.topo
> > 54 Multicast IS1-16.topo
> > 54 LidMgr IS1-16.topo
> > 53 OsmStress IS1-16.topo
> > 18 Stability IS3-loop.topo
> > 18 Stability IS3-128.topo
> > 18 OsmStress IS3-128.topo
> > 18 Multicast IS3-loop.topo
> > 18 LidMgr IS3-128.topo
> > 17 Pkey IS3-128.topo
> > 17 Multicast IS3-128.topo
> >
> > Failures:
> > 1 Pkey IS3-128.topo
> > 1 OsmStress IS1-16.topo
> > 1 Multicast IS3-128.topo
> >
> >   
> 


From eitan at mellanox.co.il  Sat Dec 23 23:16:38 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 09:16:38 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-23:normal
 completion
In-Reply-To: <1166923872.4519.284465.camel@hal.voltaire.com>
References: <200612230528.kBN5SNOl018673@sw053.yok.mtl.com>
	<458CF721.9030909@mellanox.co.il>
	<1166923872.4519.284465.camel@hal.voltaire.com>
Message-ID: <458E2956.7090700@mellanox.co.il>

Hal Rosenstock wrote:
> On Sat, 2006-12-23 at 04:30, Eitan Zahavi wrote:
>   
>> Analysis of the 3 failures:
>> 1. TEST BUG: osmStress - somehow the simulation caused the local port to 
>> be turned down (must be a bug in the random error injection which
>>    should avoid that port). So the simulation ends with OpenSM still 
>> trying to connect to the network.
>> 2. Multicast - The osm.mcfdbs is empty. Apparently no Joins where 
>> received by the SM. This will require further debug.
>>     
I have found a design bug in the queue of MADs waiting to be dispatched.
I used an STL map, which means that if two MADs were scheduled for the
exact same time, the last one purged the previous one.
The fix is simple - use a multimap instead. It is under testing now.
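
The difference between the two containers can be sketched in plain C with
a small time-ordered array. This is not the simulator's code (which uses
the STL); mad_queue_t and both insert functions are hypothetical. The
"unique" insert models the overwrite behavior described above, and the
"multi" insert models what a multimap keyed by dispatch time would keep.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16

typedef struct {
    uint64_t dispatch_time;
    int      mad_id;
} queued_mad_t;

/* Entries are kept sorted by dispatch_time, earliest first. */
typedef struct {
    queued_mad_t items[QUEUE_CAP];
    size_t       count;
} mad_queue_t;

/* Open a slot at index pos by shifting the tail one place right. */
static void queue_open_slot(mad_queue_t *q, size_t pos)
{
    assert(q->count < QUEUE_CAP);
    for (size_t j = q->count; j > pos; j--)
        q->items[j] = q->items[j - 1];
    q->count++;
}

/*
 * Unique-key insert: a second MAD with the same dispatch time
 * overwrites the first one, which is then silently lost.
 */
void queue_insert_unique(mad_queue_t *q, uint64_t t, int mad_id)
{
    size_t i = 0;
    while (i < q->count && q->items[i].dispatch_time < t)
        i++;
    if (i < q->count && q->items[i].dispatch_time == t) {
        q->items[i].mad_id = mad_id;   /* earlier MAD is purged */
        return;
    }
    queue_open_slot(q, i);
    q->items[i] = (queued_mad_t){ t, mad_id };
}

/*
 * Multi-key insert: equal dispatch times are allowed; a new entry
 * goes after existing entries with the same time (FIFO among equals).
 */
void queue_insert_multi(mad_queue_t *q, uint64_t t, int mad_id)
{
    size_t i = 0;
    while (i < q->count && q->items[i].dispatch_time <= t)
        i++;
    queue_open_slot(q, i);
    q->items[i] = (queued_mad_t){ t, mad_id };
}
```

Inserting two MADs for time 100 with queue_insert_unique() leaves a single
entry, so one MAD is never dispatched, which would match the symptom of
Joins never reaching the SM; queue_insert_multi() keeps both.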
>> 3. PKey: the test fails on obtaining ALL path records for the 128 node 
>> case. OpenSM complain about timeout during the RMPP transaction.
>>     
>
> Why did this happen now ?
>   
Wish I knew. As I said it is under investigation. Might just be a big 
enough transaction to overflow the 2sec timeout.
>   
>>    I will add some more time to the transaction timeout for the simulation.
>>
>> EZ
>>  
>> Eitan Zahavi wrote:
>>     
>>> OSM Simulation Regression Summary
>>> OpenSM rev = Fri_Dec_22_09:28:50_2006 3ceb7c MOD_FILES=1
>>> ibutils rev = Mon_Dec_18_16:00:49_2006 11d857 
>>> Total=396 Pass=393 Fail=3
>>>
>>> Pass:
>>> 54 Stability IS1-16.topo
>>> 54 Pkey IS1-16.topo
>>> 54 Multicast IS1-16.topo
>>> 54 LidMgr IS1-16.topo
>>> 53 OsmStress IS1-16.topo
>>> 18 Stability IS3-loop.topo
>>> 18 Stability IS3-128.topo
>>> 18 OsmStress IS3-128.topo
>>> 18 Multicast IS3-loop.topo
>>> 18 LidMgr IS3-128.topo
>>> 17 Pkey IS3-128.topo
>>> 17 Multicast IS3-128.topo
>>>
>>> Failures:
>>> 1 Pkey IS3-128.topo
>>> 1 OsmStress IS1-16.topo
>>> 1 Multicast IS3-128.topo
>>>
>>>   
>>>       
>
>
>   


From mst at mellanox.co.il  Sun Dec 24 00:49:25 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 24 Dec 2006 10:49:25 +0200
Subject: [openib-general] [PATCH  v4 01/13] Linux RDMA Core Changes
In-Reply-To: <20061214135303.21159.61880.stgit@dell3.ogc.int>
References: <20061214135233.21159.78613.stgit@dell3.ogc.int>
	<20061214135303.21159.61880.stgit@dell3.ogc.int>
Message-ID: <20061224084925.GD15106@mellanox.co.il>

> diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
> index 283d50b..15cbd49 100644
> --- a/drivers/infiniband/hw/mthca/mthca_cq.c
> +++ b/drivers/infiniband/hw/mthca/mthca_cq.c
> @@ -722,7 +722,8 @@ repoll:
>  	return err == 0 || err == -EAGAIN ? npolled : err;
>  }
>  
> -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify)
> +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, 
> +		       struct ib_udata *udata)
>  {
>  	__be32 doorbell[2];
>  
> @@ -739,7 +740,8 @@ int mthca_tavor_arm_cq(struct ib_cq *cq,
>  	return 0;
>  }
>  
> -int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify)
> +int mthca_arbel_arm_cq(struct ib_cq *ibcq, enum ib_cq_notify notify,
> +		       struct ib_udata *udata)
>  {
>  	struct mthca_cq *cq = to_mcq(ibcq);
>  	__be32 doorbell[2];
> diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
> index fe5cecf..6b9ccf6 100644
> --- a/drivers/infiniband/hw/mthca/mthca_dev.h
> +++ b/drivers/infiniband/hw/mthca/mthca_dev.h
> @@ -493,8 +493,8 @@ void mthca_unmap_eq_icm(struct mthca_dev
>  
>  int mthca_poll_cq(struct ib_cq *ibcq, int num_entries,
>  		  struct ib_wc *entry);
> -int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
> -int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify);
> +int mthca_tavor_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
> +int mthca_arbel_arm_cq(struct ib_cq *cq, enum ib_cq_notify notify, struct ib_udata *udata);
>  int mthca_init_cq(struct mthca_dev *dev, int nent,
>  		  struct mthca_ucontext *ctx, u32 pdn,
>  		  struct mthca_cq *cq);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 8eacc35..e3e1a2c 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -941,7 +941,8 @@ struct ib_device {
>  					      struct ib_wc *wc);
>  	int                        (*peek_cq)(struct ib_cq *cq, int wc_cnt);
>  	int                        (*req_notify_cq)(struct ib_cq *cq,
> -						    enum ib_cq_notify cq_notify);
> +						    enum ib_cq_notify cq_notify,
> +						    struct ib_udata *udata);
>  	int                        (*req_ncomp_notif)(struct ib_cq *cq,
>  						      int wc_cnt);
>  	struct ib_mr *             (*get_dma_mr)(struct ib_pd *pd,
> @@ -1373,7 +1374,7 @@ int ib_peek_cq(struct ib_cq *cq, int wc_
>  static inline int ib_req_notify_cq(struct ib_cq *cq,
>  				   enum ib_cq_notify cq_notify)
>  {
> -	return cq->device->req_notify_cq(cq, cq_notify);
> +	return cq->device->req_notify_cq(cq, cq_notify, NULL);
>  }
>  
>  /**

I can't say I like this: it adds overhead to data path operations (and note
that it can't be optimized out). Kernel consumers work without passing udata
in, so it hurts kernel code even for Chelsio. Granted, the cost is small here,
but these things do tend to add up.

It seems all Chelsio needs is to pass in a consumer index - so, how about a new
entry point? Something like void set_cq_udata(struct ib_cq *cq, struct ib_udata *udata)?
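
A rough user-space sketch of that idea, under loud assumptions: the struct
definitions below are simplified stand-ins rather than the real
include/rdma/ib_verbs.h types, the consumer_index field and the demo_*
driver callbacks are hypothetical, and set_cq_udata is only the name
floated above, not an existing verb. The point is just that
req_notify_cq keeps its two-argument signature while the extra data
travels through a separate, optional hook.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; NOT the real kernel definitions. */
struct ib_udata { unsigned int consumer_index; };  /* hypothetical payload */
enum ib_cq_notify { IB_CQ_SOLICITED, IB_CQ_NEXT_COMP };

struct ib_cq;

struct ib_device {
    /* Data path: signature unchanged, no extra argument to pass. */
    int  (*req_notify_cq)(struct ib_cq *cq, enum ib_cq_notify cq_notify);
    /* Proposed separate entry point; NULL if the driver has no use for it. */
    void (*set_cq_udata)(struct ib_cq *cq, struct ib_udata *udata);
};

struct ib_cq {
    struct ib_device *device;
    unsigned int      consumer_index;  /* hypothetical driver state */
    int               armed;
};

/* Kernel consumers keep calling this exactly as before. */
static inline int ib_req_notify_cq(struct ib_cq *cq,
                                   enum ib_cq_notify cq_notify)
{
    return cq->device->req_notify_cq(cq, cq_notify);
}

/* Only the user-verbs path would call this, before arming the CQ. */
static inline void ib_set_cq_udata(struct ib_cq *cq, struct ib_udata *udata)
{
    assert(udata != NULL);
    if (cq->device->set_cq_udata)
        cq->device->set_cq_udata(cq, udata);
}

/* Hypothetical driver callbacks, for illustration only. */
static int demo_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify cq_notify)
{
    (void)cq_notify;
    cq->armed = 1;
    return 0;
}

static void demo_set_cq_udata(struct ib_cq *cq, struct ib_udata *udata)
{
    cq->consumer_index = udata->consumer_index;
}
```

A driver that does not need the consumer index would simply leave
set_cq_udata as NULL, and in-kernel callers of ib_req_notify_cq() would
pay nothing extra.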

-- 
MST


From eitan at mellanox.co.il  Sun Dec 24 04:35:14 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 14:35:14 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4588EAB9.6080106@voltaire.com>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
Message-ID: <458E7402.4000106@mellanox.co.il>

Hi Or,

Sorry it took me a while.

According to the IBTA spec:
1. In order for MTU and MTUSelector to have any effect their component 
mask bits MUST be set to 1 in the query
2. Behavior of the SM is defined with small "freedom" to choose between 
multiple matching MTU values if they exist.
3. The table below summarizes all options:

Assuming the value M represents the lowest MTU on the path
We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
R represents the MTU value in the request. Similarly R-1 is one below R 
and R+1 is one above R.

Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should  | OpenSM Quirk w. Tavor End Port
-----------------------------------------------------------------------------------------
UNDEFINED | UNDEFINED | <= M             | M              | min(M,1K)
R         | <         | <= min(R-1, M)   | min(R-1, M)    | min(R-1, M, 1K)
R         | =         | R if M>=R /ERR   | R if M>=R /ERR | R if M>=R /ERR
R         | >         | R < <= M         | R+1 if M>R /ERR| R+1 if M>R /ERR

I have built some test code to make sure OpenSM does what is required.
Apparently it does not: in any case where M is not identical to R, it fails 
the request.

I am working on fixing OpenSM.

Any comments are welcome.
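
As an illustration only, the "OpenSM Should" and "Tavor quirk" columns of the table above can be sketched as a small function. This is not OpenSM code: the function name, the error value and the handling of selector 3 (which the table does not list; returning M is an assumption) are all hypothetical. MTU values use the IB encoding 1=256 ... 5=4K, so "R-1" and "R+1" are plain integer steps.

```c
#include <assert.h>

/* IB MTU encoding: 1=256, 2=512, 3=1K, 4=2K, 5=4K. */
enum { MTU_256 = 1, MTU_512, MTU_1K, MTU_2K, MTU_4K };

/* Selector values: 0 greater than, 1 less than, 2 exactly, 3 largest. */
enum { SEL_GT = 0, SEL_LT = 1, SEL_EQ = 2, SEL_LARGEST = 3 };

#define MTU_ERR_SKETCH (-1)  /* hypothetical "no path" marker */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * m        lowest MTU reported along the path
 * have_req nonzero if both MTU and MTUSelector component mask bits are set
 * sel, r   selector and requested MTU (ignored when have_req is zero)
 * tavor    nonzero if the quirk applies (one end port is a Tavor)
 */
int resolve_mtu_sketch(int m, int have_req, int sel, int r, int tavor)
{
	int v;

	if (!have_req)
		return tavor ? min_int(m, MTU_1K) : m;

	switch (sel) {
	case SEL_LT:
		v = min_int(r - 1, m);
		if (tavor)
			v = min_int(v, MTU_1K);
		return v >= MTU_256 ? v : MTU_ERR_SKETCH;
	case SEL_EQ:
		return m >= r ? r : MTU_ERR_SKETCH;
	case SEL_GT:
		return m > r ? r + 1 : MTU_ERR_SKETCH;
	default:
		return m;  /* assumption: "largest available" is not in the table */
	}
}

/* Usage: a few rows of the table for a path whose lowest MTU is 2K. */
void demo_resolve_mtu(void)
{
	assert(resolve_mtu_sketch(MTU_2K, 0, 0, 0, 0) == MTU_2K);
	assert(resolve_mtu_sketch(MTU_2K, 0, 0, 0, 1) == MTU_1K);
	assert(resolve_mtu_sketch(MTU_2K, 1, SEL_EQ, MTU_4K, 0) == MTU_ERR_SKETCH);
}
```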

EZ

Or Gerlitz wrote:
> Michael S. Tsirkin wrote:
>   
>> I am not yet sure what is best for upstream, so I don't really think we need
>> any RFCs.
>>     
>
>   
>> We'll need data from SM guys on whether MTU selector actually works
>> in SMs, and if not what happens when you enable it.
>>     
>
> Eitan,
>
> Can you please post here the tavor-quirk patch which was integrated into 
> opensm? i can see the ***code*** of the opensm but might make some wrong 
> assumptions or get into wrong understandings as i am not able to see the 
> patch as is.
>
> Or.
>
>
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From eitan at mellanox.co.il  Sun Dec 24 04:40:18 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 14:40:18 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
Message-ID: <458E7532.5030400@mellanox.co.il>

Hi Hal,

OpenSM just uses the resulting path MTU/rate/pkt-life and fails the
query, even though the selector might allow an appropriate value to be
selected.

I have made the attached ibis-based program for testing MTU select.

After this fix, the following results are obtained for a path allowing
a maximal MTU of 2K.

In standard mode:
------------------------------------------------------------
MTU greater then ... 256     (0x01) ->  equal to ....... 2K
MTU less then ...... 256     (0x41) ->  NO PATHS
MTU equal to ....... 256     (0x81) ->  equal to ....... 256
MTU largest possible 256     (0xc1) ->  equal to ....... 2K
MTU greater then ... 512     (0x02) ->  equal to ....... 2K
MTU less then ...... 512     (0x42) ->  equal to ....... 256
MTU equal to ....... 512     (0x82) ->  equal to ....... 512
MTU largest possible 512     (0xc2) ->  equal to ....... 2K
MTU greater then ... 1K      (0x03) ->  equal to ....... 2K
MTU less then ...... 1K      (0x43) ->  equal to ....... 512
MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
MTU greater then ... 2K      (0x04) ->  NO PATHS
MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
MTU greater then ... 4K      (0x05) ->  NO PATHS
MTU less then ...... 4K      (0x45) ->  equal to ....... 2K
MTU equal to ....... 4K      (0x85) ->  NO PATHS
MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
============================================================

With enable_quirks (when one of the ends is a Tavor device):
------------------------------------------------------------
MTU greater then ... 256     (0x01) ->  equal to ....... 1K
MTU less then ...... 256     (0x41) ->  NO PATHS
MTU equal to ....... 256     (0x81) ->  equal to ....... 256
MTU largest possible 256     (0xc1) ->  equal to ....... 2K
MTU greater then ... 512     (0x02) ->  equal to ....... 1K
MTU less then ...... 512     (0x42) ->  equal to ....... 256
MTU equal to ....... 512     (0x82) ->  equal to ....... 512
MTU largest possible 512     (0xc2) ->  equal to ....... 2K
MTU greater then ... 1K      (0x03) ->  NO PATHS
MTU less then ...... 1K      (0x43) ->  equal to ....... 512
MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
MTU greater then ... 2K      (0x04) ->  NO PATHS
MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
MTU greater then ... 4K      (0x05) ->  NO PATHS
MTU less then ...... 4K      (0x45) ->  equal to ....... 1K
MTU equal to ....... 4K      (0x85) ->  NO PATHS
MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
============================================================
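
The hexadecimal values in the listings above appear to pack the selector into the top two bits and the MTU code into the low six bits (for example 0x41 is selector 1, "less than", with MTU code 1, 256 bytes). A minimal sketch of that decoding follows; the helper names are hypothetical and are not taken from the test program.

```c
#include <assert.h>
#include <stdint.h>

/* Selector: 0 greater than, 1 less than, 2 equal to, 3 largest possible. */
unsigned mtu_sel_of_sketch(uint8_t b)
{
	return b >> 6;
}

/* MTU code in the IB encoding: 1=256, 2=512, 3=1K, 4=2K, 5=4K. */
unsigned mtu_code_of_sketch(uint8_t b)
{
	return b & 0x3f;
}

/* Size in bytes for a valid MTU code, 0 for anything else. */
unsigned mtu_bytes_of_sketch(unsigned code)
{
	return (code >= 1 && code <= 5) ? (128u << code) : 0;
}

/* Usage: 0x84 reads as "equal to 2K". */
void demo_decode_mtu(void)
{
	assert(mtu_sel_of_sketch(0x84) == 2);
	assert(mtu_bytes_of_sketch(mtu_code_of_sketch(0x84)) == 2048);
}
```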

Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>

---
commit 7a156fd924a543b9891c676024a3dd9a90f848a9
tree 43a00fa2792aeb7d5684c6817154c9338ca96ed9
parent 613e7eea4d14a69e1faaaf251cb88f40dfe5e5a6
author Eitan Zahavi <eitan at sw053.yok.mtl.com> Sun, 24 Dec 2006 14:31:21 +0200
committer Eitan Zahavi <eitan at sw053.yok.mtl.com> Sun, 24 Dec 2006 14:31:21 +0200

 osm/opensm/osm_sa_multipath_record.c |   83 +++++++++++++++++++++++-----------
 osm/opensm/osm_sa_path_record.c      |   48 ++++++++++++++++----
 2 files changed, 93 insertions(+), 38 deletions(-)

diff --git a/osm/opensm/osm_sa_multipath_record.c b/osm/opensm/osm_sa_multipath_record.c
index 28a0190..3eb7a6d 100644
--- a/osm/opensm/osm_sa_multipath_record.c
+++ b/osm/opensm/osm_sa_multipath_record.c
@@ -615,20 +615,29 @@ __osm_mpr_rcv_get_path_parms(
     required_mtu = ib_multipath_rec_mtu( p_mpr );
     switch ( ib_multipath_rec_mtu_sel( p_mpr ) )
     {
-      case 0:    /* must be greater than */
+      case 0:  /* must be greater than */
         if ( mtu <= required_mtu )
           status = IB_NOT_FOUND;
         break;
 
-      case 1:    /* must be less than */
-        if ( mtu >= required_mtu )
-          status = IB_NOT_FOUND;
-        break;
-
-      case 2:    /* exact match */
-        if ( mtu != required_mtu )
-          status = IB_NOT_FOUND;
-        break;
+      case 1:  /* must be less than */
+         if( mtu >= required_mtu )
+         {
+            /* adjust to use the highest mtu
+               lower then the required one */
+            if (required_mtu > 1)
+               mtu = required_mtu - 1;
+            else
+               status = IB_NOT_FOUND;
+         }
+         break;
+
+      case 2:  /* exact match */
+         if( mtu < required_mtu )
+            status = IB_NOT_FOUND;
+         else
+            mtu = required_mtu;
+         break;
 
       case 3:    /* largest available */
         /* can't be disqualified by this one */
@@ -646,22 +655,31 @@ __osm_mpr_rcv_get_path_parms(
   if ( ( comp_mask & IB_MPR_COMPMASK_RATESELEC ) &&
        ( comp_mask & IB_PR_COMPMASK_RATE ) )
   {
-    required_rate = ib_multipath_rec_rate( p_mpr );
-    switch ( ib_multipath_rec_rate_sel( p_mpr ) )
-    {
-      case 0:    /* must be greater than */
+     required_rate = ib_multipath_rec_rate( p_mpr );
+     switch ( ib_multipath_rec_rate_sel( p_mpr ) )
+     {
+     case 0:   /* must be greater than */
         if ( rate <= required_rate )
-          status = IB_NOT_FOUND;
+           status = IB_NOT_FOUND;
         break;
-
-      case 1:    /* must be less than */
-        if ( rate >= required_rate )
-          status = IB_NOT_FOUND;
+       
+     case 1:   /* must be less than */
+        if( rate >= required_rate )
+        {
+           /* adjust the rate to use the highest rate
+              lower then the required one */
+           if (required_rate > 2)
+              rate = required_rate - 1;
+           else
+              status = IB_NOT_FOUND;
+        }
         break;
-
-      case 2:    /* exact match */
-        if ( rate != required_rate )
-          status = IB_NOT_FOUND;
+       
+     case 2:   /* exact match */
+        if( rate < required_rate )
+           status = IB_NOT_FOUND;
+        else
+           rate = required_rate;
         break;
 
       case 3:    /* largest available */
@@ -697,13 +715,22 @@ __osm_mpr_rcv_get_path_parms(
       break;
 
     case 1:    /* must be less than */
-      if ( pkt_life >= required_pkt_life )
-        status = IB_NOT_FOUND;
-      break;
+       if( pkt_life >= required_pkt_life )
+       {
+          /* adjust the lifetime to use the highest possible
+             lower then the required one */
+          if (required_pkt_life > 1)
+             pkt_life = required_pkt_life - 1;
+          else
+             status = IB_NOT_FOUND;
+       }
+       break;
 
     case 2:    /* exact match */
-      if ( pkt_life != required_pkt_life )
-        status = IB_NOT_FOUND;
+       if( pkt_life < required_pkt_life )
+          status = IB_NOT_FOUND;
+       else
+          pkt_life = required_pkt_life;
       break;
 
     case 3:    /* smallest available */
diff --git a/osm/opensm/osm_sa_path_record.c b/osm/opensm/osm_sa_path_record.c
index 7f4a1b6..6d2e64e 100644
--- a/osm/opensm/osm_sa_path_record.c
+++ b/osm/opensm/osm_sa_path_record.c
@@ -528,6 +528,7 @@ __osm_pr_rcv_get_path_parms(
 
   /*
     Determine if these values meet the user criteria
+     and adjust appropriatly
   */
 
   /* we silently ignore cases where only the MTU selector is defined */
@@ -543,13 +544,22 @@ __osm_pr_rcv_get_path_parms(
       break;
 
     case 1:    /* must be less than */
-      if( mtu >= required_mtu )
-        status = IB_NOT_FOUND;
+       if( mtu >= required_mtu )
+       {
+          /* adjust to use the highest mtu
+             lower then the required one */
+          if (required_mtu > 1)
+             mtu = required_mtu - 1;
+          else
+             status = IB_NOT_FOUND;
+       }
       break;
 
     case 2:    /* exact match */
-      if( mtu != required_mtu )
-        status = IB_NOT_FOUND;
+      if( mtu < required_mtu )
+         status = IB_NOT_FOUND;
+      else
+         mtu = required_mtu;
       break;
 
     case 3:    /* largest available */
@@ -578,12 +588,21 @@ __osm_pr_rcv_get_path_parms(
 
     case 1:    /* must be less than */
       if( rate >= required_rate )
-        status = IB_NOT_FOUND;
+      {
+         /* adjust the rate to use the highest rate
+            lower then the required one */
+         if (required_rate > 2)
+            rate = required_rate - 1;
+         else
+            status = IB_NOT_FOUND;
+      }
       break;
 
     case 2:    /* exact match */
-      if( rate != required_rate )
-        status = IB_NOT_FOUND;
+      if( rate < required_rate )
+         status = IB_NOT_FOUND;
+      else
+         rate = required_rate;
       break;
 
     case 3:    /* largest available */
@@ -620,12 +639,21 @@ __osm_pr_rcv_get_path_parms(
 
     case 1:    /* must be less than */
       if( pkt_life >= required_pkt_life )
-        status = IB_NOT_FOUND;
+      {
+         /* adjust the lifetime to use the highest possible
+            lower then the required one */
+         if (required_pkt_life > 1)
+            pkt_life = required_pkt_life - 1;
+         else
+            status = IB_NOT_FOUND;
+      }
       break;
 
     case 2:    /* exact match */
-      if( pkt_life != required_pkt_life )
-        status = IB_NOT_FOUND;
+      if( pkt_life < required_pkt_life )
+         status = IB_NOT_FOUND;
+      else
+         pkt_life = required_pkt_life;
       break;
 
     case 3:    /* smallest available */


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ibisTestPathRec.tcl
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061224/56740f91/attachment.ksh>

From halr at voltaire.com  Sun Dec 24 05:36:23 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 24 Dec 2006 08:36:23 -0500
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458E7402.4000106@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il>
Message-ID: <1166967379.4519.320031.camel@hal.voltaire.com>

Hi Eitan,

On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote:
> Hi Or,
> 
> Sorry it took me a while.
> 
> According to the IBTA spec:
> 1. In order for MTU and MTUSelector to have any effect their component 
> mask bits MUST be set to 1 in the query
> 2. Behavior of the SM is defined with small "freedom" to choose between 
> multiple matching MTU values if they exist.

I agree in general but would like to be sure about the details. Please
be specific as to what IBA spec text you are referring to.

> 3. The table below summarizes all options:
> 
> Assuming the value M represents the lowest MTU on the path

Is M the lowest available MTU or the highest available MTU for that path?

> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R 
> and R+1 is one above R.
> 
> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should  | OpenSM Quirk w. Tavor End Port
> -----------------------------------------------------------------------------------------
> UNDEFINED | UNDEFINED | <= M             | M              | min(M,1K)
> R         | <         | <= min(R-1, M)   | min(R-1, M)    | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR   | R if M>=R /ERR | R if M>=R /ERR
> R         | >         | R < <= M         | R+1 if M>R /ERR| R+1 if M>R /ERR
                          ^^^^^^^^
For the R> spec response column, I think you are saying the same as:
                       >R AND <=M if M>R /ERR
                         or
                       R < x <=M if M>R /ERR
                       where x is resp value

I agree with this table given the redefinition of M above and R > spec
response interpretation.

-- Hal

> I have built some test code for making sure OpenSM does what is required.
> Apparently it does not. In any case the M is not identical to R it fails 
> the request.
> 
> I am working on fixing OpenSM.
> 
> Any comments are welcome.
> 
> EZ
> 
> Or Gerlitz wrote:
> > Michael S. Tsirkin wrote:
> >   
> >> I am not yet sure what is best for upstream, so I don't really think we need
> >> any RFCs.
> >>     
> >
> >   
> >> We'll need data from SM guys on whether MTU selector actually works
> >> in SMs, and if not what happens when you enable it.
> >>     
> >
> > Eitan,
> >
> > Can you please post here the tavor-quirk patch which was integrated into 
> > opensm? i can see the ***code*** of the opensm but might make some wrong 
> > assumptions or get into wrong understandings as i am not able to see the 
> > patch as is.
> >
> > Or.
> 


From sashak at voltaire.com  Sun Dec 24 09:02:48 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:02:48 +0200
Subject: [openib-general] [PATCH TRIVIAL] opensm: remove unused local
	variable
Message-ID: <20061224170248.GA7111@sashak.voltaire.com>


Remove unused local variable.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_node.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index aba2e39..684eee6 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -97,7 +97,6 @@ osm_node_new(
   osm_node_t         *p_node;
   ib_smp_t        *p_smp;
   ib_node_info_t     *p_ni;
-  uint8_t            port_num;
   uint8_t            i;
   uint32_t        size;
 
@@ -108,7 +107,6 @@ osm_node_new(
   CL_ASSERT( p_smp->attr_id == IB_MAD_ATTR_NODE_INFO );
 
   p_ni = (ib_node_info_t*)ib_smp_get_payload_ptr( p_smp );
-  port_num = ib_node_info_get_local_port_num( p_ni );
 
   /*
     The node object already contains one physical port object.
-- 
1.4.4.2.gfc82d


From sashak at voltaire.com  Sun Dec 24 09:03:29 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:03:29 +0200
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
Message-ID: <20061224170329.GB7111@sashak.voltaire.com>


When a port is removed from the subnet but a previously requested pkey
table block is received afterwards, the lock is released twice. This
leads to deadlocks later, when another MAD processor tries to
acquire the same lock.
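
A minimal sketch of the bug pattern, using a toy lock that merely counts unbalanced releases. It assumes that the error branch falls through to a common exit path that also releases the lock, which the patch implies but does not show; all names here are hypothetical and none of this is OpenSM code.

```c
#include <assert.h>

/* Toy lock that records unbalanced releases instead of misbehaving. */
struct lock_sketch {
	int held;          /* current hold count */
	int bad_releases;  /* releases made while the lock was not held */
};

void lock_acquire_sketch(struct lock_sketch *l)
{
	l->held++;
}

void lock_release_sketch(struct lock_sketch *l)
{
	if (l->held > 0)
		l->held--;
	else
		l->bad_releases++;
}

/* Before the fix: the "port not found" branch releases, then the exit does too. */
void process_buggy_sketch(struct lock_sketch *l, int port_found)
{
	lock_acquire_sketch(l);
	if (!port_found) {
		lock_release_sketch(l);  /* early release ... */
		goto out;
	}
	/* ... handle the received pkey table block ... */
out:
	lock_release_sketch(l);          /* ... and a second one here */
}

/* After the fix: the common exit path is the only place that releases. */
void process_fixed_sketch(struct lock_sketch *l, int port_found)
{
	lock_acquire_sketch(l);
	if (!port_found)
		goto out;
	/* ... handle the received pkey table block ... */
out:
	lock_release_sketch(l);
}

/* Usage: only the buggy variant produces an unbalanced release. */
void demo_double_release(void)
{
	struct lock_sketch a = { 0, 0 }, b = { 0, 0 };

	process_buggy_sketch(&a, 0);
	process_fixed_sketch(&b, 0);
	assert(a.bad_releases == 1);
	assert(b.bad_releases == 0);
}
```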

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_pkey_rcv.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/osm/opensm/osm_pkey_rcv.c b/osm/opensm/osm_pkey_rcv.c
index 3fc7673..3c18fcd 100644
--- a/osm/opensm/osm_pkey_rcv.c
+++ b/osm/opensm/osm_pkey_rcv.c
@@ -146,7 +146,6 @@ osm_pkey_rcv_process(
 
   if( p_port == (osm_port_t*)cl_qmap_end( p_guid_tbl) )
   {
-    cl_plock_release( p_rcv->p_lock );
     osm_log( p_rcv->p_log, OSM_LOG_ERROR,
              "osm_pkey_rcv_process: ERR 4806: "
              "No port object for port with GUID 0x%" PRIx64
@@ -219,4 +218,3 @@ osm_pkey_rcv_process(
 
   OSM_LOG_EXIT( p_rcv->p_log );
 }
-
-- 
1.4.4.2.gfc82d


From sashak at voltaire.com  Sun Dec 24 09:43:15 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Sun, 24 Dec 2006 19:43:15 +0200
Subject: [openib-general] [PATCH] opensm: clean old references on ports
	linking
Message-ID: <20061224174315.GC7111@sashak.voltaire.com>


When linking ports, clean up old remote references. Without this, the ports
remain accessible as "linked" from their old neighbors, and when ports move
and some MADs are lost or reordered, the OpenSM subnet data
structures become broken.
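
The stale reference can be reproduced with a toy model of two-way port links. The sketch assumes that linking sets the remote pointer on both sides; the struct and function names are hypothetical simplifications, not the OpenSM types.

```c
#include <assert.h>
#include <stddef.h>

/* Toy physical port: only the back-pointer to the remote port matters here. */
struct physp_sketch {
	struct physp_sketch *remote;
};

/* Assumed behavior of the low-level link step: set both directions. */
void physp_link_sketch(struct physp_sketch *a, struct physp_sketch *b)
{
	a->remote = b;
	b->remote = a;
}

/* Link with cleanup: old neighbors must stop pointing at the moved ports. */
void node_link_sketch(struct physp_sketch *a, struct physp_sketch *b)
{
	if (a->remote)
		a->remote->remote = NULL;
	if (b->remote)
		b->remote->remote = NULL;
	physp_link_sketch(a, b);
}

/* Usage: port A moves from neighbor B to neighbor C. */
void demo_relink(void)
{
	struct physp_sketch a = { NULL }, b = { NULL }, c = { NULL };

	node_link_sketch(&a, &b);
	node_link_sketch(&a, &c);
	assert(b.remote == NULL);  /* without the cleanup, B would still point at A */
	assert(a.remote == &c && c.remote == &a);
}
```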

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 osm/opensm/osm_node.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/osm/opensm/osm_node.c b/osm/opensm/osm_node.c
index 684eee6..6e72b58 100644
--- a/osm/opensm/osm_node.c
+++ b/osm/opensm/osm_node.c
@@ -195,6 +195,11 @@ osm_node_link(
   p_remote_physp =  osm_node_get_physp_ptr( p_remote_node,
                                             remote_port_num );
 
+  if (p_physp->p_remote_physp)
+    p_physp->p_remote_physp->p_remote_physp = NULL;
+  if (p_remote_physp->p_remote_physp)
+    p_remote_physp->p_remote_physp->p_remote_physp = NULL;
+
   osm_physp_link( p_physp, p_remote_physp );
 }
 
-- 
1.4.4.2.gfc82d


From eitan at mellanox.co.il  Sun Dec 24 10:39:06 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sun, 24 Dec 2006 20:39:06 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <1166967379.4519.320031.camel@hal.voltaire.com>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il>
	<1166967379.4519.320031.camel@hal.voltaire.com>
Message-ID: <458EC94A.2050808@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote:
>   
>> Hi Or,
>>
>> Sorry it took me a while.
>>
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component 
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between 
>> multiple matching MTU values if they exist.
>>     
>
> I agree in general but would like to be sure about the details. Please
> be specific as to what IBA spec text you are referring to.
>   
The text is part of the PathRecord table.
>   
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>>     
>
> Is M the lowest available MTU or the highest available MTU for that path?
>   
M is the lowest MTU reported by all PortInfo for ports on the path.
>   
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R 
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should  | OpenSM Quirk w. Tavor End Port
>> -----------------------------------------------------------------------------------------
>> UNDEFINED | UNDEFINED | <= M             | M              | min(M,1K)
>> R         | <         | <= min(R-1, M)   | min(R-1, M)    | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR   | R if M>=R /ERR | R if M>=R /ERR
>> R         | >         | R < <= M         | R+1 if M>R /ERR| R+1 if M>R /ERR
>>     
>                           ^^^^^^^^
> For the R> spec response column, I think you are saying the same as:
>                        >R AND <=M if M>R /ERR
>                          or
>                        R < x <=M if M>R /ERR
>                        where x is resp value
>   
Yes, that is what I mean: the response value MUST be both bigger than R 
and less than or equal to M. Otherwise it is an error.
> I agree with this table given the redefinition of M above and R > spec
> response interpretation.
>   
Good.
> -- Hal
>
>   
>> I have built some test code for making sure OpenSM does what is required.
>> Apparently it does not. In any case the M is not identical to R it fails 
>> the request.
>>
>> I am working on fixing OpenSM.
>>
>> Any comments are welcome.
>>
>> EZ
>>
>> Or Gerlitz wrote:
>>     
>>> Michael S. Tsirkin wrote:
>>>   
>>>       
>>>> I am not yet sure what is best for upstream, so I don't really think we need
>>>> any RFCs.
>>>>     
>>>>         
>>>   
>>>       
>>>> We'll need data from SM guys on whether MTU selector actually works
>>>> in SMs, and if not what happens when you enable it.
>>>>     
>>>>         
>>> Eitan,
>>>
>>> Can you please post here the tavor-quirk patch which was integrated into 
>>> opensm? i can see the ***code*** of the opensm but might make some wrong 
>>> assumptions or get into wrong understandings as i am not able to see the 
>>> patch as is.
>>>
>>> Or.


From mst at mellanox.co.il  Sun Dec 24 13:21:13 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 24 Dec 2006 23:21:13 +0200
Subject: [openib-general] [PATCH] mm: fix page_mkclean_one (was: 2.6.19
 file content corruption on ext3)
In-Reply-To: <Pine.LNX.4.64.0612241029460.3671@woody.osdl.org>
References: <Pine.LNX.4.64.0612241029460.3671@woody.osdl.org>
Message-ID: <20061224212113.GA31813@mellanox.co.il>

> Quoting Linus Torvalds <torvalds at osdl.org>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition 
> is a bug:
> 
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE
> 
> because that combination means that the hardware can mark the PTE dirty 
> without us even realizing (and thus not marking the "struct page *" 
> dirty).

Er.
Sorry about bumping in, and I'm not sure I understand all of the discussion,
but this reminded me of an old issue with COW that created what looks
like a vaguely similar data corruption on infiniband. We solved this for
infiniband with MADV_DONTFORK, but I always wondered why it does not affect
other parts of the kernel.  Small reminder from that discussion:

down mmap sem
get user pages
up mmap sem
page becomes shared, and COW (e.g. fork)
process writes to first byte of page <----- gets a copy
Now we have a problem: the struct page that we got from get user pages
no longer points to the correct page in our process.
For example: if at some point we map this page for DMA, and
hardware writes to last byte of page -----> process does not
see this data.

So for infiniband, what we do is a combination of
- prevent page from becoming COW while hardware might DMA to this page, and
- ask users not to write to page if hardware might DMA to same page
  (even if it's using different bytes).
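
For reference, the userspace side of the first point can be sketched as below: a buffer is allocated and marked with madvise(MADV_DONTFORK), so that a later fork() does not share it copy-on-write with the child. This is a Linux-only sketch; the helper name is hypothetical, and a real verbs library does considerably more bookkeeping around registered memory.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for MAP_ANONYMOUS and MADV_DONTFORK */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate an anonymous, page-aligned buffer that is not inherited across
 * fork(), so a fork cannot turn its pages into COW pages while hardware
 * may still DMA into them. Returns NULL on failure.
 */
void *alloc_dontfork_buffer_sketch(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Usage: the buffer is ordinary memory for the process that owns it. */
void demo_dontfork(void)
{
	char *p = alloc_dontfork_buffer_sketch(4096);

	assert(p != NULL);
	p[0] = 1;
	assert(p[0] == 1);
	munmap(p, 4096);
}
```

Note that a child created after the madvise() call simply does not have this range mapped, so it must not touch the buffer.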

I just wondered - is there some chance something like this could be happening in
the fs code?

HTH,

-- 
MST


From eitan at sw053.yok.mtl.com  Sun Dec 24 22:26:04 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Mon, 25 Dec 2006 08:26:04 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-25:normal completion
Message-ID: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40 
ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 MOD_FILES=2
Total=216 Pass=215 Fail=1

Pass:
24 Stability IS1-16.topo
24 Pkey IS1-16.topo
24 OsmTest IS1-16.topo
24 Multicast IS1-16.topo
24 LidMgr IS1-16.topo
23 OsmStress IS1-16.topo
8 Stability IS3-loop.topo
8 Stability IS3-128.topo
8 Pkey IS3-128.topo
8 OsmTest IS3-loop.topo
8 OsmTest IS3-128.topo
8 OsmStress IS3-128.topo
8 Multicast IS3-loop.topo
8 Multicast IS3-128.topo
8 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo


From eitan at mellanox.co.il  Sun Dec 24 22:32:44 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 25 Dec 2006 08:32:44 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-25:normal
 completion
In-Reply-To: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>
References: <200612250626.kBP6Q4Sp025341@sw053.yok.mtl.com>
Message-ID: <458F708C.9070008@mellanox.co.il>

The failing run is caused by a deadlock in the simulator that prevents 
ibdiagnet from starting.
I'm still looking for that deadlock.

Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40 
> ibutils rev = Sat_Dec_23_17:47:24_2006 2057e4 MOD_FILES=2
> Total=216 Pass=215 Fail=1
>
> Pass:
> 24 Stability IS1-16.topo
> 24 Pkey IS1-16.topo
> 24 OsmTest IS1-16.topo
> 24 Multicast IS1-16.topo
> 24 LidMgr IS1-16.topo
> 23 OsmStress IS1-16.topo
> 8 Stability IS3-loop.topo
> 8 Stability IS3-128.topo
> 8 Pkey IS3-128.topo
> 8 OsmTest IS3-loop.topo
> 8 OsmTest IS3-128.topo
> 8 OsmStress IS3-128.topo
> 8 Multicast IS3-loop.topo
> 8 Multicast IS3-128.topo
> 8 LidMgr IS3-128.topo
>
> Failures:
> 1 OsmStress IS1-16.topo


From jackm at dev.mellanox.co.il  Sun Dec 24 23:24:52 2006
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Mon, 25 Dec 2006 09:24:52 +0200
Subject: [openib-general] [PATCH] mthca: fix PRM compliance problem in
 atomic-send completions
Message-ID: <200612250924.52746.jackm@dev.mellanox.co.il>

For Atomic-Send completions, according to the Tavor and Arbel
PRMs, the number of bytes transferred is not provided in the byte_cnt
field of the cqe. For atomic operations, the number of bytes transferred
is always 8 (when the status is "success"), and this value should be
inserted by the driver in the ib_wc entry returned to the poller.
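
The rule described above can be sketched in isolation as follows. The types and opcode names are simplified, hypothetical stand-ins for the mthca structures, and the sketch ignores the big-endian conversion that the real driver applies to the cqe fields.

```c
#include <assert.h>
#include <stdint.h>

enum { ATOMIC_BYTE_LEN_SKETCH = 8 };  /* atomics always transfer 8 bytes */

enum opcode_sketch { OP_SEND_SKETCH, OP_ATOMIC_CS_SKETCH, OP_ATOMIC_FA_SKETCH };

struct cqe_sketch {
	enum opcode_sketch opcode;
	uint32_t byte_cnt;  /* not meaningful for atomic completions */
};

struct wc_sketch {
	enum opcode_sketch opcode;
	uint32_t byte_len;
};

/* Fill a work completion; atomics get a fixed length, others the cqe count. */
void fill_wc_sketch(const struct cqe_sketch *cqe, struct wc_sketch *wc)
{
	wc->opcode = cqe->opcode;
	switch (cqe->opcode) {
	case OP_ATOMIC_CS_SKETCH:
	case OP_ATOMIC_FA_SKETCH:
		wc->byte_len = ATOMIC_BYTE_LEN_SKETCH;
		break;
	default:
		wc->byte_len = cqe->byte_cnt;
		break;
	}
}

/* Usage: whatever the hardware left in byte_cnt is ignored for atomics. */
void demo_atomic_len(void)
{
	struct cqe_sketch cqe = { OP_ATOMIC_FA_SKETCH, 0xdeadbeefu };
	struct wc_sketch wc;

	fill_wc_sketch(&cqe, &wc);
	assert(wc.byte_len == 8);
}
```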

Signed-off-by: Jack Morgenstein <jackm at dev.mellanox.co.il>

diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 8293c15..8fd0e9e 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -54,6 +54,10 @@ enum {
 	MTHCA_CQ_ENTRY_SIZE = 0x20
 };
 
+enum {
+	MTHCA_ATOMIC_BYTE_LEN = 8
+};
+
 /*
  * Must be packed because start is 64 bits but only aligned to 32 bits.
  */
@@ -598,11 +602,11 @@ static inline int mthca_poll_one(struct mthca_dev *dev,
 			break;
 		case MTHCA_OPCODE_ATOMIC_CS:
 			entry->opcode    = IB_WC_COMP_SWAP;
-			entry->byte_len  = be32_to_cpu(cqe->byte_cnt);
+			entry->byte_len  = MTHCA_ATOMIC_BYTE_LEN;
 			break;
 		case MTHCA_OPCODE_ATOMIC_FA:
 			entry->opcode    = IB_WC_FETCH_ADD;
-			entry->byte_len  = be32_to_cpu(cqe->byte_cnt);
+			entry->byte_len  = MTHCA_ATOMIC_BYTE_LEN;
 			break;
 		case MTHCA_OPCODE_BIND_MW:
 			entry->opcode    = IB_WC_BIND_MW;


From ogerlitz at voltaire.com  Mon Dec 25 04:03:47 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 25 Dec 2006 14:03:47 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458E7402.4000106@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il>
Message-ID: <458FBE23.4090100@voltaire.com>

Eitan Zahavi wrote:
> According to the IBTA spec:
> 1. In order for MTU and MTUSelector to have any effect their component 
> mask bits MUST be set to 1 in the query
> 2. Behavior of the SM is defined with small "freedom" to choose between 
> multiple matching MTU values if they exist.
> 3. The table below summarizes all options:
> 
> Assuming the value M represents the lowest MTU on the path
> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R 
> and R+1 is one above R.
> 
> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should  | OpenSM Quirk w. Tavor End Port
> -----------------------------------------------------------------------------------------
> UNDEFINED | UNDEFINED | <= M             | M              | min(M,1K)
> R         | <         | <= min(R-1, M)   | min(R-1, M)    | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR   | R if M>=R /ERR | R if M>=R /ERR
> R         | >         | R < <= M         | R+1 if M>R /ERR| R+1 if M>R /ERR
> 
> I have built some test code to make sure OpenSM does what is required.
> Apparently it does not: whenever M is not identical to R, it fails 
> the request.
> 
> I am working on fixing OpenSM.
> 
> Any comments are welcome.
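
For readers who prefer code to tables, the "OpenSM Should" and "OpenSM Quirk" columns above can be sketched as a small C function. This is only an illustration, not OpenSM code: the enum names, select_mtu() and MTU_ERR are made up for this sketch, the selector values are not the wire encoding, and treating "less than the smallest MTU" as an error is an assumption the table does not spell out.

```c
#include <stdbool.h>

/* MTU levels, ordered so that "one level below/above" is -1/+1. */
enum mtu { MTU_256 = 1, MTU_512, MTU_1K, MTU_2K, MTU_4K };

/* Hypothetical selector names for this sketch (not the wire encoding). */
enum mtu_sel { SEL_UNDEFINED, SEL_LESS, SEL_EXACT, SEL_GREATER };

/* Hypothetical "request cannot be satisfied" marker. */
#define MTU_ERR 0

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * r: MTU in the request (ignored when sel is SEL_UNDEFINED)
 * m: lowest MTU on the path
 * tavor_quirk: apply the 1K cap from the last column of the table
 */
int select_mtu(enum mtu_sel sel, int r, int m, bool tavor_quirk)
{
	int cap = tavor_quirk ? MTU_1K : MTU_4K;

	switch (sel) {
	case SEL_UNDEFINED:
		return min_int(m, cap);
	case SEL_LESS:
		/* Assumption: nothing is below the smallest MTU, so fail. */
		if (r <= MTU_256)
			return MTU_ERR;
		return min_int(min_int(r - 1, m), cap);
	case SEL_EXACT:
		/* The table lists the same result with and without the quirk. */
		return m >= r ? r : MTU_ERR;
	case SEL_GREATER:
		return m > r ? r + 1 : MTU_ERR;
	}
	return MTU_ERR;
}
```

With such a function, the failing case described above (M not identical to R) would be checked by calling it with, say, a 2K request, the "<" selector and a 4K path, and expecting 1K rather than an error.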

OK Eitan, thanks for putting time into this; we will be able to 
provide feedback tomorrow or later this week.

Is the direction suggested by this matrix and the patch that follows 
orthogonal to the OpenSM tavor quirk patch?

Or.


From ogerlitz at voltaire.com  Mon Dec 25 04:25:22 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Mon, 25 Dec 2006 14:25:22 +0200
Subject: [openib-general] No resource tracking per qp for multicast
 groups
In-Reply-To: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
Message-ID: <458FC332.1010801@voltaire.com>

Jack Morgenstein wrote:
> I noticed that there is no per-qp tracking of multicast groups of which 
> it is a member. Thus, for example, if a user-space app dies without a 
> chance to perform its cleanup, the (nonexistent) QP will still be listed 
> in the HCA firmware as a member of its multicast groups.  This has 2 
> effects:
> 
> 1. The effective number of qp's which can join that multicast group has 
> been reduced (since it has zombie entries).
> 2. If the above QP gets re-used, it will still be a member of the 
> multicast groups (and therefore receive packets which were not intended 
> for the new user of the QP).
> 
> I suggest tracking mcast group membership in kernel-space only. If we 
> don't wish to change the verbs layer behavior, we can just detach a qp 
> from all its multicast groups (if any) in ib_destroy_qp (although this 
> is not IB compliant -- see IB Spec 11.2.4.4 (we should return an error 
> in this case)).  Otherwise, I think we'll need something messy (such as 
> an ib_verbs layer function requesting a QP to detach from all its 
> multicast groups).
> 
> My preference is to leave the verbs layer alone as much as possible.  
> Track the multicast group membership per qp (gid and lid) in struct 
> ib_qp, and make calls in ib_destroy_qp() to ib_mcast_detach().
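
The per-QP tracking idea quoted above can be illustrated with a small user-space model. Everything here is hypothetical (the demo_* names, the list layout, and the counter that stands in for HCA firmware state); it is not the ib_qp structure or the real verbs API, just a sketch of "remember each (gid, lid) on the QP and drop them all on destroy".

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One multicast membership remembered on the QP (hypothetical layout). */
struct demo_mcast_entry {
	uint8_t gid[16];
	uint16_t lid;
	struct demo_mcast_entry *next;
};

/* Stand-in for a QP; the real struct ib_qp has many more fields. */
struct demo_qp {
	struct demo_mcast_entry *mcast_list;
};

/* Stand-in for the memberships the HCA firmware still knows about. */
static int demo_hw_memberships;

int demo_hw_membership_count(void)
{
	return demo_hw_memberships;
}

int demo_mcast_attach(struct demo_qp *qp, const uint8_t gid[16], uint16_t lid)
{
	struct demo_mcast_entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	memcpy(e->gid, gid, sizeof(e->gid));
	e->lid = lid;
	e->next = qp->mcast_list;
	qp->mcast_list = e;
	demo_hw_memberships++;
	return 0;
}

int demo_mcast_detach(struct demo_qp *qp, const uint8_t gid[16], uint16_t lid)
{
	struct demo_mcast_entry **pp = &qp->mcast_list;

	while (*pp) {
		struct demo_mcast_entry *e = *pp;

		if (e->lid == lid && memcmp(e->gid, gid, sizeof(e->gid)) == 0) {
			*pp = e->next;
			free(e);
			demo_hw_memberships--;
			return 0;
		}
		pp = &e->next;
	}
	return -1;	/* QP was not a member of this group */
}

/* Destroy path: detach from every remaining group so no zombie is left. */
void demo_destroy_qp(struct demo_qp *qp)
{
	while (qp->mcast_list) {
		struct demo_mcast_entry *e = qp->mcast_list;

		qp->mcast_list = e->next;
		free(e);
		demo_hw_memberships--;
	}
}
```

In this model a process that dies without cleaning up is covered as long as the destroy path runs on its behalf, which is the point of keeping the list in kernel space.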

Jack, I have just come across this Nov 2005 post of yours...

As I understand it, the issues you describe here are orthogonal to 
Sean's multicast work, correct? Were they solved in mthca, or are they 
still open?

Or.


From mst at mellanox.co.il  Mon Dec 25 06:18:49 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 25 Dec 2006 16:18:49 +0200
Subject: [openib-general] [PATCH/RFC] libibverbs: Improve driver loading
In-Reply-To: <adahcwk1uzj.fsf@cisco.com>
References: <adahcwk1uzj.fsf@cisco.com>
Message-ID: <20061225141849.GI842@mellanox.co.il>


> OK, at last here is a patch that implements the improvements to
> libibverbs driver loading that we discussed back in October.
> 
> With this patch, instead of trying all the .so files in the
> $(libdir)/infiniband directory as libibverbs 1.0 does, libibverbs
> instead builds a list of drivers to load and dlopen() exactly that
> list of libraries.  It uses relative paths rather than absolute paths,
> so the linker uses the normal search path to find driver libraries.
> 
> (To get a list of drivers, libibverbs parses all the config files it
> finds in $(sysconfdir)/libibverbs.d and also looks at the environment
> variables RDMAV_DRIVERS and IBV_DRIVERS)
> 
> Then, instead of calling a specific entry point in the driver,
> libibverbs assumes the driver will call ibv_register_driver() from an
> __attribute__((constructor)) function.
> 
> This has a number of benefits:
>  - multiple drivers can be linked statically into an executable
>  - LD_LIBRARY_PATH can be used to manage which drivers to load
>  - different versions of the driver can be selected automagically at
>    runtime (eg i686/cmov on i386 distros)
> 
> I will post a libmthca patch to illustrate how driver libraries need
> to change to work with this new libibverbs method.

I think this looked good, and probably best to do before the next
major release.

Do you plan to merge this?
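
As a reminder of the mechanism under discussion, here is a minimal sketch of constructor-based registration. It assumes GCC-style __attribute__((constructor)); the registry and every demo_* name are hypothetical stand-ins, and the real ibv_register_driver() prototype is deliberately not reproduced here.

```c
#include <stddef.h>

#define DEMO_MAX_DRIVERS 8

/* Hypothetical per-driver init hook. */
typedef int (*demo_init_fn)(const char *sysfs_path);

struct demo_driver {
	const char *name;
	demo_init_fn init;
};

/* "Library" side: a table that drivers add themselves to. */
static struct demo_driver demo_drivers[DEMO_MAX_DRIVERS];
static size_t demo_num_drivers;

void demo_register_driver(const char *name, demo_init_fn init)
{
	if (demo_num_drivers < DEMO_MAX_DRIVERS) {
		demo_drivers[demo_num_drivers].name = name;
		demo_drivers[demo_num_drivers].init = init;
		demo_num_drivers++;
	}
}

size_t demo_driver_count(void)
{
	return demo_num_drivers;
}

const char *demo_driver_name(size_t i)
{
	return i < demo_num_drivers ? demo_drivers[i].name : NULL;
}

/* "Driver" side: normally a separate .so, or an object linked statically. */
static int demo_hca_init(const char *sysfs_path)
{
	(void)sysfs_path;	/* a real driver would probe the device here */
	return 0;
}

/* Runs when the object is loaded (or before main() if linked statically). */
__attribute__((constructor))
static void demo_hca_register(void)
{
	demo_register_driver("demo_hca", demo_hca_init);
}
```

Because registration happens at load time, the library needs no fixed entry-point symbol to look up, which is what allows several drivers to be linked statically into one executable.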

-- 
MST


From yosefe at voltaire.com  Mon Dec 25 07:29:43 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 25 Dec 2006 17:29:43 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
Message-ID: <458FEE67.2080003@voltaire.com>

Hello,
I've been testing the ofed 1.2 build from 
http://staging.openfabrics.org/builds/ 
<http://staging.openfabrics.org/build/> (latest.tgz versions, both user 
and kernel) and got compilation errors on ia64 and ppc64:

*ppc64:*

    make -w -C ip ip
    make[2]: Entering directory
    `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
    [ ... omitted text ... ]
    gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
    -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
    gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
    rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
    ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
    xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
    -lresolv -L../lib -lnetlink -lutil -o ip
    /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: skipping incompatible
    /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
    searching for -lnetlink
    /usr/bin/ld: cannot find -lnetlink
    collect2: ld returned 1 exit status
    make[2]: *** [ip] Error 1

possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
CFLAGS (= instead of +=)

*ia64:*

    make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
    obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
    gcc [ ... omitted text ... ] -c -o
    /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
    In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
    from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
    ‘ib_sg_dma_address’:
    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
    implicit declaration of function ‘sg_dma_address’
    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
    ‘ib_sg_dma_len’:
    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
    implicit declaration of function ‘sg_dma_len’
    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
    initialization from incompatible pointer type
    [ ... omitted text ... ]
    make: *** [kernel] Error 2


Yossi


From mst at mellanox.co.il  Mon Dec 25 07:46:54 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 25 Dec 2006 17:46:54 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <458FEE67.2080003@voltaire.com>
References: <458FEE67.2080003@voltaire.com>
Message-ID: <20061225154654.GG4741@mellanox.co.il>

> Quoting r. Yosef Etigin <yosefe at voltaire.com>:
> Subject: ofed 1.2 - compilation erros on ppc64 and ia64

Which distro are you testing on?

> Hello,
> I've been testing ofed 1.2 build from 
> http://staging.openfabrics.org/builds/ 
> <http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
> and kernel) and got compilation erros on: ia64, ppc64:
> 
> *ppc64:*
> 
>     make -w -C ip ip
>     make[2]: Entering directory
>     `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
>     [ ... omitted text ... ]
>     gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
>     -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
>     gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
>     rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
>     ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
>     xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
>     -lresolv -L../lib -lnetlink -lutil -o ip
>     /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
>     searching for -lnetlink
>     /usr/bin/ld: skipping incompatible
>     /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
>     searching for -lnetlink
>     /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
>     searching for -lnetlink
>     /usr/bin/ld: cannot find -lnetlink
>     collect2: ld returned 1 exit status
>     make[2]: *** [ip] Error 1
> 
> possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
> CFLAGS (= instead of +=)

Isn't this makefile part of iproute2?
Can you build iproute on this platform?
	
> *ia64:*
> 
>     make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
>     obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
>     gcc [ ... omitted text ... ] -c -o
>     /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>     /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>     In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>     from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>     /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>     ‘ib_sg_dma_address’:
>     /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>     implicit declaration of function ‘sg_dma_address’
>     /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>     ‘ib_sg_dma_len’:
>     /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>     implicit declaration of function ‘sg_dma_len’
>     /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>     /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>     initialization from incompatible pointer type
>     [ ... omitted text ... ]
>     make: *** [kernel] Error 2

Probably a distro-specific backport problem - check why sg_dma_len is not defined.
I see this in upstream 2.6.16:
	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)


-- 
MST


From yosefe at voltaire.com  Mon Dec 25 08:11:02 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 25 Dec 2006 18:11:02 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <20061225154654.GG4741@mellanox.co.il>
References: <458FEE67.2080003@voltaire.com>
	<20061225154654.GG4741@mellanox.co.il>
Message-ID: <458FF816.3010800@voltaire.com>

Michael S. Tsirkin wrote:

>>Quoting r. Yosef Etigin <yosefe at voltaire.com>:
>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
>>    
>>
>
>Which distro are you testing on?
>
>  
>
I am testing on sles10, both ia64 and ppc64.

>>Hello,
>>I've been testing ofed 1.2 build from 
>>http://staging.openfabrics.org/builds/ 
>><http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
>>and kernel) and got compilation erros on: ia64, ppc64:
>>
>>*ppc64:*
>>
>>    make -w -C ip ip
>>    make[2]: Entering directory
>>    `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
>>    [ ... omitted text ... ]
>>    gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
>>    -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
>>    gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
>>    rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
>>    ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
>>    xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
>>    -lresolv -L../lib -lnetlink -lutil -o ip
>>    /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
>>    searching for -lnetlink
>>    /usr/bin/ld: skipping incompatible
>>    /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
>>    searching for -lnetlink
>>    /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
>>    searching for -lnetlink
>>    /usr/bin/ld: cannot find -lnetlink
>>    collect2: ld returned 1 exit status
>>    make[2]: *** [ip] Error 1
>>
>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
>>CFLAGS (= instead of +=)
>>    
>>
>
>Isn't this makefile part of iproute2?
>Can you build iproute on this platform?
>  
>
This makefile is indeed part of iproute,
but it seems to build 32-bit object files for `iproute' during compilation,
and therefore the linker fails to find 64-bit libraries when linking `ip'.

>	
>  
>
>>*ia64:*
>>
>>    make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
>>    obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
>>    gcc [ ... omitted text ... ] -c -o
>>    /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>>    In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>>    from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>    ‘ib_sg_dma_address’:
>>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>>    implicit declaration of function ‘sg_dma_address’
>>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>    ‘ib_sg_dma_len’:
>>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>>    implicit declaration of function ‘sg_dma_len’
>>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>>    initialization from incompatible pointer type
>>    [ ... omitted text ... ]
>>    make: *** [kernel] Error 2
>>    
>>
>
>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
>I see this on upstream 2.6.16
>	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>  
>
I'm running this on ia64; `sg_dma_len' is not defined there, nor anywhere 
else in this file, but in:
        ./asm-ia64/pci.h:82:#define sg_dma_len(sg)    ((sg)->dma_length)


Yossi


From mst at mellanox.co.il  Mon Dec 25 08:24:23 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 25 Dec 2006 18:24:23 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <458FF816.3010800@voltaire.com>
References: <458FF816.3010800@voltaire.com>
Message-ID: <20061225162423.GI4741@mellanox.co.il>

Quoting r. Yosef Etigin <yosefe at voltaire.com>:
Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64

Michael S. Tsirkin wrote:

> >>Quoting r. Yosef Etigin <yosefe at voltaire.com>:
> >>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
> >>    
> >>
> >
> >Which distro are you testing on?
> >
> >  
> >
> I am testing on sles10, both ia64 and ppc64.
> 
> >>Hello,
> >>I've been testing ofed 1.2 build from 
> >>http://staging.openfabrics.org/builds/ 
> >><http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
> >>and kernel) and got compilation erros on: ia64, ppc64:
> >>
> >>*ppc64:*
> >>
> >>    make -w -C ip ip
> >>    make[2]: Entering directory
> >>    `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
> >>    [ ... omitted text ... ]
> >>    gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
> >>    -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
> >>    gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
> >>    rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
> >>    ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
> >>    xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
> >>    -lresolv -L../lib -lnetlink -lutil -o ip
> >>    /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
> >>    searching for -lnetlink
> >>    /usr/bin/ld: skipping incompatible
> >>    /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
> >>    searching for -lnetlink
> >>    /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
> >>    searching for -lnetlink
> >>    /usr/bin/ld: cannot find -lnetlink
> >>    collect2: ld returned 1 exit status
> >>    make[2]: *** [ip] Error 1
> >>
> >>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
> >>CFLAGS (= instead of +=)
> >>    
> >>
> >
> >Isn't this makefile part of iproute2?
> >Can you build iproute on this platform?
> >  
> >
> This makefile is indeed of iproute,
> but it seems to make 32-bit object files for `iproute' during compilation
> and therefore fails to find 64-bit during linkage of `ip'.

Will installing the 32 bit version of the library help?

> >	
> >  
> >
> >>*ia64:*
> >>
> >>    make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
> >>    obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
> >>    gcc [ ... omitted text ... ] -c -o
> >>    /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
> >>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
> >>    In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
> >>    from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
> >>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>    ‘ib_sg_dma_address’:
> >>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
> >>    implicit declaration of function ‘sg_dma_address’
> >>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>    ‘ib_sg_dma_len’:
> >>    /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
> >>    implicit declaration of function ‘sg_dma_len’
> >>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
> >>    /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
> >>    initialization from incompatible pointer type
> >>    [ ... omitted text ... ]
> >>    make: *** [kernel] Error 2
> >>    
> >>
> >
> >Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
> >I see this on upstream 2.6.16
> >	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
> >  
> >
> Im running this of ia64, `sg_dma_len' is not defined there, nor anywhere 
> else in this file, but in:
>         ./asm-ia64/pci.h:82:#define sg_dma_len(sg)    ((sg)->dma_length)
> 

I see, it's fixed in 2.6.20.
We need to do something about it in the backport then.

I wonder whether we can just put
#ifdef __ia64__
#define sg_dma_len(sg)          ((sg)->dma_length)
#endif

in kernel_addons/backports/2.6.16/include/asm/scatterlist.h

Also need to find out in which kernel this was fixed.
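
A possible variation on the snippet above, offered only as a sketch: guard on the macro itself instead of on __ia64__, and cover sg_dma_address as well, since the build log complains about both. The struct below is a mock so the fragment compiles on its own; the real struct scatterlist differs, and how this interacts with include order in the actual backport tree is untested.

```c
/* Mock of the two fields the accessors touch (not the kernel's struct). */
struct demo_scatterlist {
	unsigned long long dma_address;
	unsigned int dma_length;
};

/* Define each accessor only if the arch headers have not done so already. */
#ifndef sg_dma_address
#define sg_dma_address(sg)	((sg)->dma_address)
#endif

#ifndef sg_dma_len
#define sg_dma_len(sg)		((sg)->dma_length)
#endif

/* Example user: sum the DMA lengths of n mapped entries. */
unsigned int demo_total_dma_len(const struct demo_scatterlist *sg, int n)
{
	unsigned int total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += sg_dma_len(&sg[i]);
	return total;
}
```

An #ifndef guard has the advantage of not needing a per-architecture list, at the cost of depending on which header gets included first.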

-- 
MST


From yosefe at voltaire.com  Mon Dec 25 09:49:57 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Mon, 25 Dec 2006 19:49:57 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <20061225162423.GI4741@mellanox.co.il>
References: <458FF816.3010800@voltaire.com>
	<20061225162423.GI4741@mellanox.co.il>
Message-ID: <45900F45.50906@voltaire.com>

Michael S. Tsirkin wrote:
> Quoting r. Yosef Etigin <yosefe at voltaire.com>:
> Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64
> 
> Michael S. Tsirkin wrote:
> 
> 
>>>>Quoting r. Yosef Etigin <yosefe at voltaire.com>:
>>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
>>>>   
>>>>
>>>
>>>Which distro are you testing on?
>>>
>>> 
>>>
>>
>>I am testing on sles10, both ia64 and ppc64.
>>
>>
>>>>Hello,
>>>>I've been testing ofed 1.2 build from 
>>>>http://staging.openfabrics.org/builds/ 
>>>><http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
>>>>and kernel) and got compilation erros on: ia64, ppc64:
>>>>
>>>>*ppc64:*
>>>>
>>>>   make -w -C ip ip
>>>>   make[2]: Entering directory
>>>>   `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
>>>>   [ ... omitted text ... ]
>>>>   gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
>>>>   -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
>>>>   gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
>>>>   rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
>>>>   ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
>>>>   xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
>>>>   -lresolv -L../lib -lnetlink -lutil -o ip
>>>>   /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
>>>>   searching for -lnetlink
>>>>   /usr/bin/ld: skipping incompatible
>>>>   /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
>>>>   searching for -lnetlink
>>>>   /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
>>>>   searching for -lnetlink
>>>>   /usr/bin/ld: cannot find -lnetlink
>>>>   collect2: ld returned 1 exit status
>>>>   make[2]: *** [ip] Error 1
>>>>
>>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
>>>>CFLAGS (= instead of +=)
>>>>   
>>>>
>>>
>>>Isn't this makefile part of iproute2?
>>>Can you build iproute on this platform?
>>> 
>>>
>>
>>This makefile is indeed of iproute,
>>but it seems to make 32-bit object files for `iproute' during compilation
>>and therefore fails to find 64-bit during linkage of `ip'.
> 
> 
> Will installing the 32 bit version of the library help?
> 
>

I don't think so. The issue arose during compilation, since `iproute' 
was inconsistent in its use of -m64:
the iproute Makefile overrides any `CFLAGS' it might get from the top level, 
thus throwing `-m64' away, while LDFLAGS are not overridden.
Therefore, compilation is done in 32-bit while linking is done in 64-bit.

>>>	
>>> 
>>>
>>>
>>>>*ia64:*
>>>>
>>>>   make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
>>>>   obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
>>>>   gcc [ ... omitted text ... ] -c -o
>>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>>>>   In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>>>>   from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>>   ‘ib_sg_dma_address’:
>>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>>>>   implicit declaration of function ‘sg_dma_address’
>>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>>   ‘ib_sg_dma_len’:
>>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>>>>   implicit declaration of function ‘sg_dma_len’
>>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>>>>   initialization from incompatible pointer type
>>>>   [ ... omitted text ... ]
>>>>   make: *** [kernel] Error 2
>>>>   
>>>>
>>>
>>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
>>>I see this on upstream 2.6.16
>>>	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>>> 
>>>
>>
>>Im running this of ia64, `sg_dma_len' is not defined there, nor anywhere 
>>else in this file, but in:
>>        ./asm-ia64/pci.h:82:#define sg_dma_len(sg)    ((sg)->dma_length)
>>
> 
> 
> Isee, its fixed on 2.6.20.
> Need to do something about it in the backport then.
> 
> I wonder whether we can just put
> #ifdef __ia64__
> #define sg_dma_len(sg)          ((sg)->dma_length)
> #endif
> 
> in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
> 
> Also need tofind out in which kernel this was fixed.
> 

Looks like in all kernels up to 2.6.20 it was in `pci.h', so we need to 
backport to all previous versions.

Yossi


From eitan at mellanox.co.il  Mon Dec 25 11:51:33 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Mon, 25 Dec 2006 21:51:33 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458FBE23.4090100@voltaire.com>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il> <458FBE23.4090100@voltaire.com>
Message-ID: <45902BC5.4070107@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>   
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component 
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between 
>> multiple matching MTU values if they exist.
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R 
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec   | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>> ----------+-----------+----------------+-----------------+-------------------------------
>> UNDEFINED | UNDEFINED | <= M           | M               | min(M,1K)
>> R         | <         | <= min(R-1, M) | min(R-1, M)     | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR | R if M>=R /ERR  | R if M>=R /ERR
>> R         | >         | R < <= M       | R+1 if M>R /ERR | R+1 if M>R /ERR
>>
>> I have built some test code for making sure OpenSM does what is required.
>> Apparently it does not. In any case the M is not identical to R it fails 
>> the request.
>>
>> I am working on fixing OpenSM.
>>
>> Any comments are welcome.
>>     
>
> OK Eitan, thanks for putting the time on this, we will be able to 
> provide feedback tomorrow or later this week.
>
> Is the direction suggested by this matrix and patch that follows 
> orthogonal to the open-sm tavor quirk patch?
>   
The table above has a column named "OpenSM Quirk" which describes the 
expected result of the tavor quirk patch.
If that is not the outcome of that patch, it should be fixed. I am not 
proposing a new type of behavior, just fixing the existing one.
> Or.
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From hudaslalt at millic.com.ar  Mon Dec 25 14:02:58 2006
From: hudaslalt at millic.com.ar (Diana Furtado)
Date: Mon, 25 Dec 2006 21:02:58 -0100
Subject: [openib-general] Que tengas una feliz navidad
Message-ID: <9457164.6GG63504j@ciudad.com.ar>


We have reached the end of the year
It's time to reflect,
to take stock of our professional achievements
and of our goals and objectives

Did you manage to meet your work goals this year?
If the answer is NO... 
it's no surprise...
few people manage to within the traditional employment model

Do you feel there is a ceiling at your job that you will never be able to grow beyond?
It happens often

Are your vacation days imposed on you?
It tends to happen

Don't you get as much time as you'd like with your family?
It's common

I suffered through all of those things, and many more, until I said ENOUGH
I began looking for an alternative way of working. 
My search was not easy, but I managed to find a serious company
that allowed me to fire my boss, work from home and spend more time with my son.

I no longer live worrying about whether I will still have my job next month
or whether my EX-boss likes me or not

I no longer have to take 2 buses there and 2 back every day
Only now do I realize how much of my life I was wasting commuting
Now my workplace is at home. Great, isn't it?

It changed my life radically in just 10 months
because I earn almost double what I earned working for an employer (and I work half the hours I used to)

If you are going through what I was going through
I can help by showing you what I do

Who said all is lost?
I made the change at the beginning of 2006 and I'm telling you about my experience

2007 can mark your change

send me an email at produccion_en_argent at fullzero.com.ar
and put the phrase " quie-ro mas infor-macion" in the subject line

I wish you a happy new year

Diana Furtado
If you know someone who might be interested, forward them this email

A big hug for you and your family


From mst at mellanox.co.il  Mon Dec 25 15:00:51 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 01:00:51 +0200
Subject: [openib-general] ib_dma_addr_t
Message-ID: <20061225230051.GG17469@mellanox.co.il>

I'd like to propose that we introduce ib_dma_addr_t.
The idea is to add some type safety (via sparse checker)
that we lost when all addresses were converted to u64.

How does it sound?
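
To make the proposal concrete, here is one way such a type could look. This is a guess at the shape, not a patch: the DEMO_* macros and demo_* names are hypothetical and mirror the way sparse's bitwise and force attributes are usually wrapped so that they only take effect when the checker defines __CHECKER__. With a regular compiler the type is a plain 64-bit integer.

```c
#include <stdint.h>

/* Under sparse (__CHECKER__) the attributes are active; otherwise no-ops. */
#ifdef __CHECKER__
#define DEMO_BITWISE	__attribute__((bitwise))
#define DEMO_FORCE	__attribute__((force))
#else
#define DEMO_BITWISE
#define DEMO_FORCE
#endif

/* Distinct type for DMA addresses; sparse flags mixing it with plain u64. */
typedef uint64_t DEMO_BITWISE demo_ib_dma_addr_t;

/* The only places where the conversion is spelled out explicitly. */
static inline demo_ib_dma_addr_t demo_to_dma_addr(uint64_t raw)
{
	return (DEMO_FORCE demo_ib_dma_addr_t)raw;
}

static inline uint64_t demo_dma_addr_raw(demo_ib_dma_addr_t addr)
{
	return (DEMO_FORCE uint64_t)addr;
}

/* Hypothetical consumer: a scatter/gather element carrying a DMA address. */
struct demo_sge {
	demo_ib_dma_addr_t addr;
	uint32_t length;
};

void demo_fill_sge(struct demo_sge *sge, demo_ib_dma_addr_t addr,
		   uint32_t length)
{
	sge->addr = addr;
	sge->length = length;
}
```

Passing a bare u64 (for example a kernel virtual address) to demo_fill_sge() would then draw a sparse warning, while generated code stays unchanged.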

-- 
MST


From mst at mellanox.co.il  Mon Dec 25 15:46:50 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 01:46:50 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <45900F45.50906@voltaire.com>
References: <45900F45.50906@voltaire.com>
Message-ID: <20061225234648.GJ17469@mellanox.co.il>

> > Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64
> > 
> > Michael S. Tsirkin wrote:
> > 
> > 
> >>>>Quoting r. Yosef Etigin <yosefe at voltaire.com>:
> >>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
> >>>>   
> >>>>
> >>>
> >>>Which distro are you testing on?
> >>>
> >>> 
> >>>
> >>
> >>I am testing on sles10, both ia64 and ppc64.
> >>
> >>
> >>>>Hello,
> >>>>I've been testing ofed 1.2 build from 
> >>>>http://staging.openfabrics.org/builds/ 
> >>>><http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
> >>>>and kernel) and got compilation erros on: ia64, ppc64:
> >>>>
> >>>>*ppc64:*
> >>>>
> >>>>   make -w -C ip ip
> >>>>   make[2]: Entering directory
> >>>>   `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
> >>>>   [ ... omitted text ... ]
> >>>>   gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
> >>>>   -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
> >>>>   gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
> >>>>   rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
> >>>>   ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
> >>>>   xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
> >>>>   -lresolv -L../lib -lnetlink -lutil -o ip
> >>>>   /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
> >>>>   searching for -lnetlink
> >>>>   /usr/bin/ld: skipping incompatible
> >>>>   /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
> >>>>   searching for -lnetlink
> >>>>   /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
> >>>>   searching for -lnetlink
> >>>>   /usr/bin/ld: cannot find -lnetlink
> >>>>   collect2: ld returned 1 exit status
> >>>>   make[2]: *** [ip] Error 1
> >>>>
> >>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
> >>>>CFLAGS (= instead of +=)
> >>>>   
> >>>>
> >>>
> >>>Isn't this makefile part of iproute2?
> >>>Can you build iproute on this platform?
> >>> 
> >>>
> >>
> >>This makefile is indeed of iproute,
> >>but it seems to make 32-bit object files for `iproute' during compilation
> >>and therefore fails to find 64-bit during linkage of `ip'.
> > 
> > 
> > Will installing the 32 bit version of the library help?
> > 
> >
> 
> I dont think so.. the issue arised during compilation, since `iproute' 
> was inconsinsten in its use of -m64:
> The iproute Makefile overrides any `CFLAGS' it might get from top-level, 
> thus throwing `-m64' away, while LDFLAGS are not overriden.
> Therefore, the compilation is done in 32bit while the linkage in 64bit

Probably the easiest thing is to fix iproute. Patch?

> >>>	
> >>> 
> >>>
> >>>
> >>>>*ia64:*
> >>>>
> >>>>   make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
> >>>>   obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
> >>>>   gcc [ ... omitted text ... ] -c -o
> >>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
> >>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
> >>>>   In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
> >>>>   from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
> >>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>>>   ‘ib_sg_dma_address’:
> >>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
> >>>>   implicit declaration of function ‘sg_dma_address’
> >>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
> >>>>   ‘ib_sg_dma_len’:
> >>>>   /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
> >>>>   implicit declaration of function ‘sg_dma_len’
> >>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
> >>>>   /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
> >>>>   initialization from incompatible pointer type
> >>>>   [ ... omitted text ... ]
> >>>>   make: *** [kernel] Error 2
> >>>>   
> >>>>
> >>>
> >>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
> >>>I see this on upstream 2.6.16
> >>>	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
> >>> 
> >>>
> >>
> >>I'm running this on ia64; `sg_dma_len' is not defined there, nor anywhere 
> >>else in this file, but in:
> >>        ./asm-ia64/pci.h:82:#define sg_dma_len(sg)    ((sg)->dma_length)
> >>
> > 
> > 
> > I see, it's fixed in 2.6.20.
> > Need to do something about it in the backport then.
> > 
> > I wonder whether we can just put
> > #ifdef __ia64__
> > #define sg_dma_len(sg)          ((sg)->dma_length)
> > #endif
> > 
> > in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
> > 
> > Also need to find out in which kernel this was fixed.
> > 
> 
> Looks like in all kernels up to 2.6.20 it was in `pci.h', so we need to 
> backport to... all previous versions

Right. Try sticking this in kernel_addons/backports/2.6.20 and
copying it over.
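
The backport idea discussed above can be sketched in plain C as follows.
This is only an illustration, not the patch that went into the tree: the
struct below is a hypothetical stand-in for the kernel's struct
scatterlist (the dma_address field name is an assumption), and the sketch
guards on whether the macro already exists rather than on __ia64__, so
that it does not clash with a header that already provides the accessor.
demo_sg_dma_len is a made-up name that mirrors the shape of the
ib_sg_dma_len wrapper named in the build log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct scatterlist; only the two
 * DMA fields used by the accessors are modelled here. */
struct scatterlist {
    uint64_t     dma_address;
    unsigned int dma_length;
};

/* Provide each accessor only if the platform headers did not already
 * define it, so the backport header is harmless where it is not needed. */
#ifndef sg_dma_address
#define sg_dma_address(sg) ((sg)->dma_address)
#endif

#ifndef sg_dma_len
#define sg_dma_len(sg)     ((sg)->dma_length)
#endif

/* Illustrative wrapper with the same shape as ib_sg_dma_len(). */
static inline unsigned int demo_sg_dma_len(const struct scatterlist *sg)
{
    assert(sg != NULL);
    return sg_dma_len(sg);
}
```

The #ifndef guard trades precision for safety: it cannot tell which
architecture is missing the macro, but it never redefines an existing one.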

-- 
MST


From eitan at sw053.yok.mtl.com  Mon Dec 25 21:08:10 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Tue, 26 Dec 2006 07:08:10 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-26:normal completion
Message-ID: <200612260508.kBQ58Afr019644@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Sun_Dec_24_08:19:04_2006 ef4b40 
ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b 
Total=351 Pass=350 Fail=1

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
38 OsmStress IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo

Failures:
1 OsmStress IS1-16.topo


From mst at mellanox.co.il  Tue Dec 26 00:51:44 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 10:51:44 +0200
Subject: [openib-general] [PATCH/RFC] libibverbs: Improve driver loading
In-Reply-To: <20061225141849.GI842@mellanox.co.il>
References: <adahcwk1uzj.fsf@cisco.com> <20061225141849.GI842@mellanox.co.il>
Message-ID: <20061226085144.GA4325@mellanox.co.il>

> > (To get a list of drivers, libibverbs parses all the config files it
> > finds in $(sysconfdir)/libibverbs.d and also looks at the environment
> > variables RDMAV_DRIVERS and IBV_DRIVERS)
> > 
> > Then, instead of calling a specific entry point in the driver,
> > libibverbs assumes the driver will call ibv_register_driver() from an
> > __attribute__((constructor)) function.
> > 
> > This has a number of benefits:
> >  - multiple drivers can be linked statically into an executable
> >  - LD_LIBRARY_PATH can be used to manage which drivers to load
> >  - different versions of the driver can be selected automagically at
> >    runtime (eg i686/cmov on i386 distros)
> 

Wrt static linking: I see this warning when I link with -static:
: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
And it actually crashes inside dlopen on some platforms.

Would it be possible to add a configuration option to avoid using dlopen
for static apps? Or, maybe, it makes more sense to make an empty stub for libdl,
and ask apps to link with that?
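
For readers unfamiliar with the constructor-based registration described
in the quoted proposal, here is a minimal sketch of the pattern. All
demo_* names are hypothetical; demo_register_driver only stands in for
ibv_register_driver, whose real signature is not shown in this thread.
The sketch assumes GCC or Clang, since __attribute__((constructor)) is a
compiler extension.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DRIVERS 8

typedef int (*demo_driver_init_fn)(void);

static const char *demo_driver_names[DEMO_MAX_DRIVERS];
static demo_driver_init_fn demo_driver_inits[DEMO_MAX_DRIVERS];
static size_t demo_driver_count;

/* Hypothetical stand-in for ibv_register_driver(): remember the driver's
 * name and init hook so the library can call the hook later. */
void demo_register_driver(const char *name, demo_driver_init_fn init)
{
    assert(demo_driver_count < DEMO_MAX_DRIVERS);
    demo_driver_names[demo_driver_count] = name;
    demo_driver_inits[demo_driver_count] = init;
    demo_driver_count++;
}

/* A driver's init hook; a real one would probe for its hardware. */
static int demo_driver_init(void)
{
    return 0;
}

/* Runs when the object is loaded, before main(). This works the same way
 * whether the driver is a shared object or linked statically. */
static __attribute__((constructor)) void demo_driver_register(void)
{
    demo_register_driver("demo_driver", demo_driver_init);
}

size_t demo_registered_driver_count(void)
{
    return demo_driver_count;
}

const char *demo_registered_driver_name(size_t i)
{
    return i < demo_driver_count ? demo_driver_names[i] : NULL;
}
```

With this pattern a statically linked driver registers itself without any
dlopen call, which is the case the warning above is about.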


-- 
MST


From mst at mellanox.co.il  Tue Dec 26 00:53:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 10:53:11 +0200
Subject: [openib-general] libsdp.conf placement
Message-ID: <20061226085311.GB4325@mellanox.co.il>

I noticed autotools have a sysconfdir variable.
So it seems to me this would be the best, standard, place to keep the
libsdp.conf file.

Eitan?

-- 
Michael S. Tsirkin - Staff Engineer, Mellanox Technologies Ltd.


From eitan at mellanox.co.il  Tue Dec 26 03:49:12 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 13:49:12 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <20061226085311.GB4325@mellanox.co.il>
References: <20061226085311.GB4325@mellanox.co.il>
Message-ID: <45910C38.1020606@mellanox.co.il>

Michael S. Tsirkin wrote:
> I noticed autotools have sysconfdir variable.
> So it seems to me this would be the best, standard, place to keep the
> libsdp.conf file.
>
> Eitan?
>
>   
Unfortunately autotools are not doing the right thing.
Quoting from libsdp Makefile.am:
AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"

And then internally in the port.c code:
#define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"

Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
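
To make the mechanism concrete, here is a small sketch of how the two
quoted lines combine. The fallback value "/usr/local/etc" is an assumption
standing in for what the build passes via -DSYSCONFDIR when configure is
run with no options (autoconf's default prefix is /usr/local), and
demo_default_config_file is a made-up helper, not a libsdp function.

```c
#include <assert.h>
#include <string.h>

/* Normally supplied by the build as -DSYSCONFDIR=\"$(sysconfdir)\".
 * The fallback below is an assumed default for illustration only. */
#ifndef SYSCONFDIR
#define SYSCONFDIR "/usr/local/etc"
#endif

/* Same definition as quoted from port.c. */
#define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"

/* Hypothetical helper returning the compiled-in default path. */
const char *demo_default_config_file(void)
{
    static const char path[] = LIBSDP_DEFAULT_CONFIG_FILE;

    /* The macro expands to two adjacent string literals, which the
     * compiler joins into one string at compile time. */
    assert(strlen(path) == strlen(SYSCONFDIR) + strlen("/libsdp.conf"));
    return path;
}
```

Because the path is fixed at compile time, changing it means rebuilding
with a different sysconfdir rather than editing anything at run time.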


EZ


From mst at mellanox.co.il  Tue Dec 26 03:53:09 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 13:53:09 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <45910C38.1020606@mellanox.co.il>
References: <45910C38.1020606@mellanox.co.il>
Message-ID: <20061226115309.GE4325@mellanox.co.il>


> > I noticed autotools have sysconfdir variable.
> > So it seems to me this would be the best, standard, place to keep the
> > libsdp.conf file.
> >
> > Eitan?
> >
> >   
> Unfortunately autotools are not doing the right thing.
> Quoting from libsdp Makefile.am:
> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
> 
> And then internally in the port.c code:
> #define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"
> 
> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir

So, that's what all other libraries that use autotools will get
(e.g. libibverbs) and that's the best default place then.

If we want to, OFED can override sysconfdir with a configure switch, can it not?

-- 
MST


From eitan at mellanox.co.il  Tue Dec 26 03:57:26 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 13:57:26 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <20061226115309.GE4325@mellanox.co.il>
References: <45910C38.1020606@mellanox.co.il>
	<20061226115309.GE4325@mellanox.co.il>
Message-ID: <45910E26.70002@mellanox.co.il>

Michael S. Tsirkin wrote:
>>> I noticed autotools have sysconfdir variable.
>>> So it seems to me this would be the best, standard, place to keep the
>>> libsdp.conf file.
>>>
>>> Eitan?
>>>
>>>   
>>>       
>> Unfortunately autotools are not doing the right thing.
>> Quoting from libsdp Makefile.am:
>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
>>
>> And then internally in the port.c code:
>> #define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"
>>
>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
>>     
>
> So, that's what all other libraries that use autotools will get
> (e.g. libibverbs) and that's the best default place then.
>
> If we want to, OFED can override sysconfdir with a configure switch, can it not?
>   
Yes, it can, but some people might want to upgrade just libsdp. For those 
I would prefer a more reasonable sysconfdir than $prefix/etc.

EZ


From mst at mellanox.co.il  Tue Dec 26 04:06:00 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 14:06:00 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <45910E26.70002@mellanox.co.il>
References: <45910E26.70002@mellanox.co.il>
Message-ID: <20061226120600.GF4325@mellanox.co.il>

> >>> I noticed autotools have sysconfdir variable.
> >>> So it seems to me this would be the best, standard, place to keep the
> >>> libsdp.conf file.
> >>>
> >>> Eitan?
> >>>
> >>>   
> >>>       
> >> Unfortunately autotools are not doing the right thing.
> >> Quoting from libsdp Makefile.am:
> >> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
> >>
> >> And then internally in the port.c code:
> >> #define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"
> >>
> >> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
> >>     
> >
> > So, that's what all other libraries that use autotools will get
> > (e.g. libibverbs) and that's the best default place then.
> >
> > If we want to, OFED can override sysconfdir with a configure switch, can it not?
> >   
> Yes it can but some people might want to upgrade just libsdp. For those 
> I would preferably use a more reasonable sysconfig then $prefix/etc

I think these people can use a configure switch, too (updating
just libsdp without OFED needs playing with configure switches anyway,
because of all the 64/32 bit situation).

My point is, let's not mess with the defaults unless strictly necessary -
otherwise libibverbs config is in one place, and libsdp is in another,
and it's a mess.

-- 
MST


From eitan at mellanox.co.il  Tue Dec 26 04:10:58 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 14:10:58 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <20061226120600.GF4325@mellanox.co.il>
References: <45910E26.70002@mellanox.co.il>
	<20061226120600.GF4325@mellanox.co.il>
Message-ID: <45911152.1000008@mellanox.co.il>

Michael S. Tsirkin wrote:
>>>>> I noticed autotools have sysconfdir variable.
>>>>> So it seems to me this would be the best, standard, place to keep the
>>>>> libsdp.conf file.
>>>>>
>>>>> Eitan?
>>>>>
>>>>>   
>>>>>       
>>>>>           
>>>> Unfortunately autotools are not doing the right thing.
>>>> Quoting from libsdp Makefile.am:
>>>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
>>>>
>>>> And then internally in the port.c code:
>>>> #define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"
>>>>
>>>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
>>>>     
>>>>         
>>> So, that's what all other libraries that use autotools will get
>>> (e.g. libibverbs) and that's the best default place then.
>>>
>>> If we want to, OFED can override sysconfdir with a configure switch, can it not?
>>>   
>>>       
>> Yes it can but some people might want to upgrade just libsdp. For those 
>> I would preferably use a more reasonable sysconfig then $prefix/etc
>>     
>
> I think these people can use a configure switch, too (updating
> just libsdp without OFED needs playing with configure switches anyway,
> because of all the 64/32 bit situation).
>
> My point is, let's not mess with the defaults unless strictly necessary -
> otherwise libibverbs config is in one place, and libsdp is in another,
> and its a mess.
>
>   
RPM making should use the --sysconfdir option for configure.
Still the default is broken. I will probably find a way to fix that .. 
one day.

EZ


From mst at mellanox.co.il  Tue Dec 26 04:19:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 14:19:11 +0200
Subject: [openib-general] libsdp.conf placement
In-Reply-To: <45911152.1000008@mellanox.co.il>
References: <45911152.1000008@mellanox.co.il>
Message-ID: <20061226121911.GG4325@mellanox.co.il>

> >>>>> I noticed autotools have sysconfdir variable.
> >>>>> So it seems to me this would be the best, standard, place to keep the
> >>>>> libsdp.conf file.
> >>>>>
> >>>>> Eitan?
> >>>>>
> >>>>>   
> >>>>>       
> >>>>>           
> >>>> Unfortunately autotools are not doing the right thing.
> >>>> Quoting from libsdp Makefile.am:
> >>>> AM_CFLAGS = -Wall -DSYSCONFDIR=\"$(sysconfdir)\"
> >>>>
> >>>> And then internally in the port.c code:
> >>>> #define LIBSDP_DEFAULT_CONFIG_FILE  SYSCONFDIR "/libsdp.conf"
> >>>>
> >>>> Somehow when you run ./configure you get $prefix/etc as the $sysconfdir
> >>>>     
> >>>>         
> >>> So, that's what all other libraries that use autotools will get
> >>> (e.g. libibverbs) and that's the best default place then.
> >>>
> >>> If we want to, OFED can override sysconfdir with a configure switch, can it not?
> >>>   
> >>>       
> >> Yes it can but some people might want to upgrade just libsdp. For those 
> >> I would preferably use a more reasonable sysconfig then $prefix/etc
> >>     
> >
> > I think these people can use a configure switch, too (updating
> > just libsdp without OFED needs playing with configure switches anyway,
> > because of all the 64/32 bit situation).
> >
> > My point is, let's not mess with the defaults unless strictly necessary -
> > otherwise libibverbs config is in one place, and libsdp is in another,
> > and its a mess.
> >
> >   

So we are in agreement that libsdp will put its config file in $sysconfdir,
and let packagers change where it points to?

> RPM making should use the --sysconfdir option for configure.

OK, but if so it should do so for all libraries, not just libsdp. Right?

> Still the default is broken.

Looks like a matter of taste. What is important is to keep it consistent across
all libraries in OFED.

> I will probably find a way to fix that .. 
> one day.

But for now, it defaults to $prefix/etc and if we want, OFED will override that
as appropriate?

-- 
MST


From yosefe at voltaire.com  Tue Dec 26 07:58:36 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Tue, 26 Dec 2006 17:58:36 +0200
Subject: [openib-general] [PATCH] [MINOR] ipoibtools: fix compilation errors
	on ppc64
Message-ID: <1167148716.7006.17.camel@muscida>

Fix compilation errors of ipoibtools on ppc64 caused by 
overriding CFLAGS in the Makefile.

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

---
diff -ur a/src/userspace/ipoibtools/iproute2/Makefile b/src/userspace/ipoibtools/iproute2/Makefile
--- a/src/userspace/ipoibtools/iproute2/Makefile	2006-12-25 16:18:43.000000000 +0200
+++ b/src/userspace/ipoibtools/iproute2/Makefile	2006-12-25 15:54:40.000000000 +0200
@@ -22,7 +22,7 @@
 CC = gcc
 HOSTCC = gcc
 CCOPTS = -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall
-CFLAGS = $(CCOPTS) -I../include $(DEFINES)
+CFLAGS += $(CCOPTS) -I../include $(DEFINES)
 YACCFLAGS = -d -t -v
 
 LDLIBS += -L../lib -lnetlink -lutil

--
Yosef Etigin
Voltaire


From halr at voltaire.com  Tue Dec 26 09:27:40 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:27:40 -0500
Subject: [openib-general] [PATCH TRIVIAL] opensm: remove unused local
	variable
In-Reply-To: <20061224170248.GA7111@sashak.voltaire.com>
References: <20061224170248.GA7111@sashak.voltaire.com>
Message-ID: <1167154058.29620.1725.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:02, Sasha Khapyorsky wrote: 
> Remove unused local variable.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Tue Dec 26 09:28:03 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:03 -0500
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
In-Reply-To: <20061224170329.GB7111@sashak.voltaire.com>
References: <20061224170329.GB7111@sashak.voltaire.com>
Message-ID: <1167154064.29620.1727.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote: 
> When the port is removed from subnet, but previously requested pkey
> table block is received after this - the lock will be released twice.
> This leads to deadlocks later when other MAD processor will try to
> acquire the same lock.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Tue Dec 26 09:28:08 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:08 -0500
Subject: [openib-general] [PATCH] opensm: clean old references on ports
	linking
In-Reply-To: <20061224174315.GC7111@sashak.voltaire.com>
References: <20061224174315.GC7111@sashak.voltaire.com>
Message-ID: <1167154069.29620.1729.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote:
> When linking ports, clean up old remote references. Without this the ports
> are still accessible as "linked" from old neighbors, and in case of ports
> moving, when some MADs can be lost or reordered, OpenSM subnet data
> structures become broken.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Good catch.

Thanks. Applied.

-- Hal


From halr at voltaire.com  Tue Dec 26 09:28:12 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:12 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <458E7532.5030400@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il>
Message-ID: <1167154075.29620.1731.camel@hal.voltaire.com>

Hi Eitan,

On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote:
> Hi Hal,
> 
> OpenSM just uses the resulting path MTU/rate/pkt-life and fails the
> query even though the selector might allow selecting an
> appropriate value.
> 
> I have made the attached ibis based program for testing MTU select.
> 
> After this fix the following results are obtained for a path
> allowing a maximal MTU of 2K.
> 
> In standard mode:
> ------------------------------------------------------------
> MTU greater then ... 256     (0x01) ->  equal to ....... 2K
> MTU less then ...... 256     (0x41) ->  NO PATHS
> MTU equal to ....... 256     (0x81) ->  equal to ....... 256
> MTU largest possible 256     (0xc1) ->  equal to ....... 2K
> MTU greater then ... 512     (0x02) ->  equal to ....... 2K
> MTU less then ...... 512     (0x42) ->  equal to ....... 256
> MTU equal to ....... 512     (0x82) ->  equal to ....... 512
> MTU largest possible 512     (0xc2) ->  equal to ....... 2K
> MTU greater then ... 1K      (0x03) ->  equal to ....... 2K
> MTU less then ...... 1K      (0x43) ->  equal to ....... 512
> MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
> MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
> MTU greater then ... 2K      (0x04) ->  NO PATHS
> MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
> MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
> MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
> MTU greater then ... 4K      (0x05) ->  NO PATHS
> MTU less then ...... 4K      (0x45) ->  equal to ....... 2K
> MTU equal to ....... 4K      (0x85) ->  NO PATHS
> MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
> ============================================================
> 
> With enable_quirks (when one of the ends is a Tavor device):
> ------------------------------------------------------------
> MTU greater then ... 256     (0x01) ->  equal to ....... 1K
> MTU less then ...... 256     (0x41) ->  NO PATHS
> MTU equal to ....... 256     (0x81) ->  equal to ....... 256
> MTU largest possible 256     (0xc1) ->  equal to ....... 2K
> MTU greater then ... 512     (0x02) ->  equal to ....... 1K
> MTU less then ...... 512     (0x42) ->  equal to ....... 256
> MTU equal to ....... 512     (0x82) ->  equal to ....... 512
> MTU largest possible 512     (0xc2) ->  equal to ....... 2K
> MTU greater then ... 1K      (0x03) ->  NO PATHS
> MTU less then ...... 1K      (0x43) ->  equal to ....... 512
> MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
> MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
> MTU greater then ... 2K      (0x04) ->  NO PATHS
> MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
> MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
> MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
> MTU greater then ... 4K      (0x05) ->  NO PATHS
> MTU less then ...... 4K      (0x45) ->  equal to ....... 1K
> MTU equal to ....... 4K      (0x85) ->  NO PATHS
> MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
> ============================================================
> 
> Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>

Thanks. Applied. Note osm_sa_multipath_record.c had 2 rejected hunks
which were applied by hand.

-- Hal
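
The two result listings quoted above can be reproduced by a small
function. The following is a sketch inferred only from those listings (a
path whose MTU is 2K); it is not the OpenSM code from the patch, and all
demo_* names are hypothetical. It assumes the request byte carries the
selector in its top two bits and the MTU code in its low six bits, which
matches the hex values printed in the listings (0x01, 0x41, 0x81, 0xc1).

```c
#include <assert.h>

/* MTU codes as printed in the listings: 1 = 256 ... 5 = 4K; 0 = no path. */
enum demo_mtu {
    DEMO_MTU_NONE = 0,
    DEMO_MTU_256  = 1,
    DEMO_MTU_512  = 2,
    DEMO_MTU_1K   = 3,
    DEMO_MTU_2K   = 4,
    DEMO_MTU_4K   = 5
};

/* Selector values taken from the top two bits of the request byte. */
enum demo_sel {
    DEMO_SEL_GREATER = 0,
    DEMO_SEL_LESS    = 1,
    DEMO_SEL_EQUAL   = 2,
    DEMO_SEL_LARGEST = 3
};

static int demo_min(int a, int b)
{
    return a < b ? a : b;
}

/* Returns the MTU code to report, or DEMO_MTU_NONE for "NO PATHS".
 * tavor_quirk caps the "greater"/"less" answers at 1K, as in the second
 * listing; "equal" and "largest" are unaffected there. */
int demo_select_mtu(unsigned char req, int path_mtu, int tavor_quirk)
{
    int sel = req >> 6;
    int r = req & 0x3f;
    int cap = tavor_quirk ? demo_min(path_mtu, DEMO_MTU_1K) : path_mtu;

    assert(path_mtu >= DEMO_MTU_256 && path_mtu <= DEMO_MTU_4K);

    switch (sel) {
    case DEMO_SEL_GREATER:
        return cap > r ? cap : DEMO_MTU_NONE;
    case DEMO_SEL_LESS: {
        int v = demo_min(r - 1, cap);
        return v >= DEMO_MTU_256 ? v : DEMO_MTU_NONE;
    }
    case DEMO_SEL_EQUAL:
        return path_mtu >= r ? r : DEMO_MTU_NONE;
    default: /* DEMO_SEL_LARGEST */
        return path_mtu;
    }
}
```

For example, demo_select_mtu(0x45, DEMO_MTU_2K, 1) corresponds to the
"MTU less then 4K" row of the quirk listing and yields the 1K code.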


From halr at voltaire.com  Tue Dec 26 09:28:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 12:28:26 -0500
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458EC94A.2050808@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il>
	<1166967379.4519.320031.camel@hal.voltaire.com>
	<458EC94A.2050808@mellanox.co.il>
Message-ID: <1167154101.29620.1733.camel@hal.voltaire.com>

On Sun, 2006-12-24 at 13:39, Eitan Zahavi wrote: 
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sun, 2006-12-24 at 07:35, Eitan Zahavi wrote:
> >   
> >> Hi Or,
> >>
> >> Sorry it took me a while.
> >>
> >> According to the IBTA spec:
> >> 1. In order for MTU and MTUSelector to have any effect their component 
> >> mask bits MUST be set to 1 in the query
> >> 2. Behavior of the SM is defined with small "freedom" to choose between 
> >> multiple matching MTU values if they exist.
> >>     
> >
> > I agree in general but would like to be sure about the details. Please
> > be specific as to what IBA spec text you are referring to.
> >   
> The text is part of the PathRecord table.

Are you referring to the description of XXXSelector ?
   
> >> 3. The table below summarizes all options:
> >>
> >> Assuming the value M represents the lowest MTU on the path
> >>     
> >
> > Is M the lowest available MTU or the highest available MTU for that path
> > ?
> >   
> M is the lowest MTU reported by all PortInfo for ports on the path.
                  ^^^
              NeighborMTU

We are saying the same thing in different ways.

-- Hal

> >   
> >> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> >> R represents the MTU value in the request. Similarly R-1 is one below R 
> >> and R+1 is one above R.
> >>
> >> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should   | OpenSM Quirk w. Tavor End Port
> >> ----------+-----------+------------------+-----------------+-------------------------------
> >> UNDEFINED | UNDEFINED | <= M             | M               | min(M,1K)
> >> R         | <         | <= min(R-1, M)   | min(R-1, M)     | min(R-1, M, 1K)
> >> R         | =         | R if M>=R /ERR   | R if M>=R /ERR  | R if M>=R /ERR
> >> R         | >         | R < <= M         | R+1 if M>R /ERR | R+1 if M>R /ERR
> >>     
> >                           ^^^^^^^^
> > For the R> spec response column, I think you are saying the same as:
> >                        >R AND <=M if M>R /ERR
> >                          or
> >                        R < x <=M if M>R /ERR
> >                        where x is resp value
> >   
> Yes, that is what I mean: the response value MUST be both greater than R 
> and less than or equal to M. Otherwise an error.
> > I agree with this table given the redefinition of M above and R > spec
> > response interpretation.
> >   
> Good.
> > -- Hal
> >
> >   
> >> I have built some test code for making sure OpenSM does what is required.
> >> Apparently it does not. In any case where M is not identical to R, it 
> >> fails the request.
> >>
> >> I am working on fixing OpenSM.
> >>
> >> Any comments are welcome.
> >>
> >> EZ
> >>
> >> Or Gerlitz wrote:
> >>     
> >>> Michael S. Tsirkin wrote:
> >>>   
> >>>       
> >>>> I am not yet sure what is best for upstream, so I don't really think we need
> >>>> any RFCs.
> >>>>     
> >>>>         
> >>>   
> >>>       
> >>>> We'll need data from SM guys on whether MTU selector actually works
> >>>> in SMs, and if not what happens when you enable it.
> >>>>     
> >>>>         
> >>> Eitan,
> >>>
> >>> Can you please post here the tavor-quirk patch which was integrated into 
> >>> opensm? i can see the ***code*** of the opensm but might make some wrong 
> >>> assumptions or get into wrong understandings as i am not able to see the 
> >>> patch as is.
> >>>
> >>> Or.
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> openib-general mailing list
> >>> openib-general at openib.org
> >>> http://openib.org/mailman/listinfo/openib-general
> >>>
> >>> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> >>>   
> >>>       
> >
> >


From halr at voltaire.com  Tue Dec 26 10:46:44 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 13:46:44 -0500
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
In-Reply-To: <1167154064.29620.1727.camel@hal.voltaire.com>
References: <20061224170329.GB7111@sashak.voltaire.com>
	<1167154064.29620.1727.camel@hal.voltaire.com>
Message-ID: <1167158802.29620.5949.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote: 
> > When the port is removed from subnet, but previously requested pkey
> > table block is received after this - the lock will be released twice.
> > This leads to deadlocks later when other MAD processor will try to
> > acquire the same lock.
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> Thanks. Applied.

Looks like this applied to OFED 1.1 as well.

-- Hal

> -- Hal
> 
> 


From halr at voltaire.com  Tue Dec 26 10:47:39 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 13:47:39 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <1167154075.29620.1731.camel@hal.voltaire.com>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
Message-ID: <1167158845.29620.6014.camel@hal.voltaire.com>

Hi again Eitan,

On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> Hi Eitan,
> 
> On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote:
> > Hi Hal,
> > 
> > OpenSM just uses the resulting path MTU/rate/pkt-life and fails the
> > query even though the selector might allow selecting an
> > appropriate value.
> > 
> > I have made the attached ibis based program for testing MTU select.
> > 
> > After this fix the following results are obtained for a path
> > allowing a maximal MTU of 2K.
> > 
> > In standard mode:
> > ------------------------------------------------------------
> > MTU greater then ... 256     (0x01) ->  equal to ....... 2K
> > MTU less then ...... 256     (0x41) ->  NO PATHS
> > MTU equal to ....... 256     (0x81) ->  equal to ....... 256
> > MTU largest possible 256     (0xc1) ->  equal to ....... 2K
> > MTU greater then ... 512     (0x02) ->  equal to ....... 2K
> > MTU less then ...... 512     (0x42) ->  equal to ....... 256
> > MTU equal to ....... 512     (0x82) ->  equal to ....... 512
> > MTU largest possible 512     (0xc2) ->  equal to ....... 2K
> > MTU greater then ... 1K      (0x03) ->  equal to ....... 2K
> > MTU less then ...... 1K      (0x43) ->  equal to ....... 512
> > MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
> > MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
> > MTU greater then ... 2K      (0x04) ->  NO PATHS
> > MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
> > MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
> > MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
> > MTU greater then ... 4K      (0x05) ->  NO PATHS
> > MTU less then ...... 4K      (0x45) ->  equal to ....... 2K
> > MTU equal to ....... 4K      (0x85) ->  NO PATHS
> > MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
> > ============================================================
> > 
> > With enable_quirks (when one of the ends is a Tavor device):
> > ------------------------------------------------------------
> > MTU greater then ... 256     (0x01) ->  equal to ....... 1K
> > MTU less then ...... 256     (0x41) ->  NO PATHS
> > MTU equal to ....... 256     (0x81) ->  equal to ....... 256
> > MTU largest possible 256     (0xc1) ->  equal to ....... 2K
> > MTU greater then ... 512     (0x02) ->  equal to ....... 1K
> > MTU less then ...... 512     (0x42) ->  equal to ....... 256
> > MTU equal to ....... 512     (0x82) ->  equal to ....... 512
> > MTU largest possible 512     (0xc2) ->  equal to ....... 2K
> > MTU greater then ... 1K      (0x03) ->  NO PATHS
> > MTU less then ...... 1K      (0x43) ->  equal to ....... 512
> > MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
> > MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
> > MTU greater then ... 2K      (0x04) ->  NO PATHS
> > MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
> > MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
> > MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
> > MTU greater then ... 4K      (0x05) ->  NO PATHS
> > MTU less then ...... 4K      (0x45) ->  equal to ....... 1K
> > MTU equal to ....... 4K      (0x85) ->  NO PATHS
> > MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
> > ============================================================
> > 
> > Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
> 
> Thanks. Applied. Note osm_sa_multipath_record.c had 2 rejected hunks
> which were applied by hand.

Should this be applied for OFED 1.1 as well ?

-- Hal

> -- Hal
> 
> 


From halr at voltaire.com  Tue Dec 26 10:47:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 13:47:26 -0500
Subject: [openib-general] [PATCH] opensm: clean old references on ports
 linking
In-Reply-To: <1167154069.29620.1729.camel@hal.voltaire.com>
References: <20061224174315.GC7111@sashak.voltaire.com>
	<1167154069.29620.1729.camel@hal.voltaire.com>
Message-ID: <1167158805.29620.5951.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
>  On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote:
> > When linking ports, clean up old remote references. Without this the ports
> > are still accessible as "linked" from old neighbors, and in case of ports
> > moving, when some MADs can be lost or reordered, OpenSM subnet data
> > structures become broken.
> > 
> > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> 
> Good catch.
> 
> Thanks. Applied.

Looks like this applied to OFED 1.1 as well.

-- Hal

> -- Hal
> 
> 


From halr at voltaire.com  Tue Dec 26 11:15:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 14:15:27 -0500
Subject: [openib-general] Old svn repository access
Message-ID: <1167160526.29620.7478.camel@hal.voltaire.com>

Hi,

Thought the old svn repository was made RO. When I do a RO operation to
it, I get the following error:

svn log | more
(R)eject, accept (t)emporarily or accept (p)ermanently? svn: PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)

Shouldn't this work ?

-- Hal


From mst at mellanox.co.il  Tue Dec 26 11:26:03 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 21:26:03 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <1167158845.29620.6014.camel@hal.voltaire.com>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
Message-ID: <20061226192603.GA4815@mellanox.co.il>


> Should this be applied for OFED 1.1 as well ?

There are a lot of other fixes all over the stack that might be
useful to people.
But first EWG needs to decide how OFED 1.1 support will be done.

For now, the only thing we have is the support wiki with links
to patches. So if there's a customer that is hit by one
of these bugs, a patch should be created and put here,
and description added to wiki.

-- 
MST


From halr at voltaire.com  Tue Dec 26 11:34:58 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 14:34:58 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <20061226192603.GA4815@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
	<20061226192603.GA4815@mellanox.co.il>
Message-ID: <1167161695.29620.8457.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 14:26, Michael S. Tsirkin wrote:
> > Should this be applied for OFED 1.1 as well ?
> 
> There are a lot of other fixes all over the stack that might be
> useful to people.
> But first EWG needs to decide how OFED 1.1 support will be done.

I thought that was already decided. Tziporet indicated to do this a
while ago (post 1.1 "ship").

> For now, the only thing we have is the support wiki with links
> to patches. So if there's a customer that is hit by one
> of these bugs, a patch should be created and put here,
> and description added to wiki.

Yes and the sources updated as well just in case a new SRPM is
created...

-- Hal


From eitan at mellanox.co.il  Tue Dec 26 11:54:48 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Tue, 26 Dec 2006 21:54:48 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <1167158845.29620.6014.camel@hal.voltaire.com>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
Message-ID: <45917E08.3010005@mellanox.co.il>

Hal Rosenstock wrote:
> Hi again Eitan,
>
> On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
>   
>> Hi Eitan,
>>
>> On Sun, 2006-12-24 at 07:40, Eitan Zahavi wrote:
>>     
>>> Hi Hal,
>>>
>>> OpenSM just uses the resulting path MTU/rate/pkt-life and fail the
>>> query even though the selector might be allowing for selecting an
>>> appropriate value.
>>>
>>> I have made the attached ibis based program for testing MTU select.
>>>
>>> After this fix the following results are obtained for a case of
>>> path allowing maximal 2K MTU .
>>>
>>> In standard mode:
>>> ------------------------------------------------------------
>>> MTU greater then ... 256     (0x01) ->  equal to ....... 2K
>>> MTU less then ...... 256     (0x41) ->  NO PATHS
>>> MTU equal to ....... 256     (0x81) ->  equal to ....... 256
>>> MTU largest possible 256     (0xc1) ->  equal to ....... 2K
>>> MTU greater then ... 512     (0x02) ->  equal to ....... 2K
>>> MTU less then ...... 512     (0x42) ->  equal to ....... 256
>>> MTU equal to ....... 512     (0x82) ->  equal to ....... 512
>>> MTU largest possible 512     (0xc2) ->  equal to ....... 2K
>>> MTU greater then ... 1K      (0x03) ->  equal to ....... 2K
>>> MTU less then ...... 1K      (0x43) ->  equal to ....... 512
>>> MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
>>> MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
>>> MTU greater then ... 2K      (0x04) ->  NO PATHS
>>> MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
>>> MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
>>> MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
>>> MTU greater then ... 4K      (0x05) ->  NO PATHS
>>> MTU less then ...... 4K      (0x45) ->  equal to ....... 2K
>>> MTU equal to ....... 4K      (0x85) ->  NO PATHS
>>> MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
>>> ============================================================
>>>
>>> With enable_quirks (when one of the ends is a Tavor device):
>>> ------------------------------------------------------------
>>> MTU greater then ... 256     (0x01) ->  equal to ....... 1K
>>> MTU less then ...... 256     (0x41) ->  NO PATHS
>>> MTU equal to ....... 256     (0x81) ->  equal to ....... 256
>>> MTU largest possible 256     (0xc1) ->  equal to ....... 2K
>>> MTU greater then ... 512     (0x02) ->  equal to ....... 1K
>>> MTU less then ...... 512     (0x42) ->  equal to ....... 256
>>> MTU equal to ....... 512     (0x82) ->  equal to ....... 512
>>> MTU largest possible 512     (0xc2) ->  equal to ....... 2K
>>> MTU greater then ... 1K      (0x03) ->  NO PATHS
>>> MTU less then ...... 1K      (0x43) ->  equal to ....... 512
>>> MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
>>> MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
>>> MTU greater then ... 2K      (0x04) ->  NO PATHS
>>> MTU less then ...... 2K      (0x44) ->  equal to ....... 1K
>>> MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
>>> MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
>>> MTU greater then ... 4K      (0x05) ->  NO PATHS
>>> MTU less then ...... 4K      (0x45) ->  equal to ....... 1K
>>> MTU equal to ....... 4K      (0x85) ->  NO PATHS
>>> MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
>>> ============================================================
>>>
>>> Signed-off-by: Eitan Zahavi <eitan at mellanox.co.il>
>>>       
>> Thanks. Applied. Note osm_sa_multipath_record.c had 2 rejected hunks
>> which were applied by hand.
>>     
>
> Should this be applied for OFED 1.1 as well ?
>   
I would say it should. But I think it deserves an OFED group call.
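
For anyone reading the table quoted above, here is a minimal sketch of the
selector logic it implies in standard mode. It assumes the request byte
carries the selector in the top two bits and the MTU code in the low six
bits, which is how the 0x01/0x41/0x81/0xc1 values in the table decompose.
The name select_mtu and the enum names are made up for illustration; they
are not OpenSM functions, and the enable_quirks case is not modeled.

```c
#include <stdint.h>

/* IB MTU codes as they appear in the table: 1=256, 2=512, 3=1K, 4=2K, 5=4K. */
enum { MTU_256 = 1, MTU_512 = 2, MTU_1K = 3, MTU_2K = 4, MTU_4K = 5 };

/* Selector values carried in the top two bits of the request byte. */
enum { SEL_GREATER = 0, SEL_LESS = 1, SEL_EXACT = 2, SEL_LARGEST = 3 };

/*
 * Hypothetical helper (not an OpenSM function): given the real MTU code of
 * the path and the request byte, return the MTU code to report, or 0 for
 * "NO PATHS".
 */
uint8_t select_mtu(uint8_t path_mtu, uint8_t req_byte)
{
	uint8_t sel = req_byte >> 6;
	uint8_t req = req_byte & 0x3f;

	switch (sel) {
	case SEL_GREATER:
		/* the path qualifies only if it is strictly above the request */
		return path_mtu > req ? path_mtu : 0;
	case SEL_LESS:
		/* nothing is smaller than 256 */
		if (req <= MTU_256)
			return 0;
		/* largest value strictly below the request that the path allows */
		return path_mtu < req ? path_mtu : (uint8_t)(req - 1);
	case SEL_EXACT:
		/* the path must be able to carry the requested MTU */
		return path_mtu >= req ? req : 0;
	default: /* SEL_LARGEST: the requested value is ignored */
		return path_mtu;
	}
}
```

With path_mtu set to MTU_2K this reproduces every row of the standard-mode
table, e.g. 0x44 ("less than 2K") gives 1K and 0x85 ("equal to 4K") gives
NO PATHS.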
> -- Hal
>
>   
>> -- Hal


From mst at mellanox.co.il  Tue Dec 26 12:01:58 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 22:01:58 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <1167161695.29620.8457.camel@hal.voltaire.com>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
	<20061226192603.GA4815@mellanox.co.il>
	<1167161695.29620.8457.camel@hal.voltaire.com>
Message-ID: <20061226200158.GF4815@mellanox.co.il>

> > > Should this be applied for OFED 1.1 as well ?
> > 
> > There are a lot of other fixes all over the stack that might be
> > useful to people.
> > But first EWG needs to decide how OFED 1.1 support will be done.
> 
> I thought that was already decided. Tziporet indicated to do this a
> while ago (post 1.1 "ship").

The support page. Yes. But not for new SRPMs.

> > For now, the only thing we have is the support wiki with links
> > to patches. So if there's a customer that is hit by one
> > of these bugs, a patch should be created and put here,
> > and description added to wiki.
> 
> Yes and the sources updated as well just in case a new SRPM is
> created...

That's the big question. Suppose someone decides there's a
show-stopper he wants fixed (like the ehca guys had) and wants to build
a bugfix release. This entity might not care about or use opensm,
but since you checked stuff into the branch, a version of opensm that
was not properly QA'd will get dropped into this dot release.
It would have been better to stick with the QA'd code from 1.1.

So what I am saying is: *when* there's a release, the person(s)
doing it should decide which packages they want changes in.

All this stems from the model we had for OFED, where
we have a global "BUILD ID" and a monolithic package
instead of a set of modules which can be updated individually.

Hopefully maintainers (besides Roland, that is) will finally
start making releases of packages; then OFED will package
them together, but the user will later be able to update some packages
separately. This clearly applies to userspace libraries,
and maybe for kernel modules we can also invent something like this,
so that e.g. the ehca module can be updated without risking breaking
mthca.

-- 
MST


From mst at mellanox.co.il  Tue Dec 26 12:04:16 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 22:04:16 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <45917E08.3010005@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
	<45917E08.3010005@mellanox.co.il>
Message-ID: <20061226200416.GG4815@mellanox.co.il>

> > Should this be applied for OFED 1.1 as well ?
> >   
> I would say it should. But I think it deserves OFED group call.

I think we should apply things to the ofed branch only before a bugfix
release, and only for packages that will be re-tested; otherwise untested
code will ship.

-- 
MST


From halr at voltaire.com  Tue Dec 26 12:30:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 15:30:30 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLife explicitly ignoring selectors
In-Reply-To: <20061226200158.GF4815@mellanox.co.il>
References: <458E7532.5030400@mellanox.co.il>
	<1167154075.29620.1731.camel@hal.voltaire.com>
	<1167158845.29620.6014.camel@hal.voltaire.com>
	<20061226192603.GA4815@mellanox.co.il>
	<1167161695.29620.8457.camel@hal.voltaire.com>
	<20061226200158.GF4815@mellanox.co.il>
Message-ID: <1167165028.29620.11344.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote:
> > > > Should this be applied for OFED 1.1 as well ?
> > > 
> > > There are a lot of other fixes all over the stack that might be
> > > useful to people.
> > > But first EWG needs to decide how OFED 1.1 support will be done.
> > 
> > I thought that was already decided. Tziporet indicated to do this a
> > while ago (post 1.1 "ship").
> 
> The support page. Yes. But not for new SRPMs.

That's fine with me but not what a previous email said (in terms of
updating the sources) and what has been followed for OpenSM at least
until now...

> > > For now, the only thing we have is the support wiki with links
> > > to patches. So if there's a customer that is hit by one
> > > of these bugs, a patch should be created and put here,
> > > and description added to wiki.
> > 
> > Yes and the sources updated as well just in case a new SRPM is
> > created...
> 
> That's the big question. Suppose someone decides there's a
> show-stopper he wants fixed (like ehca guys had) and wants to build
> a bugfix release. This entity might not care about or use opensm,
> but since you checked stuff into branch, a version of opensm that
> was not properly QA'd will get dropped in this dot release.
> It would have been better to stick with the QA'd code from 1.1.
> 
> So what I am saying, *when* there's a release the person(s)
> that do it should decide changes in which packages do they want.
> 
> All this stems from the model we had for OFED, where
> we have a global "BUILD ID" and a monolithic package
> instead of a set of modules which can be updated individually.
> 
> Hopefully maintainers (besides Roland that is) will finally
> start making releases of packages,

This has been agreed to and will be done before 1/31 for OFED 1.2.

-- Hal

>  then OFED will package
> them together but user will be later able to update some package
> separately. This clearly applies to userspace libraries,
> and maybe for kernel modules we can also invent something like this too,
> so that e.g. ehca module can be updated without risking breaking
> mthca.


From mst at mellanox.co.il  Tue Dec 26 12:45:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 22:45:04 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLifeexplicitly ignoring selectors
In-Reply-To: <1167165028.29620.11344.camel@hal.voltaire.com>
References: <1167165028.29620.11344.camel@hal.voltaire.com>
Message-ID: <20061226204504.GB4329@mellanox.co.il>

> On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote:
> > > > > Should this be applied for OFED 1.1 as well ?
> > > > 
> > > > There are a lot of other fixes all over the stack that might be
> > > > useful to people.
> > > > But first EWG needs to decide how OFED 1.1 support will be done.
> > > 
> > > I thought that was already decided. Tziporet indicated to do this a
> > > while ago (post 1.1 "ship").
> > 
> > The support page. Yes. But not for new SRPMs.
> 
> That's fine with me but not what a previous email said (in terms of
> updating the sources) and what has been followed for OpenSM at least
> until now...

Maybe I'm wrong. I don't have that mail around.
Was not the idea that when someone wants to do a bugfix release
he puts just these fixes in a package, tests it and releases the update?

If so, opensm should be updated only if it will be re-tested, and
this is only needed before a release.

-- 
MST


From halr at voltaire.com  Tue Dec 26 12:55:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 26 Dec 2006 15:55:00 -0500
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLifeexplicitly ignoring selectors
In-Reply-To: <20061226204504.GB4329@mellanox.co.il>
References: <1167165028.29620.11344.camel@hal.voltaire.com>
	<20061226204504.GB4329@mellanox.co.il>
Message-ID: <1167166497.29620.12552.camel@hal.voltaire.com>

On Tue, 2006-12-26 at 15:45, Michael S. Tsirkin wrote:
> > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote:
> > > > > > Should this be applied for OFED 1.1 as well ?
> > > > > 
> > > > > There are a lot of other fixes all over the stack that might be
> > > > > useful to people.
> > > > > But first EWG needs to decide how OFED 1.1 support will be done.
> > > > 
> > > > I thought that was already decided. Tziporet indicated to do this a
> > > > while ago (post 1.1 "ship").
> > > 
> > > The support page. Yes. But not for new SRPMs.
> > 
> > That's fine with me but not what a previous email said (in terms of
> > updating the sources) and what has been followed for OpenSM at least
> > until now...
> 
> Maybe I'm wrong. I don't have that mail around.

I can repost it if needed (or point to a URL for it).

> Was not the idea that when someone wants to do a bugfix release
> he puts just these fixes in a package, tests it and releases the update?

There was no mention of the testing aspects in that email.

-- Hal

> If so opensm should be updated only if it will be re-tested, and
> this is only needed before release.


From mst at mellanox.co.il  Tue Dec 26 13:38:04 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Tue, 26 Dec 2006 23:38:04 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in
 usingMTU/rate/PktLifeexplicitly ignoring selectors
In-Reply-To: <1167166497.29620.12552.camel@hal.voltaire.com>
References: <1167166497.29620.12552.camel@hal.voltaire.com>
Message-ID: <20061226213804.GC4329@mellanox.co.il>

> > > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote:
> > > > > > > Should this be applied for OFED 1.1 as well ?
> > > > > > 
> > > > > > There are a lot of other fixes all over the stack that might be
> > > > > > useful to people.
> > > > > > But first EWG needs to decide how OFED 1.1 support will be done.
> > > > > 
> > > > > I thought that was already decided. Tziporet indicated to do this a
> > > > > while ago (post 1.1 "ship").
> > > > 
> > > > The support page. Yes. But not for new SRPMs.
> > > 
> > > That's fine with me but not what a previous email said (in terms of
> > > updating the sources) and what has been followed for OpenSM at least
> > > until now...
> > 
> > Maybe I'm wrong. I don't have that mail around.
> 
> I can repost it if needed (or point to a URL for it).

Why not?

> > Was not the idea that when someone wants to do a bugfix release
> > he puts just these fixes in a package, tests it and releases the update?
> 
> There was no mention of the testing aspects in that email.

So, what do you think?
 
> > If so opensm should be updated only if it will be re-tested, and
> > this is only needed before release.

-- 
MST


From sashak at voltaire.com  Tue Dec 26 16:35:09 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 02:35:09 +0200
Subject: [openib-general] [PATCH] opensm: rwlock double-release fix.
In-Reply-To: <1167158802.29620.5949.camel@hal.voltaire.com>
References: <20061224170329.GB7111@sashak.voltaire.com>
	<1167154064.29620.1727.camel@hal.voltaire.com>
	<1167158802.29620.5949.camel@hal.voltaire.com>
Message-ID: <20061227003509.GB32492@sashak.voltaire.com>

On 13:46 Tue 26 Dec     , Hal Rosenstock wrote:
> On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> > On Sun, 2006-12-24 at 12:03, Sasha Khapyorsky wrote: 
> > > When the port is removed from subnet, but previously requested pkey
> > > table block is received after this - the lock will be released twice.
> > > This leads to deadlocks later when other MAD processor will try to
> > > acquire the same lock.
> > > 
> > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > 
> > Thanks. Applied.
> 
> Looks like this applied to OFED 1.1 as well.

Yes, this is the old code.

Sasha
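
The patch itself is not reproduced in this thread, so the following is only
a generic sketch of the pattern that avoids this kind of double release:
every path out of the handler goes through one exit label, and only that
label releases the lock. The toy_lock type and process_block() are
hypothetical stand-ins, not OpenSM code.

```c
#include <stddef.h>

/* Toy lock that only counts holders, so a double release is observable. */
struct toy_lock {
	int held;
};

void toy_acquire(struct toy_lock *l) { l->held++; }
void toy_release(struct toy_lock *l) { l->held--; }

/*
 * Hypothetical handler for a received table block. 'port' is NULL when
 * the port has already been removed from the subnet. All paths leave
 * through the single 'out' label, so the lock is released exactly once.
 */
int process_block(struct toy_lock *l, const int *port)
{
	int status = 0;

	toy_acquire(l);
	if (port == NULL) {
		status = -1;	/* port is gone: skip processing */
		goto out;	/* do not release here; 'out' does it */
	}
	/* ... processing of the block for *port would go here ... */
out:
	toy_release(l);
	return status;
}
```

After either path the holder count is back at zero; an extra release on
the early-exit branch would drive it negative.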


From sashak at voltaire.com  Tue Dec 26 16:35:55 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 02:35:55 +0200
Subject: [openib-general] [PATCH] opensm: clean old references on ports
 linking
In-Reply-To: <1167158805.29620.5951.camel@hal.voltaire.com>
References: <20061224174315.GC7111@sashak.voltaire.com>
	<1167154069.29620.1729.camel@hal.voltaire.com>
	<1167158805.29620.5951.camel@hal.voltaire.com>
Message-ID: <20061227003555.GC32492@sashak.voltaire.com>

On 13:47 Tue 26 Dec     , Hal Rosenstock wrote:
> On Tue, 2006-12-26 at 12:28, Hal Rosenstock wrote:
> >  On Sun, 2006-12-24 at 12:43, Sasha Khapyorsky wrote:
> > > When linking ports, cleanup old remote references. Without it the ports
> > > still be accessible as "linked" from old neighbors and in case of ports
> > > moving, when some MADs can be lost or reordered, OpenSM subnet data
> > > structures become broken.
> > > 
> > > Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
> > 
> > Good catch.
> > 
> > Thanks. Applied.
> 
> Looks like this applied to OFED 1.1 as well.

Yes.

Sasha
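
To illustrate the idea in the patch description (not the patch itself),
here is a minimal sketch with a made-up port record. OpenSM's real port
structures carry far more state; struct port, port_unlink() and
port_link() are assumptions for the example only.

```c
#include <stddef.h>

/* Hypothetical port record; only the neighbor pointer matters here. */
struct port {
	struct port *remote;	/* neighbor this port is linked to, or NULL */
};

/* Drop the link on both ends so the old neighbor no longer points back. */
void port_unlink(struct port *p)
{
	if (p->remote != NULL) {
		/* clear the back reference only if it really points at us */
		if (p->remote->remote == p)
			p->remote->remote = NULL;
		p->remote = NULL;
	}
}

/* Link a and b, first clearing whatever either side was linked to before. */
void port_link(struct port *a, struct port *b)
{
	port_unlink(a);
	port_unlink(b);
	a->remote = b;
	b->remote = a;
}
```

If a is first linked to b and then moved to c, b ends up with no remote
instead of still appearing "linked" to a.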


From sashak at voltaire.com  Tue Dec 26 17:16:15 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 03:16:15 +0200
Subject: [openib-general] [PATCH] osm:Fix PathRecord bug in using
 MTU/rate/PktLifeexplicitly ignoring selectors
In-Reply-To: <20061226204504.GB4329@mellanox.co.il>
References: <1167165028.29620.11344.camel@hal.voltaire.com>
	<20061226204504.GB4329@mellanox.co.il>
Message-ID: <20061227011615.GD32492@sashak.voltaire.com>

On 22:45 Tue 26 Dec     , Michael S. Tsirkin wrote:
> > On Tue, 2006-12-26 at 15:01, Michael S. Tsirkin wrote:
> > > > > > Should this be applied for OFED 1.1 as well ?
> > > > > 
> > > > > There are a lot of other fixes all over the stack that might be
> > > > > useful to people.
> > > > > But first EWG needs to decide how OFED 1.1 support will be done.
> > > > 
> > > > I thought that was already decided. Tziporet indicated to do this a
> > > > while ago (post 1.1 "ship").
> > > 
> > > The support page. Yes. But not for new SRPMs.
> > 
> > That's fine with me but not what a previous email said (in terms of
> > updating the sources) and what has been followed for OpenSM at least
> > until now...
> 
> Maybe I'm wrong. I don't have that mail around.
> Was not the idea that when someone wants to do a bugfix release
> he puts just these fixes in a package, tests it and releases the update?

What is the point of putting it all together and having minimal testing
time, without any native pre-release testing? Why is source version
control needed then?

If you need to remember where the last release point was, just use a tag
(or a date). And if someone needs to "cherry-pick" fixes, she/he will be
able to use this tag.

Sasha

> 
> If so opensm should be updated only if it will be re-tested, and
> this is only needed before release.
> 
> -- 
> MST


From eitan at sw053.yok.mtl.com  Tue Dec 26 21:10:18 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Wed, 27 Dec 2006 07:10:18 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-27:normal completion
Message-ID: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Tue_Dec_26_12:24:26_2006 1ae301 
ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b 
Total=351 Pass=349 Fail=2

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
38 OsmTest IS1-16.topo
38 OsmStress IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo

Failures:
1 OsmTest IS1-16.topo
1 OsmStress IS1-16.topo


From eitan at mellanox.co.il  Tue Dec 26 22:47:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 27 Dec 2006 08:47:47 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-27:normal
 completion
In-Reply-To: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>
References: <200612270510.kBR5AIn5016958@sw053.yok.mtl.com>
Message-ID: <45921713.4040301@mellanox.co.il>

Analysis:

OsmStress: TEST ISSUE = Somehow OpenSM lost its local port, which should never have gone into the DOWN state.
OsmTest: ibmgtsim issue = the fix I introduced for the
         deadlock actually causes a race on client close that makes the simulator segfault. I need to
         really resolve the deadlock. I should have seen it coming.

EZ


Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Tue_Dec_26_12:24:26_2006 1ae301 
> ibutils rev = Tue_Dec_26_00:00:31_2006 f81b3b 
> Total=351 Pass=349 Fail=2
>
> Pass:
> 39 Stability IS1-16.topo
> 39 Pkey IS1-16.topo
> 39 Multicast IS1-16.topo
> 39 LidMgr IS1-16.topo
> 38 OsmTest IS1-16.topo
> 38 OsmStress IS1-16.topo
> 13 Stability IS3-loop.topo
> 13 Stability IS3-128.topo
> 13 Pkey IS3-128.topo
> 13 OsmTest IS3-loop.topo
> 13 OsmTest IS3-128.topo
> 13 OsmStress IS3-128.topo
> 13 Multicast IS3-loop.topo
> 13 Multicast IS3-128.topo
> 13 LidMgr IS3-128.topo
>
> Failures:
> 1 OsmTest IS1-16.topo
> 1 OsmStress IS1-16.topo


From yosefe at voltaire.com  Tue Dec 26 23:27:41 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 27 Dec 2006 09:27:41 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <20061225234648.GJ17469@mellanox.co.il>
References: <45900F45.50906@voltaire.com>
	<20061225234648.GJ17469@mellanox.co.il>
Message-ID: <4592206D.3070206@voltaire.com>

Michael S. Tsirkin wrote:
>>>Subject: Re: ofed 1.2 - compilation erros on ppc64 and ia64
>>>
>>>Michael S. Tsirkin wrote:
>>>
>>>
>>>
>>>>>>Quoting r. Yosef Etigin <yosefe at voltaire.com>:
>>>>>>Subject: ofed 1.2 - compilation erros on ppc64 and ia64
>>>>>>  
>>>>>>
>>>>>
>>>>>Which distro are you testing on?
>>>>>
>>>>>
>>>>>
>>>>
>>>>I am testing on sles10, both ia64 and ppc64.
>>>>
>>>>
>>>>
>>>>>>Hello,
>>>>>>I've been testing ofed 1.2 build from 
>>>>>>http://staging.openfabrics.org/builds/ 
>>>>>><http://staging.openfabrics.org/build/>, (latest.tgz versions both user 
>>>>>>and kernel) and got compilation errors on: ia64, ppc64:
>>>>>>
>>>>>>*ppc64:*
>>>>>>
>>>>>>  make -w -C ip ip
>>>>>>  make[2]: Entering directory
>>>>>>  `/tmp/openib_gen2/userspace/src/userspace/ipoibtools/iproute2/ip'
>>>>>>  [ ... omitted text ... ]
>>>>>>  gcc -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall -I../include
>>>>>>  -DRESOLVE_HOSTNAMES -c -o xfrm_monitor.o xfrm_monitor.c
>>>>>>  gcc -g -O2 -m64 -L/usr/lib64 ip.o ipaddress.o iproute.o iprule.o
>>>>>>  rtm_map.o iptunnel.o ipneigh.o ipntable.o iplink.o ipmaddr.o
>>>>>>  ipmonitor.o ipmroute.o ipprefix.o ipxfrm.o xfrm_state.o
>>>>>>  xfrm_policy.o xfrm_monitor.o ../lib/libnetlink.a ../lib/libutil.a
>>>>>>  -lresolv -L../lib -lnetlink -lutil -o ip
>>>>>>  /usr/bin/ld: skipping incompatible ../lib/libnetlink.a when
>>>>>>  searching for -lnetlink
>>>>>>  /usr/bin/ld: skipping incompatible
>>>>>>  /usr/lib/gcc/powerpc64-suse-linux/4.1.0/../../../libnetlink.a when
>>>>>>  searching for -lnetlink
>>>>>>  /usr/bin/ld: skipping incompatible /usr/lib/libnetlink.a when
>>>>>>  searching for -lnetlink
>>>>>>  /usr/bin/ld: cannot find -lnetlink
>>>>>>  collect2: ld returned 1 exit status
>>>>>>  make[2]: *** [ip] Error 1
>>>>>>
>>>>>>possible cause: the src/userspace/ipoibtools/iproute2/Makefile overrides 
>>>>>>CFLAGS (= instead of +=)
>>>>>>  
>>>>>>
>>>>>
>>>>>Isn't this makefile part of iproute2?
>>>>>Can you build iproute on this platform?
>>>>>
>>>>>
>>>>
>>>>This makefile is indeed of iproute,
>>>>but it seems to make 32-bit object files for `iproute' during compilation
>>>>and therefore fails to find 64-bit during linkage of `ip'.
>>>
>>>
>>>Will installing the 32 bit version of the library help?
>>>
>>>
>>
>>I don't think so.. the issue arose during compilation, since `iproute'
>>was inconsistent in its use of -m64:
>>The iproute Makefile overrides any `CFLAGS' it might get from top-level,
>>thus throwing `-m64' away, while LDFLAGS are not overridden.
>>Therefore, the compilation is done in 32-bit while the linkage is in 64-bit
> 
> 
> Probably the easiest thing is to fix iproute. Patch?
> 
> 
>>>>>	
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>*ia64:*
>>>>>>
>>>>>>  make -f /usr/src/linux-2.6.16.21-0.8/scripts/Makefile.build
>>>>>>  obj=/tmp/openib_gen2/kernel/drivers/infiniband/core
>>>>>>  gcc [ ... omitted text ... ] -c -o
>>>>>>  /tmp/openib_gen2/kernel/drivers/infiniband/core/.tmp_addr.o
>>>>>>  /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c
>>>>>>  In file included from /tmp/openib_gen2/kernel/include/rdma/ib_addr.h:37,
>>>>>>  from /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:38:
>>>>>>  /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>>>>  ‘ib_sg_dma_address’:
>>>>>>  /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1577: error:
>>>>>>  implicit declaration of function ‘sg_dma_address’
>>>>>>  /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h: In function
>>>>>>  ‘ib_sg_dma_len’:
>>>>>>  /tmp/openib_gen2/kernel/include/rdma/ib_verbs.h:1590: error:
>>>>>>  implicit declaration of function ‘sg_dma_len’
>>>>>>  /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c: At top level:
>>>>>>  /tmp/openib_gen2/kernel/drivers/infiniband/core/addr.c:61: warning:
>>>>>>  initialization from incompatible pointer type
>>>>>>  [ ... omitted text ... ]
>>>>>>  make: *** [kernel] Error 2
>>>>>>  
>>>>>>
>>>>>
>>>>>Probably a distro-specific backport problem - check how come sg_dma_len is not defined.
>>>>>I see this on upstream 2.6.16
>>>>>	asm-powerpc/scatterlist.h:#define sg_dma_len(sg) ((sg)->dma_length)
>>>>>
>>>>>
>>>>
>>>>I'm running this on ia64, `sg_dma_len' is not defined there, nor anywhere 
>>>>else in this file, but in:
>>>>       ./asm-ia64/pci.h:82:#define sg_dma_len(sg)    ((sg)->dma_length)
>>>>
>>>
>>>
>>>I see, it's fixed in 2.6.20.
>>>Need to do something about it in the backport then.
>>>
>>>I wonder whether we can just put
>>>#ifdef __ia64__
>>>#define sg_dma_len(sg)          ((sg)->dma_length)
>>>#endif
>>>
>>>in kernel_addons/backports/2.6.16/include/asm/scatterlist.h
>>>
>>>Also need to find out in which kernel this was fixed.
>>>
>>
>>Looks like in all kernels up to 2.6.20 it was in `pci.h' so need to 
>>backport to.. all previous versions
> 
> 
> Right. Try sticking this in kernel_addons/backports/2.6.20 and
> copying it over.
> 

OK, I put:

#ifndef BACKPORT_SCATTERLIST_H
#define BACKPORT_SCATTERLIST_H

#include_next <asm/scatterlist.h>

#ifdef __ia64__
#define sg_dma_address(sg)     ((sg)->dma_address)
#define sg_dma_len(sg)         ((sg)->dma_length)
#endif

#endif

in kernel_addons/backport/X where X<=2.6.19
and it does the job
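
A possible variation, shown here only as a userspace sketch: guard each
accessor on whether the macro already exists rather than on __ia64__, so
the header stays harmless on kernels where <asm/scatterlist.h> already
provides it. The mock_scatterlist struct and total_dma_len() are made up
so the snippet builds outside the kernel; I have not tried this form in
the backport tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the two fields the macros touch; not the kernel's struct. */
struct mock_scatterlist {
	uint64_t dma_address;
	unsigned int dma_length;
};

/*
 * Define the accessors only if an earlier header has not already done so.
 * Guarding on the macro name instead of on __ia64__ is an assumption about
 * what would be safest, not what the header above does.
 */
#ifndef sg_dma_address
#define sg_dma_address(sg)	((sg)->dma_address)
#endif
#ifndef sg_dma_len
#define sg_dma_len(sg)		((sg)->dma_length)
#endif

/* Sum the DMA lengths of n entries using the accessor macro. */
unsigned int total_dma_len(const struct mock_scatterlist *sg, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += sg_dma_len(&sg[i]);
	return total;
}
```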

--
Yossi


From mst at mellanox.co.il  Tue Dec 26 23:53:21 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 09:53:21 +0200
Subject: [openib-general] ofed 1.2 - compilation erros on ppc64 and ia64
In-Reply-To: <4592206D.3070206@voltaire.com>
References: <45900F45.50906@voltaire.com>
	<20061225234648.GJ17469@mellanox.co.il> <4592206D.3070206@voltaire.com>
Message-ID: <20061227075321.GE19436@mellanox.co.il>

> OK, I put:
> 
> #ifndef BACKPORT_SCATTERLIST_H
> #define BACKPORT_SCATTERLIST_H
> 
> #include_next <asm/scatterlist.h>
> 
> #ifdef __ia64__
> #define sg_dma_address(sg)     ((sg)->dma_address)
> #define sg_dma_len(sg)         ((sg)->dma_length)
> #endif
> 
> #endif
> 
> in kernel_addons/backport/X where X<=2.6.19
> and it does the job

OK. Where can I pull all this from?

-- 
MST


From kliteyn at dev.mellanox.co.il  Wed Dec 27 01:03:23 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 11:03:23 +0200
Subject: [openib-general] [PATCH 1/3] osm: Changes for windows compatability
Message-ID: <459236DB.70009@dev.mellanox.co.il>

Hi Hal.

Fixing windows compilation problems.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/iba/ib_types.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index 723e8b9..ec65b64 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -59,9 +59,10 @@ BEGIN_C_DECLS
          #define OSM_EXPORT	__declspec(dllimport)
     #endif
     #define OSM_API __stdcall
+    #define OSM_CDECL __cdecl
 #else
     #define OSM_EXPORT	extern
-    #define OSM_API
+    #define OSM_CDECL
     #define __ptr64
 #endif
 
-- 
1.4.4.1.GIT
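
For context, a minimal sketch of how a macro like OSM_CDECL is typically
used: it marks callbacks handed to the C runtime (qsort() here), which on
Windows expects the __cdecl convention even if the rest of the build
defaults to something else. The sw_entry type and both functions are
hypothetical; only the macro name comes from the patch.

```c
#include <stdlib.h>

/* Mirrors the patch's intent: __cdecl on Windows, nothing elsewhere. */
#if defined(_WIN32)
#define OSM_CDECL __cdecl
#else
#define OSM_CDECL
#endif

/* Hypothetical record type standing in for a fat-tree switch entry. */
struct sw_entry {
	unsigned int index;
};

/* qsort() comparator over an array of pointers, ordered by index. */
int OSM_CDECL compare_by_index(const void *p1, const void *p2)
{
	const struct sw_entry *a = *(const struct sw_entry * const *)p1;
	const struct sw_entry *b = *(const struct sw_entry * const *)p2;

	/* yields -1, 0 or 1 without risking unsigned wrap-around */
	return (a->index > b->index) - (a->index < b->index);
}

/* Sort an array of n pointers to sw_entry in ascending index order. */
void sort_switches(struct sw_entry **tbl, size_t n)
{
	qsort(tbl, n, sizeof(*tbl), compare_by_index);
}
```

On non-Windows builds the macro expands to nothing, so the comparator has
the ordinary signature that qsort() expects.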


From kliteyn at dev.mellanox.co.il  Wed Dec 27 01:03:50 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 11:03:50 +0200
Subject: [openib-general] [PATCH 2/3] osm: Changes for windows compatability
Message-ID: <459236F6.8060707@dev.mellanox.co.il>

Hi Hal.

Fixing windows compilation problems.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |   42 ++++++++++++++++++++++--------------------
 1 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index ba95a0d..054e3c9 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -135,8 +135,8 @@ typedef uint8_t * ftree_fwd_tbl_t;
 typedef struct ftree_port_t_ 
 {
    cl_map_item_t  map_item;
-   uint16_t       port_num;           /* port number on the current node */
-   uint16_t       remote_port_num;    /* port number on the remote node */
+   uint8_t        port_num;           /* port number on the current node */
+   uint8_t        remote_port_num;    /* port number on the remote node */
    uint32_t       counter_up;         /* number of allocated routs upwards */
    uint32_t       counter_down;       /* number of allocated routs downwards */
 } ftree_port_t;
@@ -212,7 +212,7 @@ typedef struct ftree_fabric_t_
    cl_qmap_t       hca_tbl;
    cl_qmap_t       sw_tbl;
    cl_qmap_t       sw_by_tuple_tbl;
-   uint32_t        tree_rank;
+   uint16_t        tree_rank;
    ftree_sw_t   ** leaf_switches;
    uint32_t        leaf_switches_num;
    uint16_t        max_hcas_per_leaf;
@@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_
  **
  ***************************************************/
 
-int
+int OSM_CDECL
 __osm_ftree_compare_switches_by_index(
    IN  const void * p1, 
    IN  const void * p2)
@@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index(
 
 /***************************************************/
 
-int
+int OSM_CDECL
 __osm_ftree_compare_port_groups_by_remote_switch_index(
    IN  const void * p1, 
    IN  const void * p2)
@@ -401,8 +401,8 @@ __osm_ftree_sw_tbl_element_destroy(
 
 static ftree_port_t * 
 __osm_ftree_port_create( 
-   IN  uint16_t port_num,
-   IN  uint16_t remote_port_num)
+   IN  uint8_t port_num,
+   IN  uint8_t remote_port_num)
 {
    ftree_port_t * p_port = (ftree_port_t *)malloc(sizeof(ftree_port_t));
    if (!p_port)
@@ -553,8 +553,8 @@ __osm_ftree_port_group_dump(
 static void
 __osm_ftree_port_group_add_port(
    IN  ftree_port_group_t * p_group,
-   IN  uint16_t             port_num,
-   IN  uint16_t             remote_port_num)
+   IN  uint8_t              port_num,
+   IN  uint8_t              remote_port_num)
 {
    uint16_t i;
    ftree_port_t * p_port;
@@ -722,8 +722,8 @@ __osm_ftree_sw_get_port_group_by_remote_
 static void 
 __osm_ftree_sw_add_port(
    IN  ftree_sw_t       * p_sw,
-   IN  uint16_t           port_num,
-   IN  uint16_t           remote_port_num,
+   IN  uint8_t            port_num,
+   IN  uint8_t            remote_port_num,
    IN  ib_net16_t         base_lid,
    IN  uint8_t            lmc,
    IN  ib_net16_t         remote_base_lid,
@@ -872,8 +872,8 @@ __osm_ftree_hca_get_port_group_by_remote
 static void 
 __osm_ftree_hca_add_port(
    IN  ftree_hca_t * p_hca,
-   IN  uint16_t      port_num,
-   IN  uint16_t      remote_port_num,
+   IN  uint8_t       port_num,
+   IN  uint8_t       remote_port_num,
    IN  ib_net16_t    base_lid,
    IN  uint8_t       lmc,
    IN  ib_net16_t    remote_base_lid,
@@ -1799,7 +1799,7 @@ __osm_ftree_fabric_route_upgoing_by_goin
 
       /* find the least loaded port of the group (in indexing order) */
       p_min_port = NULL;
-      ports_num = cl_ptr_vector_get_size(&p_group->ports);
+      ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports);
       /* ToDo: no need to select a least loaded port for non-main path.
          Think about optimization. */
       for (j = 0; j < ports_num; j++) 
@@ -1951,7 +1951,7 @@ __osm_ftree_fabric_route_downgoing_by_go
    {
       p_group = p_sw->up_port_groups[i];
 
-      ports_num = cl_ptr_vector_get_size(&p_group->ports);
+      ports_num = (uint16_t)cl_ptr_vector_get_size(&p_group->ports);
       for (j = 0; j < ports_num; j++)
       {
          cl_ptr_vector_at(&p_group->ports, j, (void **)&p_port);
@@ -2182,7 +2182,9 @@ __osm_ftree_fabric_route_to_hcas(
          osm_log(&p_ftree->p_osm->log, OSM_LOG_DEBUG,"__osm_ftree_fabric_route_to_hcas: "
                  "Routing %u dummy HCAs\n",
                  p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
-         for (j = 0; j < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num); j++)
+         for ( j = 0;
+               ((int)j) < (p_ftree->max_hcas_per_leaf - p_sw->down_port_groups_num);
+               j++)
          {
             /* assign downgoing ports by stepping up */
             __osm_ftree_fabric_route_downgoing_by_going_up(
@@ -2329,7 +2331,7 @@ __osm_ftree_rank_from_switch(
    osm_node_t   * p_node;
    osm_node_t   * p_remote_node;
    osm_physp_t  * p_osm_port;
-   uint16_t       i;
+   uint8_t        i;
    cl_list_t      bfs_list;
    ftree_sw_tbl_element_t * p_sw_tbl_element = NULL;
 
@@ -2394,7 +2396,7 @@ __osm_ftree_rank_switches_from_hca(
    osm_node_t     * p_osm_node = p_hca->p_osm_node;
    osm_node_t     * p_remote_osm_node;
    osm_physp_t    * p_osm_port;
-   static uint16_t i = 0;
+   static uint8_t   i = 0;
    int res = 0;
 
    OSM_LOG_ENTER(&p_ftree->p_osm->log, __osm_ftree_rank_switches_from_hca);
@@ -2493,7 +2495,7 @@ __osm_ftree_fabric_construct_hca_ports(
    uint8_t           remote_node_type;
    ib_net64_t        remote_node_guid;
    osm_physp_t     * p_remote_osm_port;
-   uint16_t          i;
+   uint8_t           i;
    uint8_t           remote_port_num;
    int res = 0;
 
@@ -2590,7 +2592,7 @@ __osm_ftree_fabric_construct_sw_ports(
    osm_physp_t       * p_remote_osm_port;
    ftree_direction_t   direction;
    void              * p_remote_hca_or_sw;
-   uint16_t            i;
+   uint8_t             i;
    uint8_t             remote_port_num;
    int res = 0;
 
-- 
1.4.4.1.GIT


From kliteyn at dev.mellanox.co.il  Wed Dec 27 01:05:18 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 11:05:18 +0200
Subject: [openib-general] [PATCH 3/3] osm: Changes for windows compatibility
Message-ID: <4592374E.7020008@dev.mellanox.co.il>

Hi Hal.

Fixing windows compilation problems.

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/osmtest/osmtest.c |   11 +++++++----
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index 0ccc06c..05b1134 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -471,10 +471,13 @@ osmtest_destroy( IN osmtest_t * const p_
 {
   cl_map_item_t *p_item,*p_next_item;
 
+  /* Currently there is a problem with IBAL exit flow - memory overrun,
+     so bypass vendor deletion - it will be cleaned by the Windows OS */
+#ifndef __WIN__
   if( p_osmt->p_vendor )
-  {
     osm_vendor_delete( &p_osmt->p_vendor );
-  }
+#endif
+
   cl_qpool_destroy( &p_osmt->port_pool );
   cl_qpool_destroy( &p_osmt->node_pool );
 
@@ -4922,7 +4925,7 @@ osmtest_informinfo_request(
     /* as currently no comp mask bits defined for InformInfo!!! */
     user.comp_mask = IB_IIR_COMPMASK_SUBSCRIBE;
     p_inform_info_opt = p_options;
-    rec.subscribe = p_inform_info_opt->subscribe;
+    rec.subscribe = (uint8_t)p_inform_info_opt->subscribe;
     if (p_inform_info_opt->qpn)
     {
       rec.g_or_v.generic.qpn_resp_time_val = cl_hton32(p_inform_info_opt->qpn) >> 8;
@@ -5601,7 +5604,7 @@ osmtest_validate_against_db( IN osmtest_
 #ifdef DUAL_SIDED_RMPP
   osmv_multipath_req_t request;
 #endif
-  int i; 
+  uint8_t i; 
 #endif
 
   OSM_LOG_ENTER( &p_osmt->log, osmtest_validate_against_db );
-- 
1.4.4.1.GIT


From yosefe at voltaire.com  Wed Dec 27 02:08:14 2006
From: yosefe at voltaire.com (Yosef Etigin)
Date: Wed, 27 Dec 2006 12:08:14 +0200
Subject: [openib-general] [PATCH] ofed 1.2: fix module compilation errors on
	ia64
Message-ID: <1167214095.27740.13.camel@muscida>

Fix compilation errors on ia64 that are caused by the definition of
sg_dma_address and sg_dma_len in asm-ia64/pci.h instead of in
asm/scatterlist.h, as in other architectures.

tested on: ia64[sles10]; x86_64 [sles10,rh4]
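
A minimal user-space sketch of what the accessor macros in this patch
provide. struct fake_scatterlist and total_dma_len are hypothetical
stand-ins for illustration; only the field names dma_address and
dma_length and the macro bodies come from the patch. The #ifndef guards
are an addition of the sketch, not something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct scatterlist; only the two
 * fields used by the patch are modelled here. */
struct fake_scatterlist {
	uint64_t     dma_address;
	unsigned int dma_length;
};

/* Same accessor bodies as in the patch. The #ifndef guards are an addition
 * of this sketch so the macros are defined only if nothing else did so. */
#ifndef sg_dma_address
#define sg_dma_address(sg)     ((sg)->dma_address)
#endif
#ifndef sg_dma_len
#define sg_dma_len(sg)         ((sg)->dma_length)
#endif

/* Sum the DMA lengths of an array of entries through the accessor macro,
 * the way driver code would use it instead of touching the field. */
unsigned long total_dma_len(const struct fake_scatterlist *sg, size_t n)
{
	unsigned long total = 0;
	size_t i;

	assert(sg != NULL || n == 0);
	for (i = 0; i < n; i++)
		total += sg_dma_len(&sg[i]);
	return total;
}
```

Code written against the macros compiles the same way regardless of which
header ends up supplying them, which is the property the backport header
is after.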

Signed-off-by: Yosef Etigin <yosefe at voltaire.com>

---
diff -urN a/kernel_addons/backport/2.6.11/include/asm/scatterlist.h b/kernel_addons/backport/2.6.11/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.11/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.11/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h b/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.11_FC4/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.12/include/asm/scatterlist.h b/kernel_addons/backport/2.6.12/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.12/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.12/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.13/include/asm/scatterlist.h b/kernel_addons/backport/2.6.13/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.13/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.13/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h b/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.13_suse10_0_u/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.14/include/asm/scatterlist.h b/kernel_addons/backport/2.6.14/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.14/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.14/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.15/include/asm/scatterlist.h b/kernel_addons/backport/2.6.15/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.15/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.15/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.16/include/asm/scatterlist.h b/kernel_addons/backport/2.6.16/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.16/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.16/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h b/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.16_sles10/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.17/include/asm/scatterlist.h b/kernel_addons/backport/2.6.17/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.17/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.17/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.18/include/asm/scatterlist.h b/kernel_addons/backport/2.6.18/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.18/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.18/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h b/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.18_FC6/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.19/include/asm/scatterlist.h b/kernel_addons/backport/2.6.19/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.19/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.19/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h b/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.5-7.244/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.9_U2/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.9_U3/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
diff -urN a/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h b/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h
--- a/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h	1970-01-01 02:00:00.000000000 +0200
+++ b/kernel_addons/backport/2.6.9_U4/include/asm/scatterlist.h	2006-12-26 16:07:21.000000000 +0200
@@ -0,0 +1,12 @@
+/* fix sg_dma_len in ia64 being in pci.h instead of scatterlist.h */
+#ifndef BACKPORT_SCATTERLIST_H
+#define BACKPORT_SCATTERLIST_H
+
+#include_next <asm/scatterlist.h>
+
+#ifdef __ia64__
+#define sg_dma_address(sg)     ((sg)->dma_address)
+#define sg_dma_len(sg)         ((sg)->dma_length)
+#endif
+
+#endif
--
Yosef Etigin
yosefe at voltaire.com


From jsquyres at cisco.com  Wed Dec 27 05:13:25 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 08:13:25 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <1167160526.29620.7478.camel@hal.voltaire.com>
References: <1167160526.29620.7478.camel@hal.voltaire.com>
Message-ID: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>

This is probably my fault; sorry.  :-(

I advised Sandia that it would be ok to turn off the old server, but  
I thought that the new server was up and running.  Doing some poking  
around on staging.ofa, I see that the SVN repository is located at  
file:///data/svn, but I don't see that it's being made available via  
http[s].

I'll poke around today and see if I can get it up and running via
http[s] on svn.openfabrics.org.


On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote:

> Hi,
>
> Thought the old svn repository was made RO. When I do a RO operation to
> it, I get the following error:
>
> svn log | more
> (R)eject, accept (t)emporarily or accept (p)ermanently? svn: PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)
>
> Shouldn't this work ?
>
> -- Hal
>
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From mst at mellanox.co.il  Wed Dec 27 05:24:01 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 15:24:01 +0200
Subject: [openib-general] Old svn repository access
In-Reply-To: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
References: <1167160526.29620.7478.camel@hal.voltaire.com>
	<5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
Message-ID: <20061227132401.GN19436@mellanox.co.il>

Can the openib.org DNS also be changed to point to the new server?
Scripts from OFED 1.0 still use that name, and I think we should keep
them running.

Quoting r. Jeff Squyres <jsquyres at cisco.com>:
Subject: Re: Old svn repository access

This is probably my fault; sorry.  :-(

I advised Sandia that it would be ok to turn off the old server, but  
I thought that the new server was up and running.  Doing some poking  
around on staging.ofa, I see that the SVN repository is located at  
file:///data/svn, but I don't see that it's being made available via  
http[s].

I'll poke around today and see if I can get it up and running via
http[s] on svn.openfabrics.org.


On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote:

> Hi,
>
> Thought the old svn repository was made RO. When I do a RO operation to
> it, I get the following error:
>
> svn log | more
> (R)eject, accept (t)emporarily or accept (p)ermanently? svn: PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)
>
> Shouldn't this work ?
>
> -- Hal
>
>


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


-- 
MST


From jsquyres at cisco.com  Wed Dec 27 05:37:24 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 08:37:24 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <20061227132401.GN19436@mellanox.co.il>
References: <1167160526.29620.7478.camel@hal.voltaire.com>
	<5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
	<20061227132401.GN19436@mellanox.co.il>
Message-ID: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com>

On Dec 27, 2006, at 8:24 AM, Michael S. Tsirkin wrote:

> Can the openib.org dns be also changed to point to the new server?
> Scripts from OFED 1.0 are still using that, I think we should keep
> them running.

I don't think we're there yet -- I need to talk to Michael Lee before  
we make the switch to make openfabrics.org and openib.org point to  
the new server.

What exactly in OFED 1.0 uses the name openib.org -- SVN access?


> Quoting r. Jeff Squyres <jsquyres at cisco.com>:
> Subject: Re: Old svn repository access
>
> This is probably my fault; sorry.  :-(
>
> I advised Sandia that it would be ok to turn off the old server, but
> I thought that the new server was up and running.  Doing some poking
> around on staging.ofa, I see that the SVN repository is located at
> file:///data/svn, but I don't see that it's being made available via
> http[s].
>
> I'll poke around today and see if I can get it up and running via
> http[s] on svn.openfabrics.org.
>
>
>
> On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote:
>
>> Hi,
>>
>> Thought the old svn repository was made RO. When I do a RO operation to
>> it, I get the following error:
>>
>> svn log | more
>> (R)eject, accept (t)emporarily or accept (p)ermanently? svn: PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
>> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)
>>
>> Shouldn't this work ?
>>
>> -- Hal
>>
>>
>
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
>
>
> -- 
> MST


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From dotanb at dev.mellanox.co.il  Wed Dec 27 05:46:06 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Wed, 27 Dec 2006 15:46:06 +0200
Subject: [openib-general] [PATCH] [mthca] don't execute the QUERY command if
 QP is in RESET state
Message-ID: <1167227166.6664.2.camel@mtls05.yok.mtl.com>

If the QP state is RESET, don't execute the QUERY command
(because it will fail).
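
The control flow of the change can be sketched in user space roughly as
follows. All types and functions here (sketch_qp, sketch_qp_attr,
sketch_hw_query, sketch_query_qp) are hypothetical simplifications, not
the real mthca or ib_verbs definitions; the simulated query failing for a
QP in RESET simply models the statement above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real mthca or ib_verbs types. */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS };

struct sketch_qp {
	enum qp_state state;        /* state tracked by the driver */
	int           max_send_wr;  /* capability known in software */
	int           hw_queries;   /* counts simulated firmware queries */
};

struct sketch_qp_attr {
	enum qp_state qp_state;
	enum qp_state cur_qp_state;
	int           max_send_wr;
};

/* Simulated firmware query; in this sketch it fails for a QP in RESET. */
int sketch_hw_query(struct sketch_qp *qp, enum qp_state *hw_state)
{
	qp->hw_queries++;
	if (qp->state == QPS_RESET)
		return -1;
	*hw_state = qp->state;
	return 0;
}

/* Same shape as the patch: skip the query for RESET, then fill the fields
 * known in software on a path shared by both cases. */
int sketch_query_qp(struct sketch_qp *qp, struct sketch_qp_attr *attr)
{
	int err = 0;

	assert(qp != NULL && attr != NULL);

	if (qp->state == QPS_RESET) {
		attr->qp_state = QPS_RESET;
		goto done;
	}

	err = sketch_hw_query(qp, &attr->qp_state);
	if (err)
		goto out;

done:
	attr->cur_qp_state = attr->qp_state;
	attr->max_send_wr  = qp->max_send_wr;
out:
	return err;
}
```

The point of the shared "done" label is that the capability fields are
filled in from driver state in both cases, so a caller gets a usable
answer for a RESET QP without any firmware command being issued.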

Signed-off-by: Dotan Barak <dotanb at mellanox.co.il>

---

Index: gen2_devel_kernel/drivers/infiniband/hw/mthca/mthca_qp.c
===================================================================
--- gen2_devel_kernel.orig/drivers/infiniband/hw/mthca/mthca_qp.c	2006-12-24 19:41:56.000000000 +0200
+++ gen2_devel_kernel/drivers/infiniband/hw/mthca/mthca_qp.c	2006-12-25 15:54:49.000000000 +0200
@@ -429,13 +429,18 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 {
 	struct mthca_dev *dev = to_mdev(ibqp->device);
 	struct mthca_qp *qp = to_mqp(ibqp);
-	int err;
-	struct mthca_mailbox *mailbox;
+	int err = 0;
+	struct mthca_mailbox *mailbox = NULL;
 	struct mthca_qp_param *qp_param;
 	struct mthca_qp_context *context;
 	int mthca_state;
 	u8 status;
 
+	if (qp->state == IB_QPS_RESET) {
+		qp_attr->qp_state = IB_QPS_RESET;
+		goto done;
+	}
+
 	mailbox = mthca_alloc_mailbox(dev, GFP_KERNEL);
 	if (IS_ERR(mailbox))
 		return PTR_ERR(mailbox);
@@ -454,7 +459,6 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	mthca_state = be32_to_cpu(context->flags) >> 28;
 
 	qp_attr->qp_state 	     = to_ib_qp_state(mthca_state);
-	qp_attr->cur_qp_state 	     = qp_attr->qp_state;
 	qp_attr->path_mtu 	     = context->mtu_msgmax >> 5;
 	qp_attr->path_mig_state      =
 		to_ib_mig_state((be32_to_cpu(context->flags) >> 11) & 0x3);
@@ -464,11 +468,6 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->dest_qp_num 	     = be32_to_cpu(context->remote_qpn) & 0xffffff;
 	qp_attr->qp_access_flags     =
 		to_ib_qp_access_flags(be32_to_cpu(context->params2));
-	qp_attr->cap.max_send_wr     = qp->sq.max;
-	qp_attr->cap.max_recv_wr     = qp->rq.max;
-	qp_attr->cap.max_send_sge    = qp->sq.max_gs;
-	qp_attr->cap.max_recv_sge    = qp->rq.max_gs;
-	qp_attr->cap.max_inline_data = qp->max_inline_data;
 
 	if (qp->transport == RC || qp->transport == UC) {
 		to_ib_ah_attr(dev, &qp_attr->ah_attr, &context->pri_path);
@@ -495,7 +494,16 @@ int mthca_query_qp(struct ib_qp *ibqp, s
 	qp_attr->retry_cnt 	    = (be32_to_cpu(context->params1) >> 16) & 0x7;
 	qp_attr->rnr_retry 	    = context->pri_path.rnr_retry >> 5;
 	qp_attr->alt_timeout 	    = context->alt_path.ackto >> 3;
-	qp_init_attr->cap 	    = qp_attr->cap;
+
+done:
+	qp_attr->cur_qp_state	     = qp_attr->qp_state;
+	qp_attr->cap.max_send_wr     = qp->sq.max;
+	qp_attr->cap.max_recv_wr     = qp->rq.max;
+	qp_attr->cap.max_send_sge    = qp->sq.max_gs;
+	qp_attr->cap.max_recv_sge    = qp->rq.max_gs;
+	qp_attr->cap.max_inline_data = qp->max_inline_data;
+
+	qp_init_attr->cap	     = qp_attr->cap;
 
 out:
 	mthca_free_mailbox(dev, mailbox);


From ogerlitz at voltaire.com  Wed Dec 27 06:03:38 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Wed, 27 Dec 2006 16:03:38 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <458E7402.4000106@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il>
Message-ID: <45927D3A.9030502@voltaire.com>

Eitan Zahavi wrote:
> Hi Or,
> 
> Sorry it took me a while.
> 
> According to the IBTA spec:
> 1. In order for MTU and MTUSelector to have any effect their component 
> mask bits MUST be set to 1 in the query
> 2. Behavior of the SM is defined with small "freedom" to choose between 
> multiple matching MTU values if they exist.
> 3. The table below summarizes all options:
> 
> Assuming the value M represents the lowest MTU on the path
> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
> R represents the MTU value in the request. Similarly R-1 is one below R 
> and R+1 is one above R.
> 
> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should   | OpenSM Quirk w. Tavor End Port
> ----------------------------------------------------------------------------------------------
> UNDEFINED | UNDEFINED | <= M             | M               | min(M,1K)
> R         | <         | <= min(R-1, M)   | min(R-1, M)     | min(R-1, M, 1K)
> R         | =         | R if M>=R /ERR   | R if M>=R /ERR  | R if M>=R /ERR
> R         | >         | R < <= M         | R+1 if M>R /ERR | R+1 if M>R /ERR
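
One way to read the "OpenSM Should" and "OpenSM Quirk" columns of the
table above is the small function below. This is only a sketch under
stated assumptions: the MTU levels, the selector names and resp_mtu are
hypothetical and are not the wire encoding or OpenSM code; "ERR" is
modelled as 0; and returning ERR for "<" with the smallest MTU is an
assumption the table does not spell out.

```c
#include <assert.h>

/* Hypothetical encoding for this sketch: MTU levels 1..5 stand for
 * 256, 512, 1K, 2K and 4K bytes; 0 is used for the table's "ERR". */
enum { MTU_ERR = 0, MTU_256 = 1, MTU_512, MTU_1K, MTU_2K, MTU_4K };

/* Hypothetical selector names; these are not the wire values. */
enum mtu_sel { SEL_UNDEFINED, SEL_LESS, SEL_EQUAL, SEL_GREATER };

static int min_int(int a, int b) { return a < b ? a : b; }

/* m: lowest MTU on the path; r: MTU in the request (ignored when the
 * selector is undefined); tavor: non-zero for a Tavor end port. */
int resp_mtu(int m, enum mtu_sel sel, int r, int tavor)
{
	assert(m >= MTU_256 && m <= MTU_4K);

	switch (sel) {
	case SEL_UNDEFINED:
		return tavor ? min_int(m, MTU_1K) : m;
	case SEL_LESS:
		/* Assumption: nothing lies below the smallest MTU. */
		if (r <= MTU_256)
			return MTU_ERR;
		return tavor ? min_int(min_int(r - 1, m), MTU_1K)
			     : min_int(r - 1, m);
	case SEL_EQUAL:
		return m >= r ? r : MTU_ERR;
	case SEL_GREATER:
		return m > r ? r + 1 : MTU_ERR;
	}
	return MTU_ERR;
}
```

For example, with a 2K path and no selector the sketch returns 2K for a
regular end port and 1K for a Tavor end port, matching the first row.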

Hi Eitan,

Not that it matters much for the decision on whether to push this into
OpenSM, but the SM group here is positive about the approach and the
patch you sent.

However, there are some clarifications I would be happy to get:

1st, maybe it's clear to everyone except me, but what do you mean by /ERR
in the table above? Is it what opensm would return before the patch you
suggested?

2nd, can you post the OpenSM Tavor quirk patch?

3rd, Eitan/Michael: what is the bigger picture here? What is the
dependency between these four patches?

+1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
+2 osm: tavor quirk
+3 IB/rdmacm: tavor quirk
+4 IB/ipoib: use appropriate mtu selector for path queries

for example is it correct that:

if [2] is applied on the SA side then [4] must be applied on ipoib, or
else it will get a 1K MTU on its path query?

if [2] is not applied on the SA side, then [3] is useless?

Or.


From jsquyres at cisco.com  Wed Dec 27 06:15:15 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 09:15:15 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
References: <1167160526.29620.7478.camel@hal.voltaire.com>
	<5DA9B31A-650E-4A9C-9CE9-EAFF4C9406C7@cisco.com>
Message-ID: <10526AA0-00ED-4A25-84DD-BD70E09277AF@cisco.com>

After poking around some more, I see that SVN https access was only half
set up.  I've fixed it -- you can now access SVN via authenticated
channels at:

	https://svn.openfabrics.org/svn/openib/

And anonymous channels (commits are disallowed here):

	http://svn.openfabrics.org/svn/openib/

Please let me know if you have any problems with it; sorry for the  
mix-up. :-(

More details on apache and SVN coming soon.


On Dec 27, 2006, at 8:13 AM, Jeff Squyres wrote:

> This is probably my fault; sorry.  :-(
>
> I advised Sandia that it would be ok to turn off the old server, but
> I thought that the new server was up and running.  Doing some poking
> around on staging.ofa, I see that the SVN repository is located at
> file:///data/svn, but I don't see that it's being made available via
> http[s].
>
> I'll poke around today and see if I can get it up and running via
> http[s] on svn.openfabrics.org.
>
>
>
> On Dec 26, 2006, at 2:15 PM, Hal Rosenstock wrote:
>
>> Hi,
>>
>> Thought the old svn repository was made RO. When I do a RO operation
>> to it, I get the following error:
>>
>> svn log | more
>> (R)eject, accept (t)emporarily or accept (p)ermanently? svn:
>> PROPFIND request failed on '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c'
>> svn: PROPFIND of '/svn/gen2/branches/1.1/src/userspace/management/diags/src/ibnetdiscover.c': 405 Method Not Allowed (https://openib.org)
>>
>> Shouldn't this work ?
>>
>> -- Hal
>>
>>
>> _______________________________________________
>> openib-general mailing list
>> openib-general at openib.org
>> http://openib.org/mailman/listinfo/openib-general
>>
>> To unsubscribe, please visit http://openib.org/mailman/listinfo/
>> openib-general
>
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From eitan at mellanox.co.il  Wed Dec 27 06:21:47 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Wed, 27 Dec 2006 16:21:47 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45927D3A.9030502@voltaire.com>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com>
Message-ID: <4592817B.3030700@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>   
>> Hi Or,
>>
>> Sorry it took me a while.
>>
>> According to the IBTA spec:
>> 1. In order for MTU and MTUSelector to have any effect their component 
>> mask bits MUST be set to 1 in the query
>> 2. Behavior of the SM is defined with small "freedom" to choose between 
>> multiple matching MTU values if they exist.
>> 3. The table below summarizes all options:
>>
>> Assuming the value M represents the lowest MTU on the path
>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>> R represents the MTU value in the request. Similarly R-1 is one below R 
>> and R+1 is one above R.
>>
>> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>> ----------|-----------|------------------|-----------------|-------------------------------
>> UNDEFINED | UNDEFINED | <= M             | M               | min(M, 1K)
>> R         | <         | <= min(R-1, M)   | min(R-1, M)     | min(R-1, M, 1K)
>> R         | =         | R if M>=R /ERR   | R if M>=R /ERR  | R if M>=R /ERR
>> R         | >         | > R and <= M     | R+1 if M>R /ERR | R+1 if M>R /ERR
>>
>
> Hi Eitan,
>
> Not that it matters too much for the decision whether to push this into 
> OpenSM, but the SM group here is positive w.r.t. the approach and the 
> patch you have sent.
>
> However, there are some clarifications I would be happy to get:
>
> 1st, maybe it's clear to everyone except me, but what do you mean by /ERR 
> in the table above? Is it what OpenSM would return before the patch you 
> suggested?
>   
Hi Or,

By ERR I mean that the path being evaluated is rejected, i.e. it is not 
included in the group of paths returned in the response to the query.
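
To make the table and the meaning of /ERR concrete, here is a minimal C
sketch of the "OpenSM Should" and "Quirk" columns. It is only an
illustration, not the OpenSM code: the enum, the function name, the
MTU_REJECT value and the handling of a "<" query with the smallest MTU
are assumptions made for this example. MTU values are treated as
ordered levels (1 = 256 ... 3 = 1K, 4 = 2K, 5 = 4K), so "R-1" and "R+1"
are simply the neighbouring levels.

```c
/* Hypothetical selector encoding used only by this sketch. */
enum mtu_sel { SEL_UNDEFINED, SEL_LT, SEL_EQ, SEL_GT };

#define MTU_1K      3     /* level for a 1K MTU (1 = 256, 2 = 512, 3 = 1K, ...) */
#define MTU_REJECT  (-1)  /* "/ERR": path is left out of the response */

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/*
 * m     - lowest MTU level found on the path (M in the table)
 * sel   - selector from the query
 * r     - MTU level from the query (R in the table), ignored if undefined
 * tavor - nonzero if one of the end ports is a Tavor and the quirk is on
 *
 * Returns the MTU level to report for this path, or MTU_REJECT.
 */
static int select_path_mtu(int m, enum mtu_sel sel, int r, int tavor)
{
	int mtu;

	switch (sel) {
	case SEL_UNDEFINED:
		/* SM is free to pick anything <= M; quirk prefers 1K. */
		return tavor ? min_int(m, MTU_1K) : m;
	case SEL_LT:
		mtu = min_int(r - 1, m);
		if (tavor)
			mtu = min_int(mtu, MTU_1K);
		/* Assumption: nothing below the smallest level exists. */
		return mtu >= 1 ? mtu : MTU_REJECT;
	case SEL_EQ:
		/* No freedom here, so the quirk cannot change the answer. */
		return m >= r ? r : MTU_REJECT;
	case SEL_GT:
		return m > r ? r + 1 : MTU_REJECT;
	}
	return MTU_REJECT;
}
```

For example, a path with M = 2K queried with "= 4K" yields MTU_REJECT,
so that path does not appear in the returned set, while the same path
queried with no MTU component returns 2K normally and 1K with the quirk.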

> 2nd can you post the open sm tavor quirk patch?
>   
What do you mean? The old patch introducing the "opensm quirk" mode?
These are git commits 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by 
03e3b3a6fa934202c0f4270a2c69d64ac486b1ca,
or SVN revisions 9497 followed by 9518.
> 3rd Eitan/Michael: what is the bigger picture here? what is the 
> dependency between these four patches
>
> +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
>   
Required - OpenSM broken otherwise
> +2 osm: tavor quirk
>   
Required if you want to rely on OpenSM to select a 1K MTU for Tavor 
paths whenever it has the freedom to do so
> +3 IB/rdmacm: tavor quirk
> +4 IB/ipoib: use appropriate mtu selector for path queries
>   
I will let Michael answer that
> for example is it correct that:
>
> if [2] is applied on the SA side then [4] must be applied on ipoib, else 
> it will get 1K mtu on its path query?
>
> if [2] is not applied on the SA side, then [3] is useless?
>
> Or.


From vlad at dev.mellanox.co.il  Wed Dec 27 06:34:36 2006
From: vlad at dev.mellanox.co.il (Vladimir Sokolovsky)
Date: Wed, 27 Dec 2006 16:34:36 +0200
Subject: [openib-general] [PATCH] [MINOR] ipoibtools: fix compilation
 errors on ppc64
In-Reply-To: <1167148716.7006.17.camel@muscida>
References: <1167148716.7006.17.camel@muscida>
Message-ID: <4592847C.5030408@dev.mellanox.co.il>

Applied.
Thanks,

Regards,
Vladimir

Yosef Etigin wrote:
> Fix compilation errors of ipoibtools on ppc64 caused by 
> overriding CFLAGS in the Makefile.
>
> Signed-off-by: Yosef Etigin <yosefe at voltaire.com>
>
> ---
> diff -ur a/src/userspace/ipoibtools/iproute2/Makefile b/src/userspace/ipoibtools/iproute2/Makefile
> --- a/src/userspace/ipoibtools/iproute2/Makefile	2006-12-25 16:18:43.000000000 +0200
> +++ b/src/userspace/ipoibtools/iproute2/Makefile	2006-12-25 15:54:40.000000000 +0200
> @@ -22,7 +22,7 @@
>  CC = gcc
>  HOSTCC = gcc
>  CCOPTS = -D_GNU_SOURCE -O2 -Wstrict-prototypes -Wall
> -CFLAGS = $(CCOPTS) -I../include $(DEFINES)
> +CFLAGS += $(CCOPTS) -I../include $(DEFINES)
>  YACCFLAGS = -d -t -v
>  
>  LDLIBS += -L../lib -lnetlink -lutil
>
> --
> Yosef Etigin
> Voltaire


From jsquyres at cisco.com  Wed Dec 27 07:28:52 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 10:28:52 -0500
Subject: [openib-general] DNS: "git.openfabrics.org" now exists
Message-ID: <6440C91B-ED19-4EFA-B9DC-8EC7DDBA5E54@cisco.com>

The name "git.openfabrics.org" now exists in DNS and points to the  
new server.

I would strongly encourage everyone to start using  
"git.openfabrics.org" as the hostname to access your git repositories  
(vs. "staging.openfabrics.org").  Relevant web pages, documentation,  
etc. should also be updated with this new hostname.

The name "staging.openfabrics.org" was intended to be temporary.  I
propose that it go away at the end of Q1'07 (March 31, 2007).

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From jsquyres at cisco.com  Wed Dec 27 07:35:10 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 10:35:10 -0500
Subject: [openib-general] New server: Apache / SSL / IP addresses
Message-ID: <37E1A658-F4B1-405C-81AA-C31E563D454B@cisco.com>

We have the following services on the OFA server that use  
authentication, and therefore use Apache's SSL services:

- subversion
- bugzilla
- tiki

Due to the nature of SSL connections, you can only have one SSL vhost
per IP address.  Specifically, you cannot have
https://foo.example.com and https://bar.example.com be distinct vhosts
on the same IP address.  This fact, along with the fact that we
currently only have one IP address active on the new server, prevents
the use of multiple <foo>.openfabrics.org hostnames for different
SSL/authenticated services through Apache.

johncompanies.com lists the hosted servers plan as coming with 5 IP
addresses.  Is this the plan that we got?  If so, can we request
3 of our 4 additional IP addresses?  (Who is the OFA contact with
johncompanies.com?)

I propose the following:

IP address 1 (146.246.248.81):
- http://www.openfabrics.org/ -- main web site
- https://www.openfabrics.org/ -- redirects back to http
- http://builds.openfabrics.org/ -- nightly builds
- http://git.openfabrics.org/ -- gitweb access
   ==> Also use git://git.openfabrics.org/ for normal git access (not  
through Apache, of course)
- http://<foo>.openfabrics.org/ -- ...any other non-authenticated vhost

IP address 2:
- http://bugs.openfabrics.org/ -- redirects to https
- https://bugs.openfabrics.org/ -- all bugzilla access

IP address 3:
- http://wiki.openfabrics.org/ -- read only wiki access
- https://wiki.openfabrics.org/ -- authenticated wiki access (I
don't know if it's possible to separate these two with tiki; if not,
just have http redirect to https)

IP address 4:
- http://svn.openfabrics.org/ -- read only SVN access
- https://svn.openfabrics.org/ -- authenticated SVN access
==> this vhost to possibly go away end of Q1'07

Comments?

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From kliteyn at dev.mellanox.co.il  Wed Dec 27 07:46:55 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 17:46:55 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
	compatability
Message-ID: <4592956F.3020501@dev.mellanox.co.il>

Hi Hal.

Fixing windows compilation problems
[V2 - Previous patch had an error]

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/include/iba/ib_types.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index 723e8b9..ec65b64 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -59,9 +59,10 @@ BEGIN_C_DECLS
          #define OSM_EXPORT	__declspec(dllimport)
     #endif
     #define OSM_API __stdcall
+    #define OSM_CDECL __cdecl
 #else
     #define OSM_EXPORT	extern
     #define OSM_API
+    #define OSM_CDECL
     #define __ptr64
 #endif
 
-- 
1.4.4.1.GIT
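
The patch itself does not show where OSM_CDECL gets used. One common
reason for such a macro is a callback handed to a C runtime routine
such as qsort(), which on Windows expects the __cdecl convention even
when the rest of the API is declared __stdcall. The sketch below
illustrates that pattern under this assumption; the _WIN32 test, the
comparator and the sort helper are hypothetical and are not taken from
the OpenSM sources.

```c
#include <stdlib.h>

/* Same idea as the patch; the _WIN32 condition is an assumption. */
#if defined(_WIN32)
#  define OSM_CDECL __cdecl
#else
#  define OSM_CDECL
#endif

/* Hypothetical comparator: orders 16-bit LIDs ascending.
 * Marked OSM_CDECL so it matches what the C runtime's qsort() expects. */
static int OSM_CDECL compare_lids(const void *a, const void *b)
{
	unsigned short x = *(const unsigned short *)a;
	unsigned short y = *(const unsigned short *)b;

	return (x > y) - (x < y);
}

/* Hypothetical helper that sorts an array of LIDs in place. */
static void sort_lids(unsigned short *lids, size_t count)
{
	qsort(lids, count, sizeof(lids[0]), compare_lids);
}
```

On non-Windows builds the macro expands to nothing, so the same source
compiles unchanged on Linux.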


From kliteyn at dev.mellanox.co.il  Wed Dec 27 07:47:23 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 17:47:23 +0200
Subject: [openib-general] [PATCH] osm: additional check of tree topology
Message-ID: <4592958B.7030102@dev.mellanox.co.il>

Hi Hal

As we've discussed before - added check for fat-tree topology
to be at least of rank 2.

--
Yevgeny

Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
 
Subject: [PATCH] Added additional check of tree topology

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/opensm/osm_ucast_ftree.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
index 054e3c9..0473135 100644
--- a/osm/opensm/osm_ucast_ftree.c
+++ b/osm/opensm/osm_ucast_ftree.c
@@ -2877,6 +2877,11 @@ __osm_ftree_construct_fabric(
                  "Fabric rank is %u (>%u) - "
                  "fat-tree routing falls back to default routing\n",
                  __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MAX_RANK);
+      else if (__osm_ftree_fabric_get_rank(p_ftree) < FAT_TREE_MIN_RANK)
+         osm_log(&p_ftree->p_osm->log, OSM_LOG_SYS,
+                 "Fabric rank is %u (<%u) - "
+                 "fat-tree routing falls back to default routing\n",
+                 __osm_ftree_fabric_get_rank(p_ftree), FAT_TREE_MIN_RANK);
       status = -1;
       goto Exit;
    }
-- 
1.4.4.1.GIT
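
As a standalone illustration of the check this patch adds, the sketch
below validates a rank against both bounds. It is not the OpenSM code:
the helper name is hypothetical, the messages go to stderr instead of
osm_log(), and the values of the two limits are assumptions (a minimum
of 2 as described above, and a maximum of 8).

```c
#include <stdio.h>

/* Assumed values for this sketch; the real definitions are in
 * osm_ucast_ftree.c and are not shown in the patch. */
#define FAT_TREE_MIN_RANK 2u
#define FAT_TREE_MAX_RANK 8u

/* Hypothetical helper: returns 0 if fat-tree routing can handle a fabric
 * of this rank, -1 if the caller should fall back to default routing. */
static int ftree_check_rank(unsigned int rank)
{
	if (rank > FAT_TREE_MAX_RANK) {
		fprintf(stderr, "Fabric rank is %u (>%u) - "
			"fat-tree routing falls back to default routing\n",
			rank, FAT_TREE_MAX_RANK);
		return -1;
	}
	if (rank < FAT_TREE_MIN_RANK) {
		fprintf(stderr, "Fabric rank is %u (<%u) - "
			"fat-tree routing falls back to default routing\n",
			rank, FAT_TREE_MIN_RANK);
		return -1;
	}
	return 0;
}
```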


From kliteyn at dev.mellanox.co.il  Wed Dec 27 08:19:23 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 18:19:23 +0200
Subject: [openib-general] [PATCH] osm: fat-tree documentation
Message-ID: <45929D0B.3090308@dev.mellanox.co.il>

Hi Hal.

Added fat-tree routing details and some cosmetics in the txt files.

--
Yevgeny

Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
---
 osm/doc/current-routing.txt |   57 ++++++++++++++++++++++++++++++++++++++----
 osm/doc/modular-routing.txt |    4 +-
 2 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/osm/doc/current-routing.txt b/osm/doc/current-routing.txt
index e58ae1f..da050c6 100644
--- a/osm/doc/current-routing.txt
+++ b/osm/doc/current-routing.txt
@@ -1,5 +1,5 @@
 Current OpenSM Routing
-12/20/06
+12/27/06
 
 OpenSM offers three routing engines:
 
@@ -11,11 +11,10 @@ node, but it is constrained to ranking r
 if the subnet is not a pure Fat Tree, and deadlock may occur due to a 
 loop in the subnet.
 
-3.  Fat Tree Unicast routing algorithm - this algorithm optimizes routing
-for congestion-free "shift" communication pattern. 
-It should be chosen if a subnet is a symmetrical Fat Trees of various types,
-not just K-ary-N-Trees: non-constant K, not fully staffed, any CBB ratio.
-Similar to UPDN, Fat Tree routing is constrained to ranking rules.
+3.  Fat-tree Unicast routing algorithm - this algorithm optimizes routing
+Of fat-trees for congestion-free "shift" communication pattern. 
+It should be chosen if a subnet is a symmetrical fat-tree. 
+Similar to UPDN, Fat-tree routing is credit-loop-free.
 
 OpenSM now also offers a file method which can load routes from a table. See 
 modular-routing.txt for more information on this.
@@ -73,6 +72,7 @@ switches will be skipped. Multicast is n
 
 
 Min Hop Algorithm
+-----------------
 
 The Min Hop algorithm is invoked when neither UPDN or the file method are
 specified.
@@ -91,6 +91,9 @@ port GUID. The latter is supplied by:
 LMC awareness routes based on (remote) system or switch basis.
 
 
+UPDN Routing Algorithm
+----------------------
+
 Purpose of UPDN Algorithm
 
 The UPDN algorithm is designed to prevent deadlocks from occurring in loops 
@@ -151,3 +154,45 @@ To learn more about deadlock-free routin
 "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" 
 by William J Dally and Charles L Seitz (1985).
 
+
+Fat-tree Routing Algorithm
+--------------------------
+
+Purpose:
+
+The fat-tree algorithm optimizes routing for "shift" communication pattern. 
+It should be chosen if a subnet is a symmetrical fat-tree of various types.
+It supports not just K-ary-N-Trees, by handling for non-constant K, 
+cases where not all leafs (HCAs) are present, any CBB ratio.
+As in UPDN, fat-tree also prevents credit-loop-deadlocks.
+Fat-tree algorithm supports topologies that comply with the following rules:
+  - Tree rank should be between two and eight (inclusively)
+  - Switches of the same rank should have the same number
+    of UP-going port groups*, unless they are root switches,
+    in which case the shouldn't have UP-going ports at all.
+  - Switches of the same rank should have the same number
+    of DOWN-going port groups, unless they are leaf switches.
+  - Switches of the same rank should have the same number
+    of ports in each UP-going port group.
+  - Switches of the same rank should have the same number
+    of ports in each DOWN-going port group.
+*ports that are connected to the same remote switch are referenced as 'port group'. 
+
+Note that although fat-tree algorithm supports trees with non-integer CBB 
+ratio, the routing will not be as balanced as in case of integer CBB ratio.
+In addition to this, although the algorithm allows leaf switches to have any 
+number of HCAs, the closer the tree to be fully populated, the more effective
+the "shift" communication pattern will be.
+
+The algorithm also dumps HCA ordering file (osm-ftree-ca-order.dump) in the
+same directory where the OpenSM log resides. This ordering file provides the 
+HCA order that may be used to create efficient communication pattern, that
+will match the routing tables.
+
+
+Usage:
+
+Activation through OpenSM
+
+Use '-R ftree' option to activate the fat-tree algorithm.
+
diff --git a/osm/doc/modular-routing.txt b/osm/doc/modular-routing.txt
index 3708e1b..86677d0 100644
--- a/osm/doc/modular-routing.txt
+++ b/osm/doc/modular-routing.txt
@@ -6,8 +6,8 @@ for ease of "plugging" new routing modul
 Currently, only unicast callbacks are supported. Multicast
 can be added later.
 
-One existing routing module is up-down "updn", which may be
-activate with '-R updn' option (instead of old '-u').
+One of existing routing modules is up-down "updn", which may
+be activate with '-R updn' option (instead of old '-u').
 
 General usage is:
 $ opensm -R 'module-name'
-- 
1.4.4.1.GIT


From halr at voltaire.com  Wed Dec 27 08:37:26 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:37:26 -0500
Subject: [openib-general] [PATCH 0/4] OpenSM: Add optional SA
	SwitchInfoRecord support
Message-ID: <1167237443.29620.74762.camel@hal.voltaire.com>

OpenSM: Add optional SA SwitchInfoRecord support

This patch adds support for the optional SA SwitchInfoRecord.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>


From halr at voltaire.com  Wed Dec 27 08:41:20 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:41:20 -0500
Subject: [openib-general] [PATCH 1/4] OpenSM/ib_types.h: Add needed
 SwitchInfoRecord component masks
Message-ID: <1167237447.29620.74764.camel@hal.voltaire.com>

OpenSM/ib_types.h: Add needed SwitchInfoRecord component masks

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index 91304e2..897e839 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -2361,6 +2361,10 @@ typedef struct _ib_path_rec
 #define IB_PKEY_COMPMASK_BLOCK            (CL_HTON64(((uint64_t)1)<<1))
 #define IB_PKEY_COMPMASK_PORT             (CL_HTON64(((uint64_t)1)<<2))
 
+/* Switch Info Record Masks */
+#define IB_SWIR_COMPMASK_LID		  (CL_HTON64(((uint64_t)1)<<0))
+#define IB_SWIR_COMPMASK_RESERVED1	  (CL_HTON64(((uint64_t)1)<<1))
+
 /* LFT Record Masks */
 #define IB_LFTR_COMPMASK_LID              (CL_HTON64(((uint64_t)1)<<0))
 #define IB_LFTR_COMPMASK_BLOCK            (CL_HTON64(((uint64_t)1)<<1))
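
For readers unfamiliar with SA component masks, the following sketch
shows how a mask bit defined this way can be tested against the mask
from a query. The mask travels in network byte order, which is why
each bit is wrapped in a host-to-network conversion. This is only an
illustration: hton64() is a hypothetical stand-in for complib's
CL_HTON64, and the EX_ names and the helper function are invented for
the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for CL_HTON64: returns a value whose in-memory
 * byte layout is big-endian, whatever the host byte order is. */
static uint64_t hton64(uint64_t host)
{
	unsigned char bytes[8];
	uint64_t net;
	int i;

	for (i = 0; i < 8; i++)
		bytes[i] = (unsigned char)(host >> (56 - 8 * i));
	memcpy(&net, bytes, sizeof(net));
	return net;
}

/* Example masks built the same way as the ones added by the patch. */
#define EX_SWIR_COMPMASK_LID        (hton64(((uint64_t)1) << 0))
#define EX_SWIR_COMPMASK_RESERVED1  (hton64(((uint64_t)1) << 1))

/* Returns nonzero if the query asked to match on the LID component.
 * comp_mask_net is the mask exactly as carried in the SA MAD. */
static int swir_query_has_lid(uint64_t comp_mask_net)
{
	return (comp_mask_net & EX_SWIR_COMPMASK_LID) != 0;
}
```

Because both sides of the AND are already in network order, no
conversion back to host order is needed for the test itself.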


From halr at voltaire.com  Wed Dec 27 08:41:25 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:41:25 -0500
Subject: [openib-general] [PATCH 2/4] OpenSM: Add optional SA
	SwitchInfoRecord support
Message-ID: <1167237674.29620.74964.camel@hal.voltaire.com>

OpenSM: Add optional SA SwitchInfoRecord support

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_sa_sw_info_record.h b/osm/include/opensm/osm_sa_sw_info_record.h
new file mode 100644
index 0000000..c6b421f
--- /dev/null
+++ b/osm/include/opensm/osm_sa_sw_info_record.h
@@ -0,0 +1,306 @@
+/*
+ * Copyright (c) 2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * 	Declaration of osm_sir_rcv_t.
+ *	This object represents the SwitchInfo Receiver object, which
+ *	receives the SwitchInfo attribute from a switch node.
+ *	This object is part of the OpenSM family of objects.
+ *
+ * Environment:
+ * 	Linux User Mode
+ *
+ */
+
+#ifndef _OSM_SIR_RCV_H_
+#define _OSM_SIR_RCV_H_
+
+#include <complib/cl_passivelock.h>
+#include <opensm/osm_base.h>
+#include <opensm/osm_madw.h>
+#include <opensm/osm_req.h>
+#include <opensm/osm_state_mgr.h>
+#include <opensm/osm_sa_response.h>
+#include <opensm/osm_subnet.h>
+#include <opensm/osm_log.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/****h* OpenSM/Switch Info Receiver
+* NAME
+*	Switch Info Receiver
+*
+* DESCRIPTION
+*	The Switch Info Receiver object encapsulates the information
+*	needed to receive the SwitchInfo attribute from a switch node.
+*
+*	The Switch Info Receiver object is thread safe.
+*
+*	This object should be treated as opaque and should be
+*	manipulated only through the provided functions.
+*
+* AUTHOR
+*	Hal Rosenstock, Voltaire 
+*
+*********/
+
+/****s* OpenSM: Switch Info Receiver/osm_sir_rcv_t
+* NAME
+*	osm_sir_rcv_t
+*
+* DESCRIPTION
+*	Switch Info Receiver structure.
+*
+*	This object should be treated as opaque and should
+*	be manipulated only through the provided functions.
+*
+* SYNOPSIS
+*/
+typedef struct _osm_sir_rcv
+{
+	osm_subn_t				*p_subn;
+	osm_sa_resp_t				*p_resp;
+	osm_mad_pool_t				*p_mad_pool;
+	osm_log_t				*p_log;
+	osm_req_t				*p_req;
+	osm_state_mgr_t				*p_state_mgr;
+	cl_plock_t				*p_lock;
+	cl_qlock_pool_t				pool;
+} osm_sir_rcv_t;
+/*
+* FIELDS
+*	p_subn
+*		Pointer to the Subnet object for this subnet.
+*
+*	p_log
+*		Pointer to the log object.
+*
+*	p_req
+*		Pointer to the Request object.
+*
+*	p_state_mgr
+*		Pointer to the State Manager object.
+*
+*	p_lock
+*		Pointer to the serializing lock.
+*
+* SEE ALSO
+*	Switch Info Receiver object
+*********/
+
+/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_construct
+* NAME
+*	osm_sir_rcv_construct
+*
+* DESCRIPTION
+*	This function constructs a Switch Info Receiver object.
+*
+* SYNOPSIS
+*/
+void osm_sir_rcv_construct(
+	IN osm_sir_rcv_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to a Switch Info Receiver object to construct.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Allows calling osm_sir_rcv_init, osm_sir_rcv_destroy,
+*	and osm_sir_rcv_is_inited.
+*
+*	Calling osm_sir_rcv_construct is a prerequisite to calling any other
+*	method except osm_sir_rcv_init.
+*
+* SEE ALSO
+*	Switch Info Receiver object, osm_sir_rcv_init,
+*	osm_sir_rcv_destroy, osm_sir_rcv_is_inited
+*********/
+
+/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_destroy
+* NAME
+*	osm_sir_rcv_destroy
+*
+* DESCRIPTION
+*	The osm_sir_rcv_destroy function destroys the object, releasing
+*	all resources.
+*
+* SYNOPSIS
+*/
+void osm_sir_rcv_destroy(
+	IN osm_sir_rcv_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to the object to destroy.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Performs any necessary cleanup of the specified
+*	Switch Info Receiver object.
+*	Further operations should not be attempted on the destroyed object.
+*	This function should only be called after a call to
+*	osm_sir_rcv_construct or osm_sir_rcv_init.
+*
+* SEE ALSO
+*	Switch Info Receiver object, osm_sir_rcv_construct,
+*	osm_sir_rcv_init
+*********/
+
+/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_init
+* NAME
+*	osm_sir_rcv_init
+*
+* DESCRIPTION
+*	The osm_sir_rcv_init function initializes a
+*	Switch Info Receiver object for use.
+*
+* SYNOPSIS
+*/
+ib_api_status_t osm_sir_rcv_init(
+	IN osm_sir_rcv_t* const p_rcv,
+	IN osm_sa_resp_t* const p_resp,
+	IN osm_mad_pool_t* const p_mad_pool,
+	IN osm_subn_t* const p_subn,
+	IN osm_log_t* const p_log,
+	IN cl_plock_t* const p_lock );
+/*
+* PARAMETERS
+*	p_rcv
+*		[in] Pointer to an osm_sir_rcv_t object to initialize.
+*
+*	p_resp
+*		[in] Pointer to the SA Responder object.
+*
+*	p_mad_pool
+*		[in] Pointer to the mad pool.
+*
+*	p_subn
+*		[in] Pointer to the Subnet object for this subnet.
+*
+*	p_log
+*		[in] Pointer to the log object.
+*
+*	p_lock
+*		[in] Pointer to the OpenSM serializing lock.
+*
+* RETURN VALUES
+*	IB_SUCCESS if the Switch Info Receiver object was initialized
+*	successfully.
+*
+* NOTES
+*	Allows calling other Switch Info Receiver methods.
+*
+* SEE ALSO
+*	Switch Info Receiver object, osm_sir_rcv_construct,
+*	osm_sir_rcv_destroy, osm_sir_rcv_is_inited
+*********/
+
+/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_is_inited
+* NAME
+*	osm_sir_rcv_is_inited
+*
+* DESCRIPTION
+*	Indicates if the object has been initialized with osm_sir_rcv_init.
+*
+* SYNOPSIS
+*/
+boolean_t osm_sir_rcv_is_inited(
+	IN const osm_sir_rcv_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_sir_rcv_t object.
+*
+* RETURN VALUES
+*	TRUE if the object was initialized successfully,
+*	FALSE otherwise.
+*
+* NOTES
+*	The osm_sir_rcv_construct or osm_sir_rcv_init must be
+*	called before using this function.
+*
+* SEE ALSO
+*	Switch Info Receiver object, osm_sir_rcv_construct,
+*	osm_sir_rcv_init
+*********/
+
+/****f* OpenSM: Switch Info Receiver/osm_sir_rcv_process
+* NAME
+*	osm_sir_rcv_process
+*
+* DESCRIPTION
+*	Process the SwitchInfo attribute.
+*
+* SYNOPSIS
+*/
+void osm_sir_rcv_process(
+	IN osm_sir_rcv_t* const p_ctrl,
+	IN const osm_madw_t*   const p_madw );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_sir_rcv_t object.
+*
+*	p_madw
+*		[in] Pointer to the MAD Wrapper containing the MAD
+*		that contains the node's SwitchInfo attribute.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	This function processes a SwitchInfo attribute.
+*
+* SEE ALSO
+*	Switch Info Receiver, Switch Info Response Controller
+*********/
+
+END_C_DECLS
+
+#endif	/* _OSM_SIR_RCV_H_ */
diff --git a/osm/include/opensm/osm_sa_sw_info_record_ctrl.h b/osm/include/opensm/osm_sa_sw_info_record_ctrl.h
new file mode 100644
index 0000000..b58654f
--- /dev/null
+++ b/osm/include/opensm/osm_sa_sw_info_record_ctrl.h
@@ -0,0 +1,259 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * 	Declaration of osm_sir_rcv_ctrl_t.
+ *	This object represents a controller that receives the IBA SwitchInfo
+ *	attribute from a switch node.
+ *	This object is part of the OpenSM family of objects.
+ *
+ * Environment:
+ * 	Linux User Mode
+ *
+ */
+
+#ifndef _OSM_SIR_RCV_CTRL_H_
+#define _OSM_SIR_RCV_CTRL_H_
+
+#include <complib/cl_dispatcher.h>
+#include <opensm/osm_base.h>
+#include <opensm/osm_madw.h>
+#include <opensm/osm_sa_sw_info_record.h>
+#include <opensm/osm_log.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/****h* OpenSM/Switch Info Receive Controller
+* NAME
+*	Switch Info Receive Controller
+*
+* DESCRIPTION
+*	The Switch Info Receive Controller object encapsulates the information
+*	needed to receive the SwitchInfo attribute from a switch node.
+*
+*	The Switch Info Receive Controller object is thread safe.
+*
+*	This object should be treated as opaque and should be
+*	manipulated only through the provided functions.
+*
+* AUTHOR
+*	Hal Rosenstock, Voltaire
+*
+*********/
+
+/****s* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_t
+* NAME
+*	osm_sir_rcv_ctrl_t
+*
+* DESCRIPTION
+*	Switch Info Receive Controller structure.
+*
+*	This object should be treated as opaque and should
+*	be manipulated only through the provided functions.
+*
+* SYNOPSIS
+*/
+typedef struct _osm_sir_rcv_ctrl
+{
+	osm_sir_rcv_t			*p_rcv;
+	osm_log_t			*p_log;
+	cl_dispatcher_t			*p_disp;
+	cl_disp_reg_handle_t		h_disp;
+} osm_sir_rcv_ctrl_t;
+/*
+* FIELDS
+*	p_rcv
+*		Pointer to the Switch Info Receiver object.
+*
+*	p_log
+*		Pointer to the log object.
+*
+*	p_disp
+*		Pointer to the Dispatcher.
+*
+*	h_disp
+*		Handle returned from dispatcher registration.
+*
+* SEE ALSO
+*	Switch Info Receive Controller object
+*	Switch Info Receiver object
+*********/
+
+/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_construct
+* NAME
+*	osm_sir_rcv_ctrl_construct
+*
+* DESCRIPTION
+*	This function constructs a Switch Info Receive Controller object.
+*
+* SYNOPSIS
+*/
+void osm_sir_rcv_ctrl_construct(
+	IN osm_sir_rcv_ctrl_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to a Switch Info Receive Controller
+*		object to construct.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Allows calling osm_sir_rcv_ctrl_init, osm_sir_rcv_ctrl_destroy,
+*	and osm_sir_rcv_ctrl_is_inited.
+*
+*	Calling osm_sir_rcv_ctrl_construct is a prerequisite to calling any
+*	other method except osm_sir_rcv_ctrl_init.
+*
+* SEE ALSO
+*	Switch Info Receive Controller object, osm_sir_rcv_ctrl_init,
+*	osm_sir_rcv_ctrl_destroy, osm_sir_rcv_ctrl_is_inited
+*********/
+
+/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_destroy
+* NAME
+*	osm_sir_rcv_ctrl_destroy
+*
+* DESCRIPTION
+*	The osm_sir_rcv_ctrl_destroy function destroys the object, releasing
+*	all resources.
+*
+* SYNOPSIS
+*/
+void osm_sir_rcv_ctrl_destroy(
+	IN osm_sir_rcv_ctrl_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to the object to destroy.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Performs any necessary cleanup of the specified
+*	Switch Info Receive Controller object.
+*	Further operations should not be attempted on the destroyed object.
+*	This function should only be called after a call to
+*	osm_sir_rcv_ctrl_construct or osm_sir_rcv_ctrl_init.
+*
+* SEE ALSO
+*	Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct,
+*	osm_sir_rcv_ctrl_init
+*********/
+
+/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_init
+* NAME
+*	osm_sir_rcv_ctrl_init
+*
+* DESCRIPTION
+*	The osm_sir_rcv_ctrl_init function initializes a
+*	Switch Info Receive Controller object for use.
+*
+* SYNOPSIS
+*/
+ib_api_status_t osm_sir_rcv_ctrl_init(
+	IN osm_sir_rcv_ctrl_t* const p_ctrl,
+	IN osm_sir_rcv_t* const p_rcv,
+	IN osm_log_t* const p_log,
+	IN cl_dispatcher_t* const p_disp );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_sir_rcv_ctrl_t object to initialize.
+*
+*	p_rcv
+*		[in] Pointer to an osm_sir_rcv_t object.
+*
+*	p_log
+*		[in] Pointer to the log object.
+*
+*	p_disp
+*		[in] Pointer to the OpenSM central Dispatcher.
+*
+* RETURN VALUES
+*	IB_SUCCESS if the Switch Info Receive Controller object was initialized
+*	successfully.
+*
+* NOTES
+*	Allows calling other Switch Info Receive Controller methods.
+*
+* SEE ALSO
+*	Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct,
+*	osm_sir_rcv_ctrl_destroy, osm_sir_rcv_ctrl_is_inited
+*********/
+
+/****f* OpenSM: Switch Info Receive Controller/osm_sir_rcv_ctrl_is_inited
+* NAME
+*	osm_sir_rcv_ctrl_is_inited
+*
+* DESCRIPTION
+*	Indicates if the object has been initialized with osm_sir_rcv_ctrl_init.
+*
+* SYNOPSIS
+*/
+boolean_t osm_sir_rcv_ctrl_is_inited(
+	IN const osm_sir_rcv_ctrl_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_sir_rcv_ctrl_t object.
+*
+* RETURN VALUES
+*	TRUE if the object was initialized successfully,
+*	FALSE otherwise.
+*
+* NOTES
+*	Either osm_sir_rcv_ctrl_construct or osm_sir_rcv_ctrl_init must be
+*	called before using this function.
+*
+* SEE ALSO
+*	Switch Info Receive Controller object, osm_sir_rcv_ctrl_construct,
+*	osm_sir_rcv_ctrl_init
+*********/
+
+END_C_DECLS
+
+#endif	/* _OSM_SIR_RCV_CTRL_H_ */
diff --git a/osm/opensm/osm_sa_sw_info_record.c b/osm/opensm/osm_sa_sw_info_record.c
new file mode 100644
index 0000000..2da30ba
--- /dev/null
+++ b/osm/opensm/osm_sa_sw_info_record.c
@@ -0,0 +1,530 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Implementation of osm_sir_rcv_t.
+ * This object represents the SwitchInfo Receiver object.
+ * This object is part of the opensm family of objects.
+ *
+ * Environment:
+ *    Linux User Mode
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <string.h>
+#include <iba/ib_types.h>
+#include <complib/cl_debug.h>
+#include <complib/cl_qlist.h>
+#include <opensm/osm_sa_sw_info_record.h>
+#include <opensm/osm_node.h>
+#include <vendor/osm_vendor_api.h>
+#include <opensm/osm_helper.h>
+#include <opensm/osm_pkey.h>
+
+#define OSM_SIR_RCV_POOL_MIN_SIZE    32
+#define OSM_SIR_RCV_POOL_GROW_SIZE   32
+
+typedef  struct _osm_sir_item
+{
+  cl_pool_item_t           pool_item;
+  ib_switch_info_record_t  rec;
+} osm_sir_item_t;
+
+typedef  struct _osm_sir_search_ctxt
+{
+  const ib_switch_info_record_t* p_rcvd_rec;
+  ib_net64_t               comp_mask;
+  cl_qlist_t*              p_list;
+  osm_sir_rcv_t*           p_rcv;
+  const osm_physp_t*       p_req_physp;
+} osm_sir_search_ctxt_t;
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_sir_rcv_construct(
+  IN osm_sir_rcv_t* const p_rcv )
+{
+  memset( p_rcv, 0, sizeof(*p_rcv) );
+  cl_qlock_pool_construct( &p_rcv->pool );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_sir_rcv_destroy(
+  IN osm_sir_rcv_t* const p_rcv )
+{
+  OSM_LOG_ENTER( p_rcv->p_log, osm_sir_rcv_destroy );
+  cl_qlock_pool_destroy( &p_rcv->pool );
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t
+osm_sir_rcv_init(
+  IN osm_sir_rcv_t*        const p_rcv,
+  IN osm_sa_resp_t*        const p_resp,
+  IN osm_mad_pool_t*       const p_mad_pool,
+  IN osm_subn_t*           const p_subn,
+  IN osm_log_t*            const p_log,
+  IN cl_plock_t*           const p_lock )
+{
+  ib_api_status_t          status;
+
+  OSM_LOG_ENTER( p_log, osm_sir_rcv_init );
+
+  osm_sir_rcv_construct( p_rcv );
+
+  p_rcv->p_log = p_log;
+  p_rcv->p_subn = p_subn;
+  p_rcv->p_lock = p_lock;
+  p_rcv->p_resp = p_resp;
+  p_rcv->p_mad_pool = p_mad_pool;
+
+  status = cl_qlock_pool_init( &p_rcv->pool,
+                               OSM_SIR_RCV_POOL_MIN_SIZE,
+                               0,
+                               OSM_SIR_RCV_POOL_GROW_SIZE,
+                               sizeof(osm_sir_item_t),
+                               NULL, NULL, NULL );
+
+  OSM_LOG_EXIT( p_log );
+  return( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_api_status_t
+__osm_sir_rcv_new_sir(
+  IN osm_sir_rcv_t*        const p_rcv,
+  IN const osm_switch_t*   const p_sw,
+  IN cl_qlist_t*           const p_list,
+  IN ib_net16_t            const lid )
+{
+  osm_sir_item_t*          p_rec_item;
+  ib_api_status_t          status = IB_SUCCESS;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_sir_rcv_new_sir );
+
+  p_rec_item = (osm_sir_item_t*)cl_qlock_pool_get( &p_rcv->pool );
+  if( p_rec_item == NULL )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_sir_rcv_new_sir: ERR 5308: "
+             "cl_qlock_pool_get failed\n" );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
+  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_sir_rcv_new_sir: "
+             "New SwitchInfoRecord: lid 0x%X\n",
+             cl_ntoh16( lid )
+             );
+  }
+
+  memset( &p_rec_item->rec, 0, sizeof(ib_switch_info_record_t) );
+
+  p_rec_item->rec.lid = lid;
+  p_rec_item->rec.switch_info = p_sw->switch_info;
+
+  cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+  return( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static osm_port_t*
+__osm_sir_get_port_by_guid(
+  IN osm_sir_rcv_t*   const p_rcv,
+  IN uint64_t         port_guid )
+{
+  osm_port_t*         p_port;
+
+  CL_PLOCK_ACQUIRE(p_rcv->p_lock);
+
+  p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl,
+                                     port_guid);
+  if (p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl))
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_sir_get_port_by_guid: ERR 5309: "
+             "Invalid port GUID 0x%016" PRIx64 "\n",
+             cl_ntoh64( port_guid ) );
+    p_port = NULL;
+  }
+
+  CL_PLOCK_RELEASE(p_rcv->p_lock);
+  return p_port;
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_sir_rcv_create_sir(
+  IN osm_sir_rcv_t*        const p_rcv,
+  IN const osm_switch_t*   const p_sw,
+  IN cl_qlist_t*           const p_list,
+  IN ib_net16_t            const match_lid,
+  IN const osm_physp_t*    const p_req_physp )
+{
+  osm_port_t*              p_port;
+  const osm_physp_t*       p_physp;
+  uint16_t                 match_lid_ho;
+  uint16_t                 min_lid_ho;
+  uint16_t                 max_lid_ho;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_sir_rcv_create_sir );
+
+  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_sir_rcv_create_sir: "
+             "Looking for SwitchInfoRecord with LID: 0x%X\n",
+             cl_ntoh16( match_lid )
+             );
+  }
+
+  /* In switches, the port guid is the node guid. */
+  p_port =
+    __osm_sir_get_port_by_guid( p_rcv, p_sw->p_node->node_info.port_guid );
+  if (! p_port)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_sir_rcv_create_sir: ERR 530A: "
+             "Failed to find Port by Node Guid:0x%016" PRIx64
+             "\n",
+             cl_ntoh64( p_sw->p_node->node_info.node_guid )
+             );
+    goto Exit;
+  }
+
+  /* check that the requester physp and the current physp are under
+     the same partition. */
+  p_physp = osm_port_get_default_phys_ptr( p_port );
+  if (! p_physp)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_sir_rcv_create_sir: ERR 530B: "
+             "Failed to find default physical Port by Node Guid:0x%016" PRIx64
+             "\n",
+             cl_ntoh64( p_sw->p_node->node_info.node_guid )
+             );
+    goto Exit;
+  }
+  if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp ))
+    goto Exit;
+
+  /* get the LID range of port 0 of the switch */
+  osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho );
+
+  match_lid_ho = cl_ntoh16( match_lid );
+  if( match_lid_ho )
+  {
+    /*
+      We validate that the lid belongs to this switch.
+    */
+    if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+               "__osm_sir_rcv_create_sir: "
+               "Comparing LID: 0x%X <= 0x%X <= 0x%X\n",
+               min_lid_ho, match_lid_ho, max_lid_ho
+               );
+    }
+
+    if ( match_lid_ho < min_lid_ho || match_lid_ho > max_lid_ho )
+      goto Exit;
+
+  }
+
+  __osm_sir_rcv_new_sir( p_rcv, p_sw, p_list, osm_port_get_base_lid(p_port) );
+
+Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_sir_rcv_by_comp_mask(
+  IN cl_map_item_t*        const p_map_item,
+  IN void*                 context )
+{
+  const osm_sir_search_ctxt_t* const p_ctxt = (osm_sir_search_ctxt_t *)context;
+  const osm_switch_t*      const p_sw = (osm_switch_t*)p_map_item;
+  const ib_switch_info_record_t* const p_rcvd_rec = p_ctxt->p_rcvd_rec;
+  const osm_physp_t*       const p_req_physp = p_ctxt->p_req_physp;
+  osm_sir_rcv_t*           const p_rcv = p_ctxt->p_rcv;
+  ib_net64_t               const comp_mask = p_ctxt->comp_mask;
+  ib_net16_t               match_lid = 0;
+
+  OSM_LOG_ENTER( p_ctxt->p_rcv->p_log, __osm_sir_rcv_by_comp_mask );
+
+  osm_dump_switch_info(
+    p_ctxt->p_rcv->p_log,
+    &p_sw->switch_info,
+    OSM_LOG_VERBOSE );
+
+  if( comp_mask & IB_SWIR_COMPMASK_LID )
+    match_lid = p_rcvd_rec->lid;
+
+  __osm_sir_rcv_create_sir( p_rcv, p_sw, p_ctxt->p_list,
+                            match_lid, p_req_physp );
+
+  OSM_LOG_EXIT( p_ctxt->p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_sir_rcv_process(
+  IN osm_sir_rcv_t*        const p_rcv,
+  IN const osm_madw_t*     const p_madw )
+{
+  const ib_sa_mad_t*       p_rcvd_mad;
+  const ib_switch_info_record_t*  p_rcvd_rec;
+  ib_switch_info_record_t*        p_resp_rec;
+  cl_qlist_t               rec_list;
+  osm_madw_t*              p_resp_madw;
+  ib_sa_mad_t*             p_resp_sa_mad;
+  uint32_t                 num_rec, pre_trim_num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  uint32_t                 trim_num_rec;
+#endif
+  uint32_t                 i;
+  osm_sir_search_ctxt_t    context;
+  osm_sir_item_t*          p_rec_item;
+  ib_api_status_t          status;
+  osm_physp_t*             p_req_physp;
+
+  CL_ASSERT( p_rcv );
+
+  OSM_LOG_ENTER( p_rcv->p_log, osm_sir_rcv_process );
+
+  CL_ASSERT( p_madw );
+
+  p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw );
+  p_rcvd_rec = (ib_switch_info_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad );
+
+  CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_SWITCH_INFO_RECORD );
+
+  /* we only support SubnAdmGet and SubnAdmGetTable methods */
+  if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) &&
+       (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_sir_rcv_process: ERR 5305: "
+             "Unsupported Method (%s)\n",
+             ib_get_sa_method_str( p_rcvd_mad->method ) );
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR );
+    goto Exit;
+  }
+
+  /* update the requester physical port. */
+  p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log,
+                                          p_rcv->p_subn,
+                                          osm_madw_get_mad_addr_ptr(p_madw) );
+  if (p_req_physp == NULL)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_sir_rcv_process: ERR 5304: "
+             "Cannot find requester physical port\n" );
+    goto Exit;
+  }
+
+  if ( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+    osm_dump_switch_info_record( p_rcv->p_log, p_rcvd_rec, OSM_LOG_DEBUG );
+
+  cl_qlist_init( &rec_list );
+
+  context.p_rcvd_rec    = p_rcvd_rec;
+  context.p_list        = &rec_list;
+  context.comp_mask     = p_rcvd_mad->comp_mask;
+  context.p_rcv         = p_rcv;
+  context.p_req_physp   = p_req_physp;
+
+  cl_plock_acquire( p_rcv->p_lock );
+
+  /* Go over all switches */
+  cl_qmap_apply_func( &p_rcv->p_subn->sw_guid_tbl,
+                      __osm_sir_rcv_by_comp_mask,
+                      &context );
+
+  cl_plock_release( p_rcv->p_lock );
+
+  num_rec = cl_qlist_count( &rec_list );
+
+  /*
+   * C15-0.1.30:
+   * If we do a SubnAdmGet and got more than one record it is an error !
+   */
+  if ( (p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec > 1) ) {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_sir_rcv_process: ERR 5303: "
+             "Got more than one record for SubnAdmGet (%u)\n",
+             num_rec );
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_TOO_MANY_RECORDS );
+
+    /* return the collected record items to the pool */
+    p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list );
+    while( p_rec_item != (osm_sir_item_t*)cl_qlist_end( &rec_list ) )
+    {
+      cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+      p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list );
+    }
+
+    goto Exit;
+  }
+
+  pre_trim_num_rec = num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we limit the number of records to a single packet */
+  trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_switch_info_record_t);
+  if (trim_num_rec < num_rec)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
+             "osm_sir_rcv_process: "
+             "Number of records:%u trimmed to:%u to fit in one MAD\n",
+             num_rec, trim_num_rec );
+    num_rec = trim_num_rec;
+  }
+#endif
+
+  osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+           "osm_sir_rcv_process: "
+           "Returning %u records\n", num_rec );
+
+  if ((p_rcvd_mad->method == IB_MAD_METHOD_GET) && (num_rec == 0))
+  {
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS );
+    goto Exit;
+  }
+
+  /*
+   * Get a MAD for the reply. The address is in the received MAD wrapper.
+   */
+  p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool,
+                                  p_madw->h_bind,
+                                  num_rec * sizeof(ib_switch_info_record_t) + IB_SA_MAD_HDR_SIZE,
+                                  &p_madw->mad_addr );
+
+  if( !p_resp_madw )
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_sir_rcv_process: ERR 5306: "
+            "osm_mad_pool_get failed\n" );
+
+    for( i = 0; i < pre_trim_num_rec; i++ )
+    {
+      p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list );
+      cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    }
+
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RESOURCES );
+    goto Exit;
+  }
+
+  p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw );
+
+  /*
+    Copy the MAD header back into the response mad.
+    Set the 'R' bit and the payload length,
+    Then copy all records from the list into the response payload.
+  */
+
+  memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE );
+  p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK;
+  /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
+  p_resp_sa_mad->sm_key = 0;
+  /* Fill in the offset (paylen will be done by the rmpp SAR) */
+  p_resp_sa_mad->attr_offset =
+    ib_get_attr_offset( sizeof(ib_switch_info_record_t) );
+
+  p_resp_rec = (ib_switch_info_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
+
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we support only one packet RMPP - so we will set the first and
+     last flags for gettable */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+  {
+    p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA;
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE;
+  }
+#else
+  /* forcefully define the packet as RMPP one */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE;
+#endif
+
+  for( i = 0; i < pre_trim_num_rec; i++ )
+  {
+    p_rec_item = (osm_sir_item_t*)cl_qlist_remove_head( &rec_list );
+    /* copy only if not trimmed */
+    if (i < num_rec)
+    {
+      *p_resp_rec = p_rec_item->rec;
+      p_resp_rec++;
+    }
+    cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+  }
+
+  CL_ASSERT( cl_is_qlist_empty( &rec_list ) );
+
+  status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE );
+  if (status != IB_SUCCESS)
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_sir_rcv_process: ERR 5307: "
+            "osm_vendor_send status = %s\n",
+            ib_get_err_str(status));
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
diff --git a/osm/opensm/osm_sa_sw_info_record_ctrl.c b/osm/opensm/osm_sa_sw_info_record_ctrl.c
new file mode 100644
index 0000000..daf55cc
--- /dev/null
+++ b/osm/opensm/osm_sa_sw_info_record_ctrl.c
@@ -0,0 +1,123 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Implementation of osm_sir_rcv_ctrl_t.
+ * This object represents the SwitchInfo Record controller object.
+ * This object is part of the opensm family of objects.
+ *
+ * Environment:
+ *    Linux User Mode
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <string.h>
+#include <opensm/osm_sa_sw_info_record_ctrl.h>
+#include <opensm/osm_msgdef.h>
+
+/**********************************************************************
+ **********************************************************************/
+void
+__osm_sir_ctrl_disp_callback(
+  IN  void *context,
+  IN  void *p_data )
+{
+  /* ignore return status when invoked via the dispatcher */
+  osm_sir_rcv_process( ((osm_sir_rcv_ctrl_t*)context)->p_rcv,
+                       (osm_madw_t*)p_data );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_sir_rcv_ctrl_construct(
+  IN osm_sir_rcv_ctrl_t* const p_ctrl )
+{
+  memset( p_ctrl, 0, sizeof(*p_ctrl) );
+  p_ctrl->h_disp = CL_DISP_INVALID_HANDLE;
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_sir_rcv_ctrl_destroy(
+  IN osm_sir_rcv_ctrl_t* const p_ctrl )
+{
+  CL_ASSERT( p_ctrl );
+  cl_disp_unregister( p_ctrl->h_disp );
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t
+osm_sir_rcv_ctrl_init(
+  IN osm_sir_rcv_ctrl_t* const p_ctrl,
+  IN osm_sir_rcv_t* const p_rcv,
+  IN osm_log_t* const p_log,
+  IN cl_dispatcher_t* const p_disp )
+{
+  ib_api_status_t status = IB_SUCCESS;
+
+  OSM_LOG_ENTER( p_log, osm_sir_rcv_ctrl_init );
+
+  osm_sir_rcv_ctrl_construct( p_ctrl );
+  p_ctrl->p_log = p_log;
+  p_ctrl->p_rcv = p_rcv;
+  p_ctrl->p_disp = p_disp;
+
+  p_ctrl->h_disp = cl_disp_register(
+    p_disp,
+    OSM_MSG_MAD_SWITCH_INFO_RECORD,
+    __osm_sir_ctrl_disp_callback,
+    p_ctrl );
+
+  if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE )
+  {
+    osm_log( p_log, OSM_LOG_ERROR,
+             "osm_sir_rcv_ctrl_init: ERR 5301: "
+             "Dispatcher registration failed\n" );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( p_log );
+  return( status );
+}
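
For anyone following the non-RMPP path in osm_sir_rcv_process: the trim
count and the attribute offset are plain integer arithmetic on the record
size. Below is a minimal standalone sketch of that arithmetic. The three
sizes (256-byte MAD, 56-byte SA MAD header, 24-byte SwitchInfoRecord) and
all sketch_* names are assumptions made for illustration; the patch itself
takes them from MAD_BLOCK_SIZE, IB_SA_MAD_HDR_SIZE and
sizeof(ib_switch_info_record_t).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes for illustration only; not read from ib_types.h. */
enum {
  SKETCH_MAD_BLOCK_SIZE  = 256,
  SKETCH_SA_MAD_HDR_SIZE = 56,
  SKETCH_SIR_SIZE        = 24
};

/* Number of whole records that fit in the payload of a single MAD. */
static uint32_t sketch_records_per_mad( size_t mad_size, size_t hdr_size,
                                        size_t rec_size )
{
  assert( rec_size > 0 && mad_size >= hdr_size );
  return (uint32_t)( ( mad_size - hdr_size ) / rec_size );
}

/* Trim a record count the way the non-RMPP branch does:
   never return more than fits in one MAD. */
static uint32_t sketch_trim( uint32_t num_rec, uint32_t max_rec )
{
  return ( max_rec < num_rec ) ? max_rec : num_rec;
}

/* Attribute offset in host order: record size in units of 8 bytes.
   The real code additionally converts this to network order. */
static uint16_t sketch_attr_offset_ho( size_t rec_size )
{
  return (uint16_t)( rec_size >> 3 );
}
```

Under these assumed sizes, sketch_records_per_mad() yields 8, so a query
matching 20 switches would be answered with 8 records when
VENDOR_RMPP_SUPPORT is not defined, and the attribute offset would be 3.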


From halr at voltaire.com  Wed Dec 27 08:46:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:46:27 -0500
Subject: [openib-general] [PATCH 3/4] OpenSM: Other changes to incorporate
 optional SA SwitchInfoRecord support
Message-ID: <1167237684.29620.74966.camel@hal.voltaire.com>

OpenSM: Other changes to incorporate optional SA SwitchInfoRecord
support

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am
index cc90283..d051b9a 100644
--- a/osm/include/Makefile.am
+++ b/osm/include/Makefile.am
@@ -109,6 +109,8 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_sa_link_record_ctrl.h \
 	$(srcdir)/opensm/osm_sw_info_rcv_ctrl.h \
 	$(srcdir)/opensm/osm_sa_mcmember_record.h \
+	$(srcdir)/opensm/osm_sa_sw_info_record_ctrl.h \
+	$(srcdir)/opensm/osm_sa_sw_info_record.h \
 	$(srcdir)/opensm/osm_vl15intf.h \
 	$(srcdir)/opensm/osm_drop_mgr.h \
 	$(srcdir)/opensm/osm_port_info_rcv.h \
diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h
index a9fa613..3611025 100644
--- a/osm/include/opensm/osm_msgdef.h
+++ b/osm/include/opensm/osm_msgdef.h
@@ -195,6 +195,7 @@ enum
 	OSM_MSG_MAD_SLVL,
 	OSM_MSG_MAD_GUIDINFO_RECORD,
 	OSM_MSG_MAD_INFORM_INFO_RECORD,
+	OSM_MSG_MAD_SWITCH_INFO_RECORD,
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
 	OSM_MSG_MAD_MULTIPATH_RECORD,
 #endif
diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h
index 93324b2..ae8d5ac 100644
--- a/osm/include/opensm/osm_sa.h
+++ b/osm/include/opensm/osm_sa.h
@@ -76,6 +76,7 @@
 #include <opensm/osm_sa_vlarb_record_ctrl.h>
 #include <opensm/osm_sa_pkey_record_ctrl.h>
 #include <opensm/osm_sa_lft_record_ctrl.h>
+#include <opensm/osm_sa_sw_info_record_ctrl.h>
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -190,6 +191,10 @@ typedef struct _osm_sa
 	/* LinearForwardingTable Query */
 	osm_lftr_rcv_t				lftr_rcv;
 	osm_lftr_rcv_ctrl_t			lftr_rcv_ctrl;
+
+	/* SwitchInfo Query */
+	osm_sir_rcv_t				sir_rcv;
+	osm_sir_rcv_ctrl_t			sir_rcv_ctrl;
 } osm_sa_t;
 /*
 * FIELDS
diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
index 7c09e81..3ef246c 100644
--- a/osm/opensm/Makefile.am
+++ b/osm/opensm/Makefile.am
@@ -77,7 +77,8 @@ opensm_SOURCES = main.c osm_console.c os
 		 osm_sa_service_record_ctrl.c osm_sa_slvl_record.c \
 		 osm_sa_slvl_record_ctrl.c osm_sa_sminfo_record.c \
 		 osm_sa_sminfo_record_ctrl.c osm_sa_vlarb_record.c \
-		 osm_sa_vlarb_record_ctrl.c osm_service.c \
+		 osm_sa_vlarb_record_ctrl.c osm_sa_sw_info_record.c \
+		 osm_sa_sw_info_record_ctrl.c osm_service.c \
 		 osm_slvl_map_rcv.c osm_slvl_map_rcv_ctrl.c \
 		 osm_sm.c osm_sminfo_rcv.c \
 		 osm_sminfo_rcv_ctrl.c osm_sm_mad_ctrl.c \
diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c
index a6c475c..983d5e5 100644
--- a/osm/opensm/osm_sa.c
+++ b/osm/opensm/osm_sa.c
@@ -128,6 +128,9 @@ osm_sa_construct(
 
   osm_lftr_rcv_construct( &p_sa->lftr_rcv );
   osm_lftr_rcv_ctrl_construct( &p_sa->lftr_rcv_ctrl );
+
+  osm_sir_rcv_construct( &p_sa->sir_rcv );
+  osm_sir_rcv_ctrl_construct( &p_sa->sir_rcv_ctrl );
 }
 
 /**********************************************************************
@@ -159,6 +162,7 @@ osm_sa_shutdown(
   osm_slvl_rec_rcv_ctrl_destroy( &p_sa->slvl_rec_rcv_ctrl );
   osm_pkey_rec_rcv_ctrl_destroy( &p_sa->pkey_rec_rcv_ctrl );
   osm_lftr_rcv_ctrl_destroy( &p_sa->lftr_rcv_ctrl );
+  osm_sir_rcv_ctrl_destroy( &p_sa->sir_rcv_ctrl );
   osm_sa_mad_ctrl_destroy( &p_sa->mad_ctrl );
 
   OSM_LOG_EXIT( p_sa->p_log );
@@ -190,6 +194,7 @@ osm_sa_destroy(
   osm_slvl_rec_rcv_destroy( &p_sa->slvl_rec_rcv );
   osm_pkey_rec_rcv_destroy( &p_sa->pkey_rec_rcv );
   osm_lftr_rcv_destroy( &p_sa->lftr_rcv );
+  osm_sir_rcv_destroy( &p_sa->sir_rcv );
   osm_sa_resp_destroy( &p_sa->resp );
 
   OSM_LOG_EXIT( p_sa->p_log );
@@ -514,6 +519,24 @@ osm_sa_init(
   if( status != IB_SUCCESS )
     goto Exit;
 
+  status = osm_sir_rcv_init(
+    &p_sa->sir_rcv,
+    &p_sa->resp,
+    p_sa->p_mad_pool,
+    p_subn,
+    p_log,
+    p_lock);
+  if( status != IB_SUCCESS )
+    goto Exit;
+
+  status = osm_sir_rcv_ctrl_init(
+    &p_sa->sir_rcv_ctrl,
+    &p_sa->sir_rcv,
+    p_log,
+    p_disp );
+  if( status != IB_SUCCESS )
+    goto Exit;
+
  Exit:
   OSM_LOG_EXIT( p_log );
   return( status );
diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c
index 440d773..4d7bcbb 100644
--- a/osm/opensm/osm_sa_class_port_info.c
+++ b/osm/opensm/osm_sa_class_port_info.c
@@ -194,7 +194,6 @@ __osm_cpi_rcv_respond(
   /* set specific capability mask bits */
   /* we do not support the following optional records:
      OSM_CAP_IS_SUBN_OPT_RECS_SUP :
-     SwitchInfoRecord,
      RandomForwardingTableRecord,
      MulticastForwardingTableRecord,
      ServiceAssociationRecord
diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c
index 2605fbf..90c732d 100644
--- a/osm/opensm/osm_sa_mad_ctrl.c
+++ b/osm/opensm/osm_sa_mad_ctrl.c
@@ -212,6 +212,10 @@ __osm_sa_mad_ctrl_process(
     msg_id = OSM_MSG_MAD_INFORM_INFO_RECORD;
     break;
 
+  case IB_MAD_ATTR_SWITCH_INFO_RECORD:
+    msg_id = OSM_MSG_MAD_SWITCH_INFO_RECORD;
+    break;
+
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
   case IB_MAD_ATTR_MULTIPATH_RECORD:
     msg_id = OSM_MSG_MAD_MULTIPATH_RECORD;
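
The wiring added above follows the usual SA pattern: the MAD controller
maps the attribute ID to a message ID, and the receive controller has
registered a callback for that message ID with the dispatcher. Below is a
rough, self-contained sketch of that flow. Every sketch_* name is
hypothetical, and the callback is invoked synchronously here, whereas the
real cl_dispatcher queues messages for its own worker threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for SA attribute IDs and dispatcher message IDs. */
enum sketch_attr {
  SKETCH_ATTR_INFORM_INFO_RECORD,
  SKETCH_ATTR_SWITCH_INFO_RECORD
};

enum sketch_msg {
  SKETCH_MSG_NONE,
  SKETCH_MSG_INFORM_INFO_RECORD,
  SKETCH_MSG_SWITCH_INFO_RECORD,
  SKETCH_MSG_MAX
};

typedef void (*sketch_cb_t)( void *context, void *p_data );

typedef struct sketch_disp {
  sketch_cb_t cb[SKETCH_MSG_MAX];
  void       *context[SKETCH_MSG_MAX];
} sketch_disp_t;

/* Step 1: the MAD controller maps an attribute ID to a message ID. */
static enum sketch_msg sketch_attr_to_msg( enum sketch_attr attr )
{
  switch( attr )
  {
  case SKETCH_ATTR_INFORM_INFO_RECORD:
    return SKETCH_MSG_INFORM_INFO_RECORD;
  case SKETCH_ATTR_SWITCH_INFO_RECORD:
    return SKETCH_MSG_SWITCH_INFO_RECORD;
  }
  return SKETCH_MSG_NONE;
}

/* Step 2: a receive controller registers its callback for one message ID.
   Returns 0 on success, -1 if the slot is invalid or already taken. */
static int sketch_register( sketch_disp_t *d, enum sketch_msg msg,
                            sketch_cb_t cb, void *context )
{
  assert( d != NULL && cb != NULL );
  if( msg <= SKETCH_MSG_NONE || msg >= SKETCH_MSG_MAX || d->cb[msg] != NULL )
    return -1;
  d->cb[msg] = cb;
  d->context[msg] = context;
  return 0;
}

/* Step 3: an incoming MAD is handed to whoever registered for it.
   Returns -1 when nobody is registered for the attribute. */
static int sketch_post( sketch_disp_t *d, enum sketch_attr attr, void *p_data )
{
  enum sketch_msg msg = sketch_attr_to_msg( attr );

  assert( d != NULL );
  if( msg == SKETCH_MSG_NONE || d->cb[msg] == NULL )
    return -1;
  d->cb[msg]( d->context[msg], p_data );
  return 0;
}

/* Example callback: counts the MADs it was given. */
static void sketch_sir_callback( void *context, void *p_data )
{
  (void)p_data;
  ++*(int *)context;
}
```

Without the new case in the attribute switch, a SwitchInfoRecord query
would never reach the registered callback, which is why both the enum
entry and the case statement are part of this patch.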


From halr at voltaire.com  Wed Dec 27 08:46:36 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 11:46:36 -0500
Subject: [openib-general] [PATCH 4/4] osmtest/osmtest.c: Add SA
	SwitchInfoRecord tests
Message-ID: <1167237690.29620.74968.camel@hal.voltaire.com>

osmtest/osmtest.c: Add SA SwitchInfoRecord tests

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index 0ccc06c..eed390b 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -4677,6 +4677,92 @@ osmtest_get_pkeytbl_rec_by_lid( IN osmte
 }
 
 /**********************************************************************
+ * Get SwitchInfo record by LID
+ **********************************************************************/
+ib_api_status_t
+osmtest_get_sw_info_rec_by_lid( IN osmtest_t * const p_osmt,
+                                IN ib_net16_t const  lid,
+                                IN OUT osmtest_req_context_t * const p_context )
+{
+  ib_api_status_t status = IB_SUCCESS;
+  osmv_user_query_t user;
+  osmv_query_req_t req;
+  ib_switch_info_record_t record;
+  ib_mad_t *p_mad;
+
+  OSM_LOG_ENTER( &p_osmt->log, osmtest_get_sw_info_rec_by_lid );
+
+  if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+             "osmtest_get_sw_info_rec_by_lid: "
+             "Getting SwitchInfo record for LID 0x%02X\n",
+             cl_ntoh16( lid ) );
+  }
+
+  /*
+   * Do a blocking query for this record in the subnet.
+   * The result is returned in the result field of the caller's
+   * context structure.
+   *
+   * The query structures are locals.
+   */
+  memset( &req, 0, sizeof( req ) );
+  memset( &user, 0, sizeof( user ) );
+  memset( &record, 0, sizeof( record ) );
+
+  record.lid = lid;
+  p_context->p_osmt = p_osmt;
+  user.comp_mask = IB_SWIR_COMPMASK_LID;
+  user.attr_id = IB_MAD_ATTR_SWITCH_INFO_RECORD;
+  user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) );
+  user.p_attr = &record;
+
+  req.query_type = OSMV_QUERY_USER_DEFINED;
+  req.timeout_ms = p_osmt->opt.transaction_timeout;
+  req.retry_cnt = p_osmt->opt.retry_count;
+
+  req.flags = OSM_SA_FLAGS_SYNC;
+  req.query_context = p_context;
+  req.pfn_query_cb = osmtest_query_res_cb;
+  req.p_query_input = &user;
+  req.sm_key = 0;
+
+  status = osmv_query_sa( p_osmt->h_bind, &req );
+  if( status != IB_SUCCESS )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_get_sw_info_rec_by_lid: ERR 006C: "
+             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    goto Exit;
+  }
+
+  status = p_context->result.status;
+
+  if( status != IB_SUCCESS )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_get_sw_info_rec_by_lid: ERR 006D: "
+             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    if( status == IB_REMOTE_ERROR )
+    {
+      p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw );
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_get_sw_info_rec_by_lid: "
+               "Remote error = %s\n",
+               ib_get_mad_status_str( p_mad ));
+
+      status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK );
+    }
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( &p_osmt->log );
+  return ( status );
+}
+
+/**********************************************************************
  * Get LFT record by LID
  **********************************************************************/
 ib_api_status_t
@@ -5820,6 +5906,17 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  /* SwitchInfo Record tests */
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_get_sw_info_rec_by_lid( p_osmt, 0, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_get_sw_info_rec_by_lid( p_osmt, test_lid, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
   /* LFT Record test */
   memset( &context, 0, sizeof( context ) );
   status = osmtest_get_lft_rec_by_lid( p_osmt, test_lid, &context );
@@ -6169,6 +6266,12 @@ osmtest_validate_against_db( IN osmtest_
     if ( status != IB_SUCCESS )
       goto Exit;
 
+    /* Another SwitchInfo Record test */
+    memset( &context, 0, sizeof( context ) );
+    status = osmtest_get_sw_info_rec_by_lid( p_osmt, test_lid, &context );
+    if ( status != IB_SUCCESS )
+      goto Exit;
+
     /* Another LFT Record test */
     memset( &context, 0, sizeof( context ) );
     status = osmtest_get_lft_rec_by_lid( p_osmt, test_lid, &context );


From jsquyres at cisco.com  Wed Dec 27 09:02:41 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 12:02:41 -0500
Subject: [openib-general] SVN deprecation
Message-ID: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>

I propose "svn rm"'ing unused trees in the SVN repository and leaving  
README files indicating that everything has moved to git (remember:  
everything is still available via the SVN history).  If no one has  
any objections, I'll do this on Friday, 5 Jan 2007.

** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments!   
Otherwise, things may disappear from SVN that you didn't expect.

UNKNOWN whether to keep or remove:
(i.e., they seem to have "recent" development)
==============================================

DEVELOPER  MTIME     PATH
---------  --------  ----------------------------------
dotanb     Dec 2006  /trunk/contrib/mellanox
vlad       Dec 2006  /gen2/trunk/ofed
swise      Oct 2006  /gen2/branches/iwarp
hnguyen    Sep 2006  /trunk/contrib/ibm
amitk      Sep 2006  /gen2/branches/1.0
vlad       Sep 2006  /gen2/branches/ofed_fixes
monil      Sep 2006  /gen2/branches/backport
woody      Sep 2006  /gen2/branches/backport-to-2.6.9
halr       May 2006  /gen2/branches/ibat
mst        Jul 2006  /gen2/branches/mellanox_fixes

KEEP the following:
===================

- /gen2/branches/1.1: by request (Tziporet)

REMOVE the following:
=====================

In short, everything will be removed except what was listed above.   
However, to be explicit, some more entries are listed below.

(*) entries mean "everything except what was already listed above"

Remove these trees based on the fact that they haven't changed in a  
long time:

MTIME     PATH
--------- ------------------------------
Apr 2006  /trunk/contrib/*
Apr 2006  /trunk/branches/*
Apr 2006  /gen2/ulps
Apr 2006  /gen2/branches/*
Mar 2006  /gen2/users
May 2005  /gen1
Jan 2005  /gen2/trunk/arch
Dec 2004  /gen2/utils
Nov 2004  /gen2/trunk/scripts
Jul 2004  /tags
Apr 2004  /trunk/openib

Remove these trees for additional rationale:

- /branches: it's empty
- /gen2/tags: replaced by OFED and git
- /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!)

Comments?

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From sashak at voltaire.com  Wed Dec 27 09:18:13 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 19:18:13 +0200
Subject: [openib-general] [PATCH 2/3] osm: Changes for windows
 compatability
In-Reply-To: <459236F6.8060707@dev.mellanox.co.il>
References: <459236F6.8060707@dev.mellanox.co.il>
Message-ID: <20061227171813.GA11268@sashak.voltaire.com>

Hi Yevgeny,

On 11:03 Wed 27 Dec     , Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Fixing windows compilation problems.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/opensm/osm_ucast_ftree.c |   42 ++++++++++++++++++++++--------------------
>  1 files changed, 22 insertions(+), 20 deletions(-)
> 
> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
> index ba95a0d..054e3c9 100644
> --- a/osm/opensm/osm_ucast_ftree.c
> +++ b/osm/opensm/osm_ucast_ftree.c

[snip..]

> @@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_
>   **
>   ***************************************************/
>  
> -int
> +int OSM_CDECL
>  __osm_ftree_compare_switches_by_index(
>     IN  const void * p1, 
>     IN  const void * p2)

Is this function used somewhere in a global namespace? If not, it
probably should be 'static' and not have the OSM_CDECL attribute. If so,
isn't it cleaner to have OSM_CDECL in the header file, where the function
prototype is located?

> @@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index(
>  
>  /***************************************************/
>  
> -int
> +int OSM_CDECL
>  __osm_ftree_compare_port_groups_by_remote_switch_index(
>     IN  const void * p1, 
>     IN  const void * p2)

Ditto.

Sasha


From sashak at voltaire.com  Wed Dec 27 09:25:29 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Wed, 27 Dec 2006 19:25:29 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
 compatability
In-Reply-To: <4592956F.3020501@dev.mellanox.co.il>
References: <4592956F.3020501@dev.mellanox.co.il>
Message-ID: <20061227172529.GB11268@sashak.voltaire.com>

On 17:46 Wed 27 Dec     , Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Fixing windows compilation problems
> [V2 - Previous patch had an error]
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/include/iba/ib_types.h |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
> index 723e8b9..ec65b64 100644
> --- a/osm/include/iba/ib_types.h
> +++ b/osm/include/iba/ib_types.h
> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>           #define OSM_EXPORT	__declspec(dllimport)
>      #endif
>      #define OSM_API __stdcall
> +    #define OSM_CDECL __cdecl
>  #else
>      #define OSM_EXPORT	extern
>      #define OSM_API
> +    #define OSM_CDECL
>      #define __ptr64
>  #endif

Just wondering, how does the lack of __cdecl hurt Windows compilation (in
the context where __cdecl is used)?

What is the reason to have both __stdcall and __cdecl (and what is the
default)?

Sasha


From halr at voltaire.com  Wed Dec 27 09:17:54 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 12:17:54 -0500
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
	compatability
In-Reply-To: <4592956F.3020501@dev.mellanox.co.il>
References: <4592956F.3020501@dev.mellanox.co.il>
Message-ID: <1167239871.29620.76806.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 10:46, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Fixing windows compilation problems
> [V2 - Previous patch had an error]
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Wed Dec 27 09:18:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 12:18:27 -0500
Subject: [openib-general] [PATCH 2/3] osm: Changes for windows
	compatability
In-Reply-To: <459236F6.8060707@dev.mellanox.co.il>
References: <459236F6.8060707@dev.mellanox.co.il>
Message-ID: <1167239876.29620.76808.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 04:03, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Fixing windows compilation problems.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Wed Dec 27 09:18:33 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 12:18:33 -0500
Subject: [openib-general] [PATCH 3/3] osm: Changes for windows
	compatability
In-Reply-To: <4592374E.7020008@dev.mellanox.co.il>
References: <4592374E.7020008@dev.mellanox.co.il>
Message-ID: <1167239903.29620.76873.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 04:05, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Fixing windows compilation problems.
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Wed Dec 27 09:24:38 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 12:24:38 -0500
Subject: [openib-general] [PATCH] osm: additional check of tree topology
In-Reply-To: <4592958B.7030102@dev.mellanox.co.il>
References: <4592958B.7030102@dev.mellanox.co.il>
Message-ID: <1167240276.29620.77186.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 10:47, Yevgeny Kliteynik wrote:
> Hi Hal
> 
> As we've discussed before - added check for fat-tree topology
> to be at least of rank 2.
> 
> --
> Yevgeny
> 
> Signed-off-by:  Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

-- Hal


From mst at mellanox.co.il  Wed Dec 27 09:26:58 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 19:26:58 +0200
Subject: [openib-general] Old svn repository access
In-Reply-To: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com>
References: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com>
Message-ID: <20061227172658.GB5377@mellanox.co.il>

> What exactly in OFED 1.0 uses the name openib.org -- SVN access?

Yes.

-- 
MST


From halr at voltaire.com  Wed Dec 27 09:32:28 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 27 Dec 2006 12:32:28 -0500
Subject: [openib-general] [PATCH] osm: fat-tree documentation
In-Reply-To: <45929D0B.3090308@dev.mellanox.co.il>
References: <45929D0B.3090308@dev.mellanox.co.il>
Message-ID: <1167240747.29620.77561.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote:
> Hi Hal.
> 
> Added fat-tree routing details and some cosmetics in the txt files.
> 
> --
> Yevgeny
> 
> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>

Thanks. Applied.

A couple of minor questions:

Should similar text as in current-routing.txt be added to the OpenSM man
page ?

Also, rather than HCA in the below, is CA better (to include TCAs as
well) ?

-- Hal


From mst at mellanox.co.il  Wed Dec 27 09:42:45 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 19:42:45 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45927D3A.9030502@voltaire.com>
References: <45927D3A.9030502@voltaire.com>
Message-ID: <20061227174245.GC5377@mellanox.co.il>

> 3rd Eitan/Michael: what is the bigger picture here? what is the 
> dependency between these four patches

In short, [2] is an independent fix to improve tavor performance.
Other things are not directly related. Details below.

> +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
> +2 osm: tavor quirk
> +3 IB/rdmacm: tavor quirk
> +4 IB/ipoib: use appropriate mtu selector for path queries

In the above:
[1] is a bug fix I think. It is not required for [2].
[2] is a feature that improves performance for tavor without need for
    any other stack/ULP changes
[3] is a hack that should have the same effect as [2] for old SMs, but it needs
    manual tuning by the user. If activated, it unfortunately triggers a bug in opensm
    that [1] fixes. So it might not be a good idea after all.
[4] is not strictly necessary, and not related to this patch set -
    it just happens to also play with MTU selector.
    It is a strict compliance cleanup that I just happened to notice when
    I invented [2].

> for example is it correct that:
> 
> if [2] is applied on the SA side then [4] must be applied on ipoib else 
> it will get 1K mtu on its path query?

Not really - ipoib does not actually use the MTU it gets from the path query,
according to spec it uses the bcast group mtu for all packets.

> if [2] is not applied on the SA side, then [3] is useless?

No. If [2] is applied on the SA side, then [3] is unnecessary.


-- 
MST


From jsquyres at cisco.com  Wed Dec 27 09:54:03 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 12:54:03 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <20061227172658.GB5377@mellanox.co.il>
References: <9CAB368F-98E3-46A3-AF20-FD2438F4850C@cisco.com>
	<20061227172658.GB5377@mellanox.co.il>
Message-ID: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com>

Ok.  Does that mean we need to keep OFED 1.0 available in SVN (and  
not "svn rm" it)?  See my mail from earlier today about SVN.


On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote:

>> What exactly in OFED 1.0 uses the name openib.org -- SVN access?
>
> Yes.
>
> -- 
> MST


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From mst at mellanox.co.il  Wed Dec 27 09:55:24 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 19:55:24 +0200
Subject: [openib-general] SVN deprecation
In-Reply-To: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
References: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
Message-ID: <20061227175524.GC6644@mellanox.co.il>

> mst        Jul 2006  /gen2/branches/mellanox_fixes

Remove.

-- 
MST


From mst at mellanox.co.il  Wed Dec 27 09:56:46 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 19:56:46 +0200
Subject: [openib-general] Old svn repository access
In-Reply-To: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com>
References: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com>
Message-ID: <20061227175646.GD6644@mellanox.co.il>

Yes, I think it's a good idea to keep OFED 1.0 around and not
svn rm it.

Quoting r. Jeff Squyres <jsquyres at cisco.com>:
Subject: Re: Old svn repository access

Ok.  Does that mean we need to keep OFED 1.0 available in SVN (and  
not "svn rm" it)?  See my mail from earlier today about SVN.


On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote:

>> What exactly in OFED 1.0 uses the name openib.org -- SVN access?
>
> Yes.
>
> -- 
> MST


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

-- 
MST


From mst at mellanox.co.il  Wed Dec 27 10:06:02 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 20:06:02 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
	compatability
In-Reply-To: <4592956F.3020501@dev.mellanox.co.il>
References: <4592956F.3020501@dev.mellanox.co.il>
Message-ID: <20061227180602.GE6644@mellanox.co.il>

> Hi Hal.
> 
> Fixing windows compilation problems
> [V2 - Previous patch had an error]

I don't think "fixing windows compilation" is a real log description.
What kind of errors? Isn't there a better fix?

> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> ---
>  osm/include/iba/ib_types.h |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
> index 723e8b9..ec65b64 100644
> --- a/osm/include/iba/ib_types.h
> +++ b/osm/include/iba/ib_types.h
> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>           #define OSM_EXPORT	__declspec(dllimport)
>      #endif
>      #define OSM_API __stdcall
> +    #define OSM_CDECL __cdecl
>  #else
>      #define OSM_EXPORT	extern
>      #define OSM_API
> +    #define OSM_CDECL
>      #define __ptr64
>  #endif
 
Why is this necessary at all?
http://msdn2.microsoft.com/en-us/library/zkwh89ks.aspx
	Microsoft Specific
	This is the default calling convention for C and C++ programs.

In other words it's the default, you don't have to declare it.

	Place the __cdecl modifier before a variable or a function name. Because the C
	naming and calling conventions are the default, the only time you need to use
	__cdecl is when you have specified the /Gz (stdcall) or /Gr (fastcall) compiler
	option. The /Gd compiler option forces the __cdecl calling convention.

So why are you compiling with /Gz, after the code is already littered with
OSM_API? And why is OSM_API necessary?

It seems to me the right thing might be to remove all of OSM_API/OSM_CDECL
from code, and just build everything on windows with consistent compiler flags.


-- 
MST


From mst at mellanox.co.il  Wed Dec 27 10:10:48 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Wed, 27 Dec 2006 20:10:48 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
	compatability
In-Reply-To: <1167239871.29620.76806.camel@hal.voltaire.com>
References: <4592956F.3020501@dev.mellanox.co.il>
	<1167239871.29620.76806.camel@hal.voltaire.com>
Message-ID: <20061227181048.GF6644@mellanox.co.il>

> > Hi Hal.
> > 
> > Fixing windows compilation problems
> > [V2 - Previous patch had an error]
> > 
> > Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> Thanks. Applied.

The log is not really informative - shouldn't it say what this fixes?
In this case, it is forcing a specific calling convention on code -
it's a bit more than just "fixing compilation" as it claims.

I'm worried that windows-related patches don't seem to be properly peer-reviewed.
Wouldn't looking things up on msdn before applying windows-related stuff be
a good idea?

-- 
MST


From jsquyres at cisco.com  Wed Dec 27 10:14:09 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 13:14:09 -0500
Subject: [openib-general] SVN deprecation
In-Reply-To: <20061227175524.GC6644@mellanox.co.il>
References: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
	<20061227175524.GC6644@mellanox.co.il>
Message-ID: <CB920D3F-27AD-4A1F-878A-ACA923A7D79B@cisco.com>

So noted -- thanks!

On Dec 27, 2006, at 12:55 PM, Michael S. Tsirkin wrote:

>> mst        Jul 2006  /gen2/branches/mellanox_fixes
>
> Remove.
>
> -- 
> MST


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From jsquyres at cisco.com  Wed Dec 27 10:14:44 2006
From: jsquyres at cisco.com (Jeff Squyres)
Date: Wed, 27 Dec 2006 13:14:44 -0500
Subject: [openib-general] Old svn repository access
In-Reply-To: <20061227175646.GD6644@mellanox.co.il>
References: <7364ED3C-F2C9-4E3F-B15B-9FA3E7E2672B@cisco.com>
	<20061227175646.GD6644@mellanox.co.il>
Message-ID: <F213F4F3-14FC-4FA3-B89C-2AB766B4307E@cisco.com>

On Dec 27, 2006, at 12:56 PM, Michael S. Tsirkin wrote:

> Yes, I think it's a good idea to keep OFED 1.0 around and not
> svn rm it.

So noted -- won't remove.  Thanks!

> Quoting r. Jeff Squyres <jsquyres at cisco.com>:
> Subject: Re: Old svn repository access
>
> Ok.  Does that mean we need to keep OFED 1.0 available in SVN (and
> not "svn rm" it)?  See my mail from earlier today about SVN.
>
>
> On Dec 27, 2006, at 12:26 PM, Michael S. Tsirkin wrote:
>
>>> What exactly in OFED 1.0 uses the name openib.org -- SVN access?
>>
>> Yes.
>>
>> -- 
>> MST
>
>
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
>
> -- 
> MST


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


From mshefty at ichips.intel.com  Wed Dec 27 11:50:33 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 27 Dec 2006 11:50:33 -0800
Subject: [openib-general] IB_CM_REJ_INVALID_SERVICE_ID
In-Reply-To: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com>
References: <200612202222.kBKMMDeY020463@robert.bartonsoftware.com>
Message-ID: <4592CE89.2060005@ichips.intel.com>

Eric Barton wrote:
> Can an rdma_connect be rejected with IB_CM_REJ_INVALID_SERVICE_ID for any other
> reason than the peer isn't listening with the correct service number?

This should only occur if the remote peer isn't listening.  This reject code is 
automatically sent by the ib_cm when a request does not find a corresponding listen.

>>We are testing 1.6b5 for an InfiniBand cluster with RHEL 4. We use the 
>>binaries provided by CFS and use OFED 1.1 as the IB stack.
>>
>>At several times some of the clients hang during fs mount or when an OST 
>>is added (see log).
>>Error:
>>LustreError: 1776:0:(o2iblnd_cb.c:2314:kiblnd_rejected()) 10.0.90.8 at o2ib 
>>rejected: reason 8, size 148

Is this event = 8 and status = 8?

>>
>>from OFED:
>>enum ib_cm_rej_reason {
>>       IB_CM_REJ_INVALID_SERVICE_ID            = 8,
>>
>>Once an IPoIB ping is started to the corresponding OST the client 
>>continues. Afterwards it is quite stable.
> 
> 
> ...which seems to be saying that just doing an IPoIB ping to the server was
> enough to make rdma_connect() work OK.

I can't explain the relationship between the ping and the connect starting to work.

- Sean


From mshefty at ichips.intel.com  Wed Dec 27 12:00:36 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Wed, 27 Dec 2006 12:00:36 -0800
Subject: [openib-general] No resource tracking per qp for multicast
 groups
In-Reply-To: <458FC332.1010801@voltaire.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
	<458FC332.1010801@voltaire.com>
Message-ID: <4592D0E4.2010006@ichips.intel.com>

> Per my understanding the issues you describe here are orthogonal to 
> Sean's multicast work, correct? Were they solved in mthca, or are they still
> open?

This is orthogonal to the multicast module, which tracks joins made to the SA. 
I do not know if this problem was solved however.

- Sean


From kliteyn at dev.mellanox.co.il  Wed Dec 27 13:05:27 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 23:05:27 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
	compatability
In-Reply-To: <20061227180602.GE6644@mellanox.co.il>
References: <4592956F.3020501@dev.mellanox.co.il>
	<20061227180602.GE6644@mellanox.co.il>
Message-ID: <4592E017.1050508@dev.mellanox.co.il>

Michael S. Tsirkin wrote:
>> Hi Hal.
>>
>> Fixing windows compilation problems
>> [V2 - Previous patch had an error]
> 
> I don't think "fixing windows compilation" is a real log description.
> What kind of errors? Isn't there a better fix?
> 
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/include/iba/ib_types.h |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
>> index 723e8b9..ec65b64 100644
>> --- a/osm/include/iba/ib_types.h
>> +++ b/osm/include/iba/ib_types.h
>> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>>           #define OSM_EXPORT	__declspec(dllimport)
>>      #endif
>>      #define OSM_API __stdcall
>> +    #define OSM_CDECL __cdecl
>>  #else
>>      #define OSM_EXPORT	extern
>>      #define OSM_API
>> +    #define OSM_CDECL
>>      #define __ptr64
>>  #endif
>  
> Why is this necessary at all?
> http://msdn2.microsoft.com/en-us/library/zkwh89ks.aspx
> 	Microsoft Specific
> 	This is the default calling convention for C and C++ programs.
> In other words it's the default, you don't have to declare it.
> 
> 	Place the __cdecl modifier before a variable or a function name. Because the C
> 	naming and calling conventions are the default, the only time you need to use
> 	__cdecl is when you have specified the /Gz (stdcall) or /Gr (fastcall) compiler
> 	option. The /Gd compiler option forces the __cdecl calling convention.
> 
> So why are you compiling with /Gz, after the code is already littered with
> OSM_API? And why is OSM_API necessary?

I did see that __cdecl is the default on Windows. However, the compiler complained
that a certain function (more specifically, a comparison function that
is supplied as an argument to qsort()) is defined as __stdcall
instead of __cdecl. As you say, it's probably because of a compilation flag - I
didn't investigate this issue.

> It seems to me the right thing might be to remove all of OSM_API/OSM_CDECL
> from code, and just build everything on windows with consistent compiler flags.

I'll check with the Windows guys why we have such a compilation flag (assuming we
do have it), and whether it can be removed.

-- Yevgeny


From kliteyn at dev.mellanox.co.il  Wed Dec 27 13:30:08 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 23:30:08 +0200
Subject: [openib-general] [PATCH 2/3] osm: Changes for windows
 compatability
In-Reply-To: <20061227171813.GA11268@sashak.voltaire.com>
References: <459236F6.8060707@dev.mellanox.co.il>
	<20061227171813.GA11268@sashak.voltaire.com>
Message-ID: <4592E5E0.2060505@dev.mellanox.co.il>

Hi Sasha.

Sasha Khapyorsky wrote:
> Hi Yevgeny,
> 
> On 11:03 Wed 27 Dec     , Yevgeny Kliteynik wrote:
>> Hi Hal.
>>
>> Fixing windows compilation problems.
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/opensm/osm_ucast_ftree.c |   42 ++++++++++++++++++++++--------------------
>>  1 files changed, 22 insertions(+), 20 deletions(-)
>>
>> diff --git a/osm/opensm/osm_ucast_ftree.c b/osm/opensm/osm_ucast_ftree.c
>> index ba95a0d..054e3c9 100644
>> --- a/osm/opensm/osm_ucast_ftree.c
>> +++ b/osm/opensm/osm_ucast_ftree.c
> 
> [snip..]
> 
>> @@ -226,7 +226,7 @@ typedef struct ftree_fabric_t_
>>   **
>>   ***************************************************/
>>  
>> -int
>> +int OSM_CDECL
>>  __osm_ftree_compare_switches_by_index(
>>     IN  const void * p1, 
>>     IN  const void * p2)
> 
> Is this function used somewhere in a global namespace? If not, it
> probably should be 'static' and not have the OSM_CDECL attribute. If so,
> isn't it cleaner to have OSM_CDECL in the header file, where the function
> prototype is located?
 
The function should be 'static __cdecl'.
I'll check with the windows guys regarding the __cdecl not being default.

>> @@ -247,7 +247,7 @@ __osm_ftree_compare_switches_by_index(
>>  
>>  /***************************************************/
>>  
>> -int
>> +int OSM_CDECL
>>  __osm_ftree_compare_port_groups_by_remote_switch_index(
>>     IN  const void * p1, 
>>     IN  const void * p2)
> 
> Ditto.

Right, same thing here.

Thanks.

--Yevgeny

 
> Sasha
> 


From kliteyn at dev.mellanox.co.il  Wed Dec 27 13:36:40 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Wed, 27 Dec 2006 23:36:40 +0200
Subject: [openib-general] [PATCH 1/3 v2] osm: Changes for windows
 compatability
In-Reply-To: <20061227172529.GB11268@sashak.voltaire.com>
References: <4592956F.3020501@dev.mellanox.co.il>
	<20061227172529.GB11268@sashak.voltaire.com>
Message-ID: <4592E768.3090005@dev.mellanox.co.il>

Sasha Khapyorsky wrote:
> On 17:46 Wed 27 Dec     , Yevgeny Kliteynik wrote:
>> Hi Hal.
>>
>> Fixing windows compilation problems
>> [V2 - Previous patch had an error]
>>
>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> ---
>>  osm/include/iba/ib_types.h |    2 ++
>>  1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
>> index 723e8b9..ec65b64 100644
>> --- a/osm/include/iba/ib_types.h
>> +++ b/osm/include/iba/ib_types.h
>> @@ -59,9 +59,10 @@ BEGIN_C_DECLS
>>           #define OSM_EXPORT	__declspec(dllimport)
>>      #endif
>>      #define OSM_API __stdcall
>> +    #define OSM_CDECL __cdecl
>>  #else
>>      #define OSM_EXPORT	extern
>>      #define OSM_API
>> +    #define OSM_CDECL
>>      #define __ptr64
>>  #endif
> 
> Just wondering, how does the lack of __cdecl hurt Windows compilation (in
> the context where __cdecl is used)?
> What is the reason to have both __stdcall and __cdecl (and what is the
> default)?

Hi Sasha.

__cdecl is the default on Windows. However, the compiler complained
that a certain function (more specifically, a comparison function that
is supplied as an argument to qsort()) is defined as __stdcall
instead of __cdecl. As MST has pointed out, it's probably because of a compilation
flag.

I'll check with the Windows guys why we have such a compilation flag (assuming we
do have it), and whether it can be removed. Same goes for __stdcall - I'm sure
there is some historical reason for having it. The question is whether we still need it.

Thanks.

-- Yevgeny
 
> Sasha
> 


From sashak at voltaire.com  Wed Dec 27 15:09:15 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 28 Dec 2006 01:09:15 +0200
Subject: [openib-general] [PATCH] diags: fix loops handling in ibnetdiscover
Message-ID: <20061227230915.GF11268@sashak.voltaire.com>


This fixes loop cabling and loopback connections handling in
ibnetdiscover.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 diags/src/ibnetdiscover.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c
index 71f6b83..31b7063 100644
--- a/diags/src/ibnetdiscover.c
+++ b/diags/src/ibnetdiscover.c
@@ -338,7 +338,7 @@ handle_port(Node *node, Port *port, ib_portid_t *path, int portnum, int dist)
 		free(remotenode);
 
 		/* Handle loopback plug */
-		if (port->portguid == remoteport->portguid) {
+		if (port->portnum == remoteport->portnum) {
 			free(remoteport);
 			remoteport = port;
 		}
-- 
1.4.4.2.gfc82d


From sashak at voltaire.com  Wed Dec 27 15:10:17 2006
From: sashak at voltaire.com (Sasha Khapyorsky)
Date: Thu, 28 Dec 2006 01:10:17 +0200
Subject: [openib-general] [PATCH] diags: eliminate __WORDSIZE ifdefs for
	printing
Message-ID: <20061227231017.GG11268@sashak.voltaire.com>


Use portable PRIx64 macro in printf format strings instead of using
'#if __WORDSIZE == 64' with printf style functions.

Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>
---
 diags/src/ibnetdiscover.c |   63 +++++--------------------------------
 diags/src/ibroute.c       |   15 ++-------
 diags/src/ibtracert.c     |   74 +++++----------------------------------------
 diags/src/sminfo.c        |    8 +----
 4 files changed, 22 insertions(+), 138 deletions(-)

diff --git a/diags/src/ibnetdiscover.c b/diags/src/ibnetdiscover.c
index 31b7063..0b5078b 100644
--- a/diags/src/ibnetdiscover.c
+++ b/diags/src/ibnetdiscover.c
@@ -213,21 +213,12 @@ dump_endnode(ib_portid_t *path, char *prompt, Node *node, Port *port)
 	if (!dumplevel)
 		return;
 
-#if __WORDSIZE == 64
-	fprintf(f, "%s -> %s %s {%016lx} portnum %d lid %d-%d\"%s\"\n",
+	fprintf(f, "%s -> %s %s {%016" PRIx64 "} portnum %d lid %d-%d\"%s\"\n",
 		portid2str(path), prompt,
 		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 		node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum,
 		port->lid, port->lid + (1 << port->lmc) - 1,
 		clean_nodedesc(node->nodedesc));
-#else
-	fprintf(f, "%s -> %s %s {%016Lx} portnum %d lid %d-%d\"%s\"\n",
-		portid2str(path), prompt,
-		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-		node->nodeguid, node->type == SWITCH_NODE ? 0 : port->portnum,
-		port->lid, port->lid + (1 << port->lmc) - 1,
-		clean_nodedesc(node->nodedesc));
-#endif
 }
 
 #define HASHGUID(guid)		((uint32_t)(((uint32_t)(guid) * 101) ^ ((uint32_t)((guid) >> 32) * 103)))
@@ -265,11 +256,7 @@ link_port(Port *port, Node *node, Port *remoteport)
 		}
 
 	if (dumplevel)
-#if __WORDSIZE == 64
-		fprintf(f, "\t[%d] {%016lx}\n", port->portnum, port->portguid);
-#else
-		fprintf(f, "\t[%d] {%016Lx}\n", port->portnum, port->portguid);
-#endif
+		fprintf(f, "\t[%d] {%016" PRIx64 "}\n", port->portnum, port->portguid);
 
 	DEBUG("inserting new port %p (%d) to node %p", port, port->portnum, node);
 	port->node = node;
@@ -447,13 +434,8 @@ node_name(Node *node)
 {
 	static char buf[256];
 
-#if __WORDSIZE == 64
-	sprintf(buf, "\"%s-%016lx\"",
+	sprintf(buf, "\"%s-%016" PRIx64 "\"",
 		node->type == SWITCH_NODE ? "S" : "H", node->nodeguid);
-#else
-	sprintf(buf, "\"%s-%016Lx\"",
-		node->type == SWITCH_NODE ? "S" : "H", node->nodeguid);
-#endif
 
 	return buf;
 }
@@ -477,17 +459,10 @@ list_node(Node *node)
 		node_type = "???";
 		break;
 	}
-#if __WORDSIZE == 64
-	fprintf(f, "%s\t : 0x%016lx ports %d devid 0x%x vendid 0x%x \"%s\"\n",
-		node_type,
-		node->nodeguid, node->numports, node->devid, node->vendid,
-		clean_nodedesc(node->nodedesc));
-#else
-	fprintf(f, "%s\t : 0x%016Lx ports %d devid 0x%x vendid 0x%x \"%s\"\n",
+	fprintf(f, "%s\t : 0x%016" PRIx64 " ports %d devid 0x%x vendid 0x%x \"%s\"\n",
 		node_type,
 		node->nodeguid, node->numports, node->devid, node->vendid,
 		clean_nodedesc(node->nodedesc));
-#endif
 }
 
 void
@@ -495,11 +470,7 @@ out_ids(Node *node)
 {
 	fprintf(f, "\nvendid=0x%x\ndevid=0x%x\n", node->vendid, node->devid);
 	if (node->sysimgguid)
-#if __WORDSIZE == 64
-		fprintf(f, "sysimgguid=0x%lx\n", node->sysimgguid);
-#else
-		fprintf(f, "sysimgguid=0x%Lx\n", node->sysimgguid);
-#endif
+		fprintf(f, "sysimgguid=0x%" PRIx64 "\n", node->sysimgguid);
 }
 
 void
@@ -514,11 +485,7 @@ out_chassis(Node *node)
 	fprintf(f, "\nChassis %d", node->chrecord->chassisnum);
 	guid = get_chassis_guid(node->chrecord->chassisnum);
 	if (guid) {
-#if __WORDSIZE == 64
-		fprintf(f, " (guid 0x%lx)", guid);
-#else
-		fprintf(f, " (guid 0x%Lx)", guid);
-#endif
+		fprintf(f, " (guid 0x%" PRIx64 ")", guid);
 	}
 	fprintf(f, "\n");
 }
@@ -541,11 +508,7 @@ out_switch(Node *node, int group)
 	}
 
 	out_ids(node);
-#if __WORDSIZE == 64
-	fprintf(f, "%s=0x%lx", "switchguid", node->nodeguid);
-#else
-	fprintf(f, "%s=0x%Lx", "switchguid", node->nodeguid);
-#endif
+	fprintf(f, "switchguid=0x%" PRIx64, node->nodeguid);
 	if (group) {
 		if (node->chrecord) {
 			if (node->chrecord->chassisnum) {
@@ -592,11 +555,7 @@ out_ca(Node *node)
 		node_type2 = "???";
 		break;
 	}
-#if __WORDSIZE == 64
-	fprintf(f, "%s%s=0x%lx\n", node_type, "guid", node->nodeguid);
-#else
-	fprintf(f, "%s%s=0x%Lx\n", node_type, "guid", node->nodeguid);
-#endif
+	fprintf(f, "%sguid=0x%" PRIx64 "\n", node_type, node->nodeguid);
 	fprintf(f, "%s\t%d %s\t\t# %s\n",
 		node_type2, node->numports, node_name(node),
 		clean_nodedesc(node->nodedesc));
@@ -649,11 +608,7 @@ dump_topology(int listtype, int group)
 	if (!listtype) {
 		fprintf(f, "#\n# Topology file: generated on %s#\n", ctime(&t));
 		fprintf(f, "# Max of %d hops discovered\n", maxhops_discovered);
-#if __WORDSIZE == 64
-		fprintf(f, "# Initiated from node %016lx port %016lx\n", mynode->nodeguid, mynode->portguid);
-#else
-		fprintf(f, "# Initiated from node %016Lx port %016Lx\n", mynode->nodeguid, mynode->portguid);
-#endif
+		fprintf(f, "# Initiated from node %016" PRIx64 " port %016" PRIx64 "\n", mynode->nodeguid, mynode->portguid);
 	}
 
 	/* Make pass on switches */
diff --git a/diags/src/ibroute.c b/diags/src/ibroute.c
index f590fdd..8152b6d 100644
--- a/diags/src/ibroute.c
+++ b/diags/src/ibroute.c
@@ -41,6 +41,7 @@
 #include <stdarg.h>
 #include <time.h>
 #include <string.h>
+#include <inttypes.h>
 #include <getopt.h>
 #include <netinet/in.h>
 
@@ -192,13 +193,8 @@ dump_multicast_tables(ib_portid_t *portid, int startlid, int endlid)
 		endlid = IB_MAX_MCAST_LID;
 	}
 
-#if __WORDSIZE == 64
-	printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016lx (%s):\n",
+	printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016" PRIx64 " (%s):\n",
 		startlid, endlid, portid2str(portid), nodeguid, nd);
-#else
-	printf("Multicast mlids [0x%x-0x%x] of switch %s guid 0x%016Lx (%s):\n",
-		startlid, endlid, portid2str(portid), nodeguid, nd);
-#endif
 
 	if (brief)
 		printf(" MLid       Port Mask\n");
@@ -338,13 +334,8 @@ dump_unicast_tables(ib_portid_t *portid, int startlid, int endlid)
 		endlid = IB_MAX_UCAST_LID;
 	}
 
-#if __WORDSIZE == 64
-	printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016lx (%s):\n",
-		startlid, endlid, portid2str(portid), nodeguid, nd);
-#else
-	printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016Lx (%s):\n",
+	printf("Unicast lids [0x%x-0x%x] of switch %s guid 0x%016" PRIx64 " (%s):\n",
 		startlid, endlid, portid2str(portid), nodeguid, nd);
-#endif
 	DEBUG("Switch top is 0x%x\n", top);
 
 	printf("  Lid  Out   Destination\n");
diff --git a/diags/src/ibtracert.c b/diags/src/ibtracert.c
index bfa3d25..e545e9a 100644
--- a/diags/src/ibtracert.c
+++ b/diags/src/ibtracert.c
@@ -214,32 +214,17 @@ dump_endnode(int dump, char *prompt, Node *node, Port *port)
 	if (!dump)
 		return;
 	if (dump == 1) {
-#if __WORDSIZE == 64
-		fprintf(f, "%s {%016lx}[%d]\n",
+		fprintf(f, "%s {%016" PRIx64 "}[%d]\n",
 			prompt, node->nodeguid,
 			node->type == IB_NODE_SWITCH ? 0 : port->portnum);
-#else
-		fprintf(f, "%s {%016Lx}[%d]\n",
-			prompt, node->nodeguid,
-			node->type == IB_NODE_SWITCH ? 0 : port->portnum);
-#endif
 		return;
 	}
-#if __WORDSIZE == 64
-	fprintf(f, "%s %s {%016lx} portnum %d lid 0x%x-0x%x \"%s\"\n",
-		prompt,
-		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-		node->nodeguid, node->type == IB_NODE_SWITCH ? 0 : port->portnum,
-		port->lid, port->lid + (1 << port->lmc) - 1,
-		node->nodedesc);
-#else
-	fprintf(f, "%s %s {%016Lx} portnum %d lid 0x%x-0x%x \"%s\"\n",
+	fprintf(f, "%s %s {%016" PRIx64 "} portnum %d lid 0x%x-0x%x \"%s\"\n",
 		prompt,
 		(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 		node->nodeguid, node->type == IB_NODE_SWITCH ? 0 : port->portnum,
 		port->lid, port->lid + (1 << port->lmc) - 1,
 		node->nodedesc);
-#endif
 }
 
 static void
@@ -247,29 +232,16 @@ dump_route(int dump, Node *node, int outport, Port *port)
 {
 	if (!dump && !verbose)
 		return;
-#if __WORDSIZE == 64
 	if (dump == 1)
-		fprintf(f, "[%d] -> {%016lx}[%d]\n",
+		fprintf(f, "[%d] -> {%016" PRIx64 "}[%d]\n",
 			outport, port->portguid, port->portnum);
 	else
-		fprintf(f, "[%d] -> %s port {%016lx}[%d] lid 0x%x-0x%x \"%s\"\n",
+		fprintf(f, "[%d] -> %s port {%016" PRIx64 "}[%d] lid 0x%x-0x%x \"%s\"\n",
 			outport,
 			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 			port->portguid, port->portnum,
 			port->lid, port->lid + (1 << port->lmc) - 1,
 			node->nodedesc);
-#else
-	if (dump == 1)
-		fprintf(f, "[%d] -> {%016Lx}[%d]\n",
-			outport, port->portguid, port->portnum);
-	else
-		fprintf(f, "[%d] -> %s port {%016Lx}[%d] lid 0x%x-0x%x \"%s\"\n",
-			outport,
-			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-			port->portguid, port->portnum,
-			port->lid, port->lid + (1 << port->lmc) - 1,
-			node->nodedesc);
-#endif
 }
 
 static int
@@ -667,65 +639,35 @@ dump_mcpath(Node *node, int dumplevel)
 		dump_mcpath(node->upnode, dumplevel);
 
 	if (!node->dist) {
-#if __WORDSIZE == 64
-		printf("From %s 0x%lx port %d lid 0x%x-0x%x \"%s\"\n",
-			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-			node->nodeguid, node->ports->portnum, node->ports->lid,
-			node->ports->lid + (1 << node->ports->lmc) - 1,
-			node->nodedesc);
-#else
-		printf("From %s 0x%Lx port %d lid 0x%x-0x%x \"%s\"\n",
+		printf("From %s 0x%" PRIx64 " port %d lid 0x%x-0x%x \"%s\"\n",
 			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 			node->nodeguid, node->ports->portnum, node->ports->lid,
 			node->ports->lid + (1 << node->ports->lmc) - 1,
 			node->nodedesc);
-#endif
 		return;
 	}
 
 	if (node->dist) {
-#if __WORDSIZE == 64
 		if (dumplevel == 1)
-			printf("[%d] -> %s {%016lx}[%d]\n",
+			printf("[%d] -> %s {%016" PRIx64 "}[%d]\n",
 				node->ports->remoteport->portnum,
 				(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 				node->nodeguid, node->upport);
 		else
-			printf("[%d] -> %s 0x%lx[%d] lid 0x%x \"%s\"\n",
+			printf("[%d] -> %s 0x%" PRIx64 "[%d] lid 0x%x \"%s\"\n",
 				node->ports->remoteport->portnum,
 				(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 				node->nodeguid, node->upport,
 				node->ports->lid, node->nodedesc);
-#else
-		if (dumplevel == 1)
-			printf("[%d] -> %s {%016Lx}[%d]\n",
-				node->ports->remoteport->portnum,
-				(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-				node->nodeguid, node->upport);
-		else
-			printf("[%d] -> %s 0x%Lx[%d] lid 0x%x \"%s\"\n",
-				node->ports->remoteport->portnum,
-				(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-				node->nodeguid, node->upport,
-				node->ports->lid, node->nodedesc);
-#endif
 	}
 
 	if (node->dist < 0)
 	/* target node */
-#if __WORDSIZE == 64
-		printf("To %s 0x%lx port %d lid 0x%x-0x%x \"%s\"\n",
+		printf("To %s 0x%" PRIx64 " port %d lid 0x%x-0x%x \"%s\"\n",
 			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
 			node->nodeguid, node->ports->portnum, node->ports->lid,
 			node->ports->lid + (1 << node->ports->lmc) - 1,
 			node->nodedesc);
-#else
-		printf("To %s 0x%Lx port %d lid 0x%x-0x%x \"%s\"\n",
-			(node->type <= IB_NODE_MAX ? node_type_str[node->type] : "???"),
-			node->nodeguid, node->ports->portnum, node->ports->lid,
-			node->ports->lid + (1 << node->ports->lmc) - 1,
-			node->nodedesc);
-#endif
 }
 
 static void
diff --git a/diags/src/sminfo.c b/diags/src/sminfo.c
index 98e2ed7..c01f195 100644
--- a/diags/src/sminfo.c
+++ b/diags/src/sminfo.c
@@ -39,6 +39,7 @@
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdarg.h>
+#include <inttypes.h>
 #include <getopt.h>
 
 #define __BUILD_VERSION_TAG__ 1.1
@@ -218,13 +219,8 @@ main(int argc, char **argv)
 	mad_decode_field(sminfo, IB_SMINFO_PRIO_F, &prio);
 	mad_decode_field(sminfo, IB_SMINFO_STATE_F, &state);
 
-#if __WORDSIZE == 64
-	printf("sminfo: sm lid %d sm guid 0x%lx, activity count %d priority %d state %d %s\n",
+	printf("sminfo: sm lid %d sm guid 0x%" PRIx64 ", activity count %d priority %d state %d %s\n",
 		portid.lid, guid, act, prio, state, STATESTR(state));
-#else
-	printf("sminfo: sm lid %d sm guid 0x%Lx, activity count %d priority %d state %d %s\n",
-		portid.lid, guid, act, prio, state, STATESTR(state));
-#endif
 
 	exit(0);
 }
-- 
1.4.4.2.gfc82d


From eitan at sw053.yok.mtl.com  Wed Dec 27 21:24:47 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Thu, 28 Dec 2006 07:24:47 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-28:normal completion
Message-ID: <200612280524.kBS5OlmA014141@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Wed_Dec_27_12:30:42_2006 61a6c6 
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
Total=374 Pass=374 Fail=0

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmTest IS1-16.topo
42 Multicast IS1-16.topo
41 OsmStress IS1-16.topo
39 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:


From jackm at dev.mellanox.co.il  Thu Dec 28 00:09:32 2006
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Thu, 28 Dec 2006 10:09:32 +0200
Subject: [openib-general] No resource tracking per qp for multicast
 groups
In-Reply-To: <4592D0E4.2010006@ichips.intel.com>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
	<458FC332.1010801@voltaire.com> <4592D0E4.2010006@ichips.intel.com>
Message-ID: <200612281009.32750.jackm@dev.mellanox.co.il>

On Wednesday 27 December 2006 22:00, Sean Hefty wrote:
> I do not know if this problem was solved however.
> 
The problem was solved in ofed 1.1.  See file core/uverbs_main.c, 
procedure ib_uverbs_cleanup_ucontext():

	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
		struct ib_qp *qp = uobj->object;
		struct ib_uqp_object *uqp =
			container_of(uobj, struct ib_uqp_object, uevent.uobject);

		idr_remove_uobj(&ib_uverbs_qp_idr, uobj);
-->		ib_uverbs_detach_umcast(qp, uqp);
		ib_destroy_qp(qp);
		ib_uverbs_release_uevent(file, &uqp->uevent);
		kfree(uqp);
	}

- Jack


From spreggna at novell.com  Thu Dec 28 00:23:07 2006
From: spreggna at novell.com (Preggna S)
Date: Thu, 28 Dec 2006 01:23:07 -0700
Subject: [openib-general] [PATCH][TRIVIAL] srp_tools: trivial log message fix
Message-ID: <4593CC39.9947.00D4.0@novell.com>

Trivial log message fix...

Signed-off-by: Preggna S <spreggna at novell.com>
--

--- src/userspace/srptools/srp_daemon/srp_daemon.c      2006-11-20 11:54:23.000000000 +0530
+++ src_srpt_fixed/userspace/srptools/srp_daemon/srp_daemon.c   2006-12-27 15:40:50.000000000 +0530
@@ -1236,7 +1236,7 @@ int recalc(struct umad_resources *umad_r

        umad_res->sm_lid = strtol(val, NULL, 0);
        if (umad_res->sm_lid == 0) {
-               pr_err("SM LID is 0, maybe no opesm is running\n");
+               pr_err("SM LID is 0, maybe no opensm is running\n");
                return -1;
        }


From dotanb at dev.mellanox.co.il  Thu Dec 28 00:35:38 2006
From: dotanb at dev.mellanox.co.il (Dotan Barak)
Date: Thu, 28 Dec 2006 10:35:38 +0200
Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and restarting
 the driver several times causes kernel oops
Message-ID: <459381DA.7030007@dev.mellanox.co.il>

Hi Sean.

When I enabled rdma_ucm (on the trunk driver) and restarted the
driver several times (using openibd restart), I got a kernel oops.
Here is more info on this issue:

*************************************************************
Host Architecture : x86_64
Linux Distribution: Red Hat Enterprise Linux AS release 4 (Nahant Update 4)
Kernel Version    : 2.6.19-smp
GCC Version       : gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-3)
Memory size       : 4041240 kB
Driver Version    : gen2_devel-20061226-1730
HCA ID(s)         : mthca0
HCA model(s)      : 25218
FW version(s)     : 5.1.940
Board(s)          : MT_0150000001
*************************************************************

Here is the backtrace from /var/log/messages:
Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer dereference at 0000000000000001 RIP:
Dec 27 15:36:25 sw086 kernel:  [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0
Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP
Dec 27 15:36:25 sw086 kernel: CPU 1
Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg ext3 jbd sd_mod
Dec 27 15:36:25 sw086 kernel: Pid: 11363, comm: udev Not tainted 2.6.19-smp #1
Dec 27 15:36:25 sw086 kernel: RIP: 0010:[<0000000000000001>] [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel: RSP: 0018:ffff81012017dec0  EFLAGS: 00010282
Dec 27 15:36:25 sw086 kernel: RAX: 0000000000000002 RBX: ffff810116af9f50 RCX: 0000000000000000
Dec 27 15:36:25 sw086 kernel: RDX: ffffffff80364eea RSI: 00000000ffffffff RDI: ffff81011fdf0a01
Dec 27 15:36:25 sw086 kernel: RBP: ffff81011bf49740 R08: 00000000fffffffb R09: 0000000000000000
Dec 27 15:36:25 sw086 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff81011fdf0a10
Dec 27 15:36:25 sw086 kernel: R13: ffff81012017df50 R14: ffffffff80507f10 R15: ffffffff8826ada0
Dec 27 15:36:25 sw086 kernel: FS:  00002b1118f8cde0(0000) GS:ffff810123477c40(0000) knlGS:0000000000000000
Dec 27 15:36:25 sw086 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 27 15:36:25 sw086 kernel: CR2: 0000000000000001 CR3: 0000000120893000 CR4: 00000000000006e0
Dec 27 15:36:25 sw086 kernel: Process udev (pid: 11363, threadinfo ffff81012017c000, task ffff8101168e7140)
Dec 27 15:36:25 sw086 kernel: Stack:  ffffffff802b5c7a 0000000000000101 0000000000001000 000000000064a970
Dec 27 15:36:25 sw086 kernel:  0000000000001000 0000000000000000 ffff81011b983800 000000000064a970
Dec 27 15:36:25 sw086 kernel:  ffff81012017df50 0000000000615a80 ffffffff80275727 ffff81011b983800
Dec 27 15:36:25 sw086 kernel: Call Trace:
Dec 27 15:36:25 sw086 kernel:  [<ffffffff802b5c7a>] sysfs_read_file+0xaf/0x142
Dec 27 15:36:25 sw086 kernel:  [<ffffffff80275727>] vfs_read+0xd1/0x172
Dec 27 15:36:25 sw086 kernel:  [<ffffffff80275a8d>] sys_read+0x45/0x6e
Dec 27 15:36:25 sw086 kernel:  [<ffffffff8020951e>] system_call+0x7e/0x83
Dec 27 15:36:25 sw086 kernel:
Dec 27 15:36:25 sw086 kernel:
Dec 27 15:36:25 sw086 kernel: Code:  Bad RIP value.
Dec 27 15:36:25 sw086 kernel: RIP  [<0000000000000001>]
Dec 27 15:36:25 sw086 kernel:  RSP <ffff81012017dec0>
Dec 27 15:36:25 sw086 kernel: CR2: 0000000000000001


thanks
Dotan


From ogerlitz at voltaire.com  Thu Dec 28 02:10:02 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:10:02 +0200
Subject: [openib-general] No resource tracking per qp for multicast
 groups
In-Reply-To: <200612281009.32750.jackm@dev.mellanox.co.il>
References: <6AB138A2AB8C8E4A98B9C0C3D52670E38798D3@mtlexch01.mtl.com>
	<458FC332.1010801@voltaire.com> <4592D0E4.2010006@ichips.intel.com>
	<200612281009.32750.jackm@dev.mellanox.co.il>
Message-ID: <459397FA.4000306@voltaire.com>

Jack Morgenstein wrote:
> On Wednesday 27 December 2006 22:00, Sean Hefty wrote:
>> I do not know if this problem was solved however.
>>
> The problem was solved in ofed 1.1.  See file core/uverbs_main.c, 
> procedure ib_uverbs_cleanup_ucontext():
> 
> 	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
> 		struct ib_qp *qp = uobj->object;
> 		struct ib_uqp_object *uqp =
> 			container_of(uobj, struct ib_uqp_object, uevent.uobject);
> 
> 		idr_remove_uobj(&ib_uverbs_qp_idr, uobj);
> -->		ib_uverbs_detach_umcast(qp, uqp);
> 		ib_destroy_qp(qp);
> 		ib_uverbs_release_uevent(file, &uqp->uevent);
> 		kfree(uqp);
> 	}

OK Jack, I see now that your patch fixing this was committed by Linus in
Nov 2005
(http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=f4e401562c11c7ca65592ebd749353cf0b19af7b)

Or.


From ogerlitz at voltaire.com  Thu Dec 28 02:25:13 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:25:13 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4592817B.3030700@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com>
	<4592817B.3030700@mellanox.co.il>
Message-ID: <45939B89.9020305@voltaire.com>

Eitan Zahavi wrote:
> Or Gerlitz wrote:
>>> Assuming the value M represents the lowest MTU on the path
>>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>>> R represents the MTU value in the request. Similarly R-1 is one below 
>>> R and R+1 is one above R.
>>>
>>> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>>> ----------------------------------------------------------------------------------------
>>> UNDEFINED | UNDEFINED | <= M             | M               | min(M,1K)
>>> R         | <         | <= min(R-1, M)   | min(R-1, M)     | min(R-1, M, 1K)
>>> R         | =         | R if M>=R /ERR   | R if M>=R /ERR  | R if M>=R /ERR
>>> R         | >         | R < <= M         | R+1 if M>R /ERR | R+1 if M>R /ERR

>> 1st, maybe it's clear to everyone except me, but what do you mean by 
>> /ERR in the table above? Is it what opensm would return before the 
>> patch you suggested?

> By ERR I mean that the path being evaluated is rejected from being 
> included in the paths group of the response to the provided query.

so when you say

"X if some relation holds on (Y,Z) /ERR"

you mean that it "should return X but if r(Y,Z) holds return no record" 
and this is how the code is written with the patch?

>> 2nd can you post the open sm tavor quirk patch?
>>   
> What do you mean? The old patch introducing the "opensm quirk" mode?
> It is GIT versions: 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by 
> 03e3b3a6fa934202c0f4270a2c69d64ac486b1ca
> or SVN: 9497 followed by 9518

OK, thanks. I guess you mean the SVN trunk, or is it the ofed 1.1 
branch? It would be cool if you could send a pointer to the SVN...

Or.


From ogerlitz at voltaire.com  Thu Dec 28 02:26:28 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:26:28 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <20061227174245.GC5377@mellanox.co.il>
References: <45927D3A.9030502@voltaire.com>
	<20061227174245.GC5377@mellanox.co.il>
Message-ID: <45939BD4.90204@voltaire.com>

Michael S. Tsirkin wrote:
>> 3rd Eitan/Michael: what is the bigger picture here? what is the 
>> dependency between these four patches
> 
> In short, [2] is an independent fix to improve tavor performance.
> Other things are not directly related. Detail below.
> 
>> +1 osm:Fix PathRecord bug MTU/rate/PktLife explicitly ignoring selectors
>> +2 osm: tavor quirk
>> +3 IB/rdmacm: tavor quirk
>> +4 IB/ipoib: use appropriate mtu selector for path queries
> 
> In the above:
> [1] is a bug fix I think. It is not required for [2].
> [2] is a feature that improves performance for tavor without need for
>     any other stack/ULP changes
> [3] is a hack that should have same effect as [2] for old SMs, but it needs
>     manual tuning by user. If activated, it unfortunately triggers a bug in opensm
>     that [1] fixes. So it might not be a good idea after all.
> [4] is not strictly necessary, and not related to this patch set -
>     it just happens to also play with MTU selector.
>     It is a strict compliance cleanup that I just happened to notice when
>     I invented [2].
> 

OK, Michael, thanks for the clarifications.

Or.


From eitan at mellanox.co.il  Thu Dec 28 02:46:03 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Thu, 28 Dec 2006 12:46:03 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <45939B89.9020305@voltaire.com>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com>
	<4592817B.3030700@mellanox.co.il> <45939B89.9020305@voltaire.com>
Message-ID: <4593A06B.4010706@mellanox.co.il>

Or Gerlitz wrote:
> Eitan Zahavi wrote:
>   
>> Or Gerlitz wrote:
>>     
>>>> Assuming the value M represents the lowest MTU on the path
>>>> We denote by M-1 the MTU value one level below M (e.g. 1K if M=2K)
>>>> R represents the MTU value in the request. Similarly R-1 is one below 
>>>> R and R+1 is one above R.
>>>>
>>>> Query-MTU | Query-Sel | Resp by Spec     | OpenSM Should   | OpenSM Quirk w. Tavor End Port
>>>> ----------------------------------------------------------------------------------------
>>>> UNDEFINED | UNDEFINED | <= M             | M               | min(M,1K)
>>>> R         | <         | <= min(R-1, M)   | min(R-1, M)     | min(R-1, M, 1K)
>>>> R         | =         | R if M>=R /ERR   | R if M>=R /ERR  | R if M>=R /ERR
>>>> R         | >         | R < <= M         | R+1 if M>R /ERR | R+1 if M>R /ERR
>>>>         
>
>   
>>> 1st, maybe it's clear to everyone except me, but what do you mean by 
>>> /ERR in the table above? Is it what opensm would return before the 
>>> patch you suggested?
>>>       
>
>   
>> By ERR I mean that the path being evaluated is rejected from being 
>> included in the paths group of the response to the provided query.
>>     
>
> so when you say
>
> "X if some relation holds on (Y,Z) /ERR"
>
> you mean that it "should return X but if r(Y,Z) holds return no record" 
> and this how the code is written with the patch?
>
>   
No:
"R if M>=R /ERR" means:
Return R if M is greater than or equal to R; otherwise this path does not
match the request.

"R+1 if M>R /ERR" means:
Return R+1 if M is greater than R; otherwise this path does not match the request.

If no paths match the request, the response depends on the query method:
For Get(PathRecord) you will get an error.
For GetTable(PathRecord) you will get zero returned records.
For GetMulti(MultiPathRecord) you should get zero returned records.
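
The "OpenSM Should" column of the table above can be sketched roughly as
follows. This is only an illustration, not the OpenSM code: the function
select_path_mtu(), the selector enum and NO_MATCH are names made up for
this sketch, MTU values are the IB levels (1 = 256 bytes ... 5 = 4096
bytes), and returning "no match" for a "<" query that asks for less than
the smallest MTU is an assumption the table does not spell out.

```c
/* IB MTU levels as encoded in a PathRecord: 1 = 256 bytes ... 5 = 4096 bytes. */
enum mtu_level { MTU_256 = 1, MTU_512, MTU_1024, MTU_2048, MTU_4096 };

/* Hypothetical selector names for this sketch; not OpenSM identifiers. */
enum mtu_selector { SEL_UNDEFINED, SEL_LESS, SEL_EQUAL, SEL_GREATER };

/* Marks the "/ERR" case: this path does not match the request. */
#define NO_MATCH (-1)

/*
 * r is the MTU level in the request (R), m is the lowest MTU level on
 * the path (M).  Returns the MTU level to report for this path, or
 * NO_MATCH if the path must be left out of the response.
 */
int select_path_mtu(enum mtu_selector sel, int r, int m)
{
	switch (sel) {
	case SEL_UNDEFINED:
		return m;                       /* M */
	case SEL_LESS:
		if (r <= MTU_256)
			return NO_MATCH;        /* assumption: nothing below 256 */
		return (r - 1 < m) ? r - 1 : m; /* min(R-1, M) */
	case SEL_EQUAL:
		return (m >= r) ? r : NO_MATCH; /* R if M>=R /ERR */
	case SEL_GREATER:
		return (m > r) ? r + 1 : NO_MATCH; /* R+1 if M>R /ERR */
	}
	return NO_MATCH;
}
```

A caller building a GetTable(PathRecord) response would skip every path
for which this returns NO_MATCH, so an empty result simply means that no
path satisfied the selector.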

EZ
>>> 2nd can you post the open sm tavor quirk patch?
>>>   
>>>       
>> What do you mean? The old patch introducing the "opensm quirk" mode?
>> It is GIT versions: 86077144ed956ddb32a0f8d067d5bb00fd564ac6 followed by 
>> 03e3b3a6fa934202c0f4270a2c69d64ac486b1ca
>> or SVN: 9497 followed by 9518
>>     
>
> OK, thanks, i guess you mean to the svn trunk or its the ofed 1.1 
> branch? can be cool if you send a pointer to the SVN...
>   
This is trunk
> Or.


From ogerlitz at voltaire.com  Thu Dec 28 02:50:05 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:50:05 +0200
Subject: [openib-general] tavor quirks etc (opensm compliance etc)
In-Reply-To: <4593A06B.4010706@mellanox.co.il>
References: <4587F6E0.10000@voltaire.com>
	<20061219160221.GE3428@mellanox.co.il> <4588EAB9.6080106@voltaire.com>
	<458E7402.4000106@mellanox.co.il> <45927D3A.9030502@voltaire.com>
	<4592817B.3030700@mellanox.co.il> <45939B89.9020305@voltaire.com>
	<4593A06B.4010706@mellanox.co.il>
Message-ID: <4593A15D.6020608@voltaire.com>

Eitan Zahavi wrote:
> Or Gerlitz wrote:

>> so when you say
>> "X if some relation holds on (Y,Z) /ERR"
>> you mean that it "should return X but if r(Y,Z) holds return no 
>> record" and this how the code is written with the patch?

> No:
> "R if M>=R /ERR" means:
> Return R if M is greater than or equal to R; otherwise this path does
> not match the request.
> 
> "R+1 if M>R /ERR" means:
> Return R+1 if M is greater than R; otherwise this path does not match
> the request.

Got it, thanks.

>> OK, thanks, i guess you mean to the svn trunk or its the ofed 1.1 
>> branch? can be cool if you send a pointer to the SVN...
>>   
> This is trunk

OK

Or.


From ogerlitz at voltaire.com  Thu Dec 28 02:57:33 2006
From: ogerlitz at voltaire.com (Or Gerlitz)
Date: Thu, 28 Dec 2006 12:57:33 +0200
Subject: [openib-general] SVN deprecation
In-Reply-To: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
References: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
Message-ID: <4593A31D.50808@voltaire.com>

Jeff Squyres wrote:
> I propose "svn rm"'ing unused trees in the SVN repository and leaving  
> README files indicating that everything has moved to git (remember:  
> everything is still available via the SVN history).  If no one has  
> any objections, I'll do this on Friday, 5 Jan 2007.

> KEEP the following:
> - /gen2/branches/1.1: by request (Tziporet)

> REMOVE the following:
> - /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!)

I guess you are referring to gen2/trunk/src.

Please, no.

Let's leave these sources, with a README stating they are unmaintained,
along with the gen2/branches/1.1 sources, at least for the dev/release
cycle of OFED 1.2.

Or.


From halr at voltaire.com  Thu Dec 28 06:27:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 09:27:19 -0500
Subject: [openib-general] [PATCH] diags: fix loops handling in
	ibnetdiscover
In-Reply-To: <20061227230915.GF11268@sashak.voltaire.com>
References: <20061227230915.GF11268@sashak.voltaire.com>
Message-ID: <1167316029.29620.142536.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 18:09, Sasha Khapyorsky wrote:
> This fixes loop cabling and loopback connections handling in
> ibnetdiscover.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.

-- Hal


From halr at voltaire.com  Thu Dec 28 06:29:27 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 09:29:27 -0500
Subject: [openib-general] [PATCH] diags: eliminate __WORDSIZE ifdefs for
	printing
In-Reply-To: <20061227231017.GG11268@sashak.voltaire.com>
References: <20061227231017.GG11268@sashak.voltaire.com>
Message-ID: <1167316051.29620.142538.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 18:10, Sasha Khapyorsky wrote:
> Use portable PRIx64 macro in printf format strings instead of using
> '#if __WORDSIZE == 64' with printf style functions.
> 
> Signed-off-by: Sasha Khapyorsky <sashak at voltaire.com>

Thanks. Applied.
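
For anyone following along, the idea in the patch can be shown in
isolation. The sketch below is not taken from the diags sources;
format_guid() is a hypothetical helper. It relies only on the C99
<inttypes.h> PRIx64 macro, which expands to the correct length modifier
for uint64_t on the target, so one format string works where the old
code needed separate %lx and %Lx branches.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: write a 64-bit GUID as "0x" followed by 16
 * zero-padded hex digits.  Adjacent string literals are concatenated
 * by the compiler, so the format below becomes e.g. "0x%016lx" or
 * "0x%016llx" depending on the platform.
 * Returns what snprintf() returns.
 */
int format_guid(char *buf, size_t len, uint64_t guid)
{
	return snprintf(buf, len, "0x%016" PRIx64, guid);
}
```

The same pattern applies directly inside printf()/fprintf() calls, which
is how the patch uses it.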

-- Hal


From halr at voltaire.com  Thu Dec 28 07:06:50 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 10:06:50 -0500
Subject: [openib-general] [PATCH] OpenSM: Remove use of osm_svn_revision.h
Message-ID: <1167318395.29620.144439.camel@hal.voltaire.com>

OpenSM: Remove use of osm_svn_revision.h

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
index 3ef246c..aed60d7 100644
--- a/osm/opensm/Makefile.am
+++ b/osm/opensm/Makefile.am
@@ -10,27 +10,6 @@ DBGFLAGS = -g
 endif
 
 if OSMV_OPENIB
-BUILT_SOURCES = $(srcdir)/../include/opensm/osm_svn_revision.h
-.PHONY: always
-$(srcdir)/../include/opensm/osm_svn_revision.h: always
-	echo -n "#define OSM_SVN_REVISION \"" >$(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
-	if test '!' -d '$(srcdir)/.svn'; then \
-		echo -n Exported revision >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
-	else \
-		svnversion -n $(srcdir)/.. >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
-	fi ; \
-	echo "\"" >> $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
-	if cmp -s $(srcdir)/../include/opensm/osm_svn_revision_new.h \
-		  $(srcdir)/../include/opensm/osm_svn_revision.h ; \
-	then \
-		rm $(srcdir)/../include/opensm/osm_svn_revision_new.h ; \
-	else \
-		mv $(srcdir)/../include/opensm/osm_svn_revision_new.h \
-		   $(srcdir)/../include/opensm/osm_svn_revision.h ; \
-	fi
-endif
-
-if OSMV_OPENIB
 libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT -DDUAL_SIDED_RMPP $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
 else
 libopensm_la_CFLAGS = -Wall $(OSMV_CFLAGS) -DVENDOR_RMPP_SUPPORT $(DBGFLAGS) -D_XOPEN_SOURCE=600 -D_BSD_SOURCE=1
diff --git a/osm/opensm/main.c b/osm/opensm/main.c
index bc916ab..ee09db0 100644
--- a/osm/opensm/main.c
+++ b/osm/opensm/main.c
@@ -54,9 +54,6 @@
 #include <complib/cl_debug.h>
 #include <vendor/osm_vendor_api.h>
 #include <opensm/osm_version.h>
-#ifdef OSM_VENDOR_INTF_OPENIB
-#include <opensm/osm_svn_revision.h>
-#endif
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_console.h>
 
@@ -599,10 +596,6 @@ main(
 
   printf("-------------------------------------------------\n");
   printf("%s\n", OSM_VERSION);
-#if defined ( OSM_VENDOR_INTF_OPENIB )
-  if (strlen(OSM_SVN_REVISION))
-     printf("Based on OpenIB svn %s\n", OSM_SVN_REVISION);
-#endif
 
   osm_subn_set_default_opt(&opt);
   osm_subn_parse_conf_file(&opt);
diff --git a/osm/opensm/osm_opensm.c b/osm/opensm/osm_opensm.c
index 9cac636..0061193 100644
--- a/osm/opensm/osm_opensm.c
+++ b/osm/opensm/osm_opensm.c
@@ -57,9 +57,6 @@
 #include <complib/cl_passivelock.h>
 #include <vendor/osm_vendor_api.h>
 #include <opensm/osm_version.h>
-#ifdef OSM_VENDOR_INTF_OPENIB
-#include <opensm/osm_svn_revision.h>
-#endif
 #include <opensm/osm_base.h>
 #include <opensm/osm_opensm.h>
 #include <opensm/osm_log.h>
@@ -204,33 +201,12 @@ osm_opensm_init(
    if( status != IB_SUCCESS )
       return ( status );
 
-#ifndef OSM_VENDOR_INTF_OPENIB
    /* If there is a log level defined - add the OSM_VERSION to it. */
    osm_log( &p_osm->log,
             osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n",
             OSM_VERSION );
    /* Write the OSM_VERSION to the SYS_LOG */
    osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION );   /* Format Waived */
-#else
-   if (strlen(OSM_SVN_REVISION))
-   {
-      /* If there is a log level defined - add OSM_VERSION and OSM_SVN_REVISION to it. */
-      osm_log( &p_osm->log,
-               osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s OpenIB svn %s\n",
-               OSM_VERSION, OSM_SVN_REVISION );
-      /* Write the OSM_VERSION and OSM_SVN_REVISION to the SYS_LOG */
-      osm_log( &p_osm->log, OSM_LOG_SYS, "%s OpenIB svn %s\n", OSM_VERSION, OSM_SVN_REVISION );   /* Format Waived */
-   }
-   else
-   {
-      /* If there is a log level defined - add the OSM_VERSION to it. */
-      osm_log( &p_osm->log,
-               osm_log_get_level( &p_osm->log ) & ( OSM_LOG_SYS ^ 0xFF ), "%s\n",
-               OSM_VERSION );
-      /* Write the OSM_VERSION to the SYS_LOG */
-      osm_log( &p_osm->log, OSM_LOG_SYS, "%s\n", OSM_VERSION );   /* Format Waived */
-   }
-#endif
 
    osm_log( &p_osm->log, OSM_LOG_FUNCS, "osm_opensm_init: [\n" ); /* Format Waived */
 

From tziporet at dev.mellanox.co.il  Thu Dec 28 07:31:57 2006
From: tziporet at dev.mellanox.co.il (Tziporet Koren)
Date: Thu, 28 Dec 2006 17:31:57 +0200
Subject: [openib-general] SVN deprecation
In-Reply-To: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
References: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
Message-ID: <4593E36D.6020001@dev.mellanox.co.il>

Jeff Squyres wrote:
> I propose "svn rm"'ing unused trees in the SVN repository and leaving  
> README files indicating that everything has moved to git (remember:  
> everything is still available via the SVN history).  If no one has  
> any objections, I'll do this on Friday, 5 Jan 2007.
>
> ** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments!   
> Otherwise, things may disappear from SVN that you didn't expect.
>
> UNKNOWN whether to keep or remove:
> (i.e., they seem to have "recent" development)
> ==============================================
>
> DEVELOPER  MTIME     PATH
> ---------  --------  ----------------------------------
> dotanb     Dec 2006  /trunk/contrib/mellanox
> vlad       Dec 2006  /gen2/trunk/ofed
> swise      Oct 2006  /gen2/branches/iwarp
> hnguyen    Sep 2006  /trunk/contrib/ibm
> amitk      Sep 2006  /gen2/branches/1.0
> vlad       Sep 2006  /gen2/branches/ofed_fixes
> monil      Sep 2006  /gen2/branches/backport
> woody      Sep 2006  /gen2/branches/backport-to-2.6.9
> halr       May 2006  /gen2/branches/ibat
> mst        Jul 2006  /gen2/branches/mellanox_fixes
>
> KEEP the following:
> ===================
>
> - /gen2/branches/1.1: by request (Tziporet)
>
> REMOVE the following:
> =====================
>
> In short, everything will be removed except what was listed above.   
> However, to be explicit, some more entries are listed below.
>
> (*) entries mean "everything except what was already listed above"
>
> Remove these trees based on the fact that they haven't changed in a  
> long time:
>
> MTIME     PATH
> --------- ------------------------------
> Apr 2006  /trunk/contrib/*
> Apr 2006  /trunk/branches/*
> Apr 2006  /gen2/ulps
> Apr 2006  /gen2/branches/*
> Mar 2006  /gen2/users
> May 2005  /gen1
> Jan 2005  /gen2/trunk/arch
> Dec 2004  /gen2/utils
> Nov 2004  /gen2/trunk/scripts
> Jul 2004  /tags
> Apr 2004  /trunk/openib
>
>
>   
There are some important directories under /trunk/contrib/mellanox, so
please don't remove them:
gen1/ib_srpt - this is the SRP target code Mellanox opened. Vu, can you
open a git tree with it instead?
ibtp/
<https://staging.openfabrics.org/svn/openib/trunk/contrib/mellanox/ibtp/>
- these are tests we posted. Dotan, can you create a git tree for the tests?

Please also save gen2/branches/1.0/ since it was used for the 1.0 release.

thanks
Tziporet



From mshefty at ichips.intel.com  Thu Dec 28 09:16:02 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Dec 2006 09:16:02 -0800
Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and
 restarting the driver several times causes kernel oops
In-Reply-To: <459381DA.7030007@dev.mellanox.co.il>
References: <459381DA.7030007@dev.mellanox.co.il>
Message-ID: <4593FBD2.4000109@ichips.intel.com>

Dotan Barak wrote:
> here is the backtrace from the /var/log/messages:
> Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer
> dereference at 0000000000000001 RIP:
> Dec 27 15:36:25 sw086 kernel:  [<0000000000000001>]
> Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0
> Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP
> Dec 27 15:36:25 sw086 kernel: CPU 1
> Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm
> iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_uverbs ib_cm ib_sa
> ib_mad ib_core nfsd exportfs ipv6 parport_pc lp parport autofs4 nfs
> lockd nfs_acl sunrpc dm_mirror dm_mod button battery asus_acpi ac
> uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg ext3 jbd sd_mod

Can you narrow down which module unload is causing the issue?  Is anything using 
the rdma_ucm or ib_uverbs?  Is ib_sdp the first module unloaded?

- Sean


From robert.j.woodruff at intel.com  Thu Dec 28 11:46:57 2006
From: robert.j.woodruff at intel.com (Woodruff, Robert J)
Date: Thu, 28 Dec 2006 11:46:57 -0800
Subject: [openib-general] SVN deprecation
Message-ID: <BAE9DCEF64577A439B3A37F36F9B691C0165F182@orsmsx418.amr.corp.intel.com>

Jeff Squyres wrote: 

> I propose "svn rm"'ing unused trees in the SVN repository and leaving
> README files indicating that everything has moved to git (remember:
> everything is still available via the SVN history).  If no one has
> any objections, I'll do this on Friday, 5 Jan 2007.

Please keep this 

woody      Sep 2006  /gen2/branches/backport-to-2.6.9

until I find out whether anyone is still using the old backport patches
and RPMs.
These were not moved to git, and there are no plans to move them to git.
 
woody


From halr at voltaire.com  Thu Dec 28 13:12:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 28 Dec 2006 16:12:18 -0500
Subject: [openib-general] [PATCH] OpenSM/osm_sa_lft_record.c: In
 __osm_lftr_rcv_by_comp_mask, when BlockNum component is wildcarded,
 fix max_block calculation
Message-ID: <1167340337.29620.163416.camel@hal.voltaire.com>

OpenSM/osm_sa_lft_record.c: In __osm_lftr_rcv_by_comp_mask, when
BlockNum component is wildcarded, fix max_block calculation

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/opensm/osm_sa_lft_record.c b/osm/opensm/osm_sa_lft_record.c
index 7d37074..46bebf2 100644
--- a/osm/opensm/osm_sa_lft_record.c
+++ b/osm/opensm/osm_sa_lft_record.c
@@ -226,7 +226,6 @@ __osm_lftr_rcv_by_comp_mask(
   osm_port_t*               p_port;
   uint16_t                  min_lid_ho, max_lid_ho;
   uint16_t                  min_block, max_block, block;
-  uint16_t                  lids_per_block;
   const osm_physp_t*        p_physp;
 
   /* In switches, the port guid is the node guid. */
@@ -283,10 +282,9 @@ __osm_lftr_rcv_by_comp_mask(
   }
   else
   {
-    /* use as many blocks as possible */
+    /* use as many blocks as "in use" */
     min_block = 0;
-    lids_per_block = osm_fwd_tbl_get_lids_per_block( osm_switch_get_fwd_tbl_ptr( p_sw ) );
-    max_block = (max_lid_ho + lids_per_block - 1)/lids_per_block;
+    max_block = osm_switch_get_max_block_id_in_use(p_sw);
   }
 
   /* so we can add these blocks one by one ... */
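
A note on the arithmetic in the removed lines, as a minimal sketch rather
than OpenSM code. The removed expression rounds max_lid_ho up to a whole
number of blocks, which is not always the same as the index of the block
that holds that LID. The helper names and the fixed value of 64 LIDs per
block are assumptions for illustration only (the original code read the
value from the forwarding table), and the message does not say whether
this difference is what motivated the change.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed block size for this sketch; the patched code queried it at run time. */
#define SKETCH_LIDS_PER_BLOCK 64u

/* Same shape as the removed expression: a ceiling division. */
static uint16_t sketch_max_block_ceiling(uint16_t max_lid_ho)
{
    return (uint16_t)((max_lid_ho + SKETCH_LIDS_PER_BLOCK - 1u) /
                      SKETCH_LIDS_PER_BLOCK);
}

/* Index of the block that actually contains a given LID: a floor division. */
static uint16_t sketch_block_of_lid(uint16_t lid_ho)
{
    return (uint16_t)(lid_ho / SKETCH_LIDS_PER_BLOCK);
}
```

The two agree only when the LID is an exact multiple of the block size.
For LID 100, for example, the ceiling form gives 2 while the LID itself
sits in block 1.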


From Leonid.Grossman at neterion.com  Thu Dec 28 13:24:09 2006
From: Leonid.Grossman at neterion.com (Leonid Grossman)
Date: Thu, 28 Dec 2006 16:24:09 -0500
Subject: [openib-general] one vs. two drivers for an iWARP-capable Ethernet
	NIC
Message-ID: <78C9135A3D2ECE4B8162EBDCE82CAD77010FB433@nekter>

Jeff/Roland/all,
What is the preferred submission driver model for an iWARP-capable
Ethernet NIC - two separate drivers (Ethernet and OpenFabrics) that
interact with each other, or a single driver that supports both
OpenFabrics and Ethernet interfaces?
For our hardware we can go either way, although in the case of separate
drivers the interface between the two would get somewhat artificial...
 
Thanks, Leonid

From Leonid.Grossman at neterion.com  Thu Dec 28 13:31:13 2006
From: Leonid.Grossman at neterion.com (Leonid Grossman)
Date: Thu, 28 Dec 2006 16:31:13 -0500
Subject: [openib-general] one vs. two drivers for an iWARP-capable
	Ethernet NIC
Message-ID: <78C9135A3D2ECE4B8162EBDCE82CAD77010FB436@nekter>

Re-sending as plain text to reach netdev.
Sorry for the extra traffic; please ignore the earlier HTML version of
this e-mail...
------------------------------------------------------------
	
Jeff/Roland/all,
What is the preferred submission driver model for an iWARP-capable
Ethernet NIC - two separate drivers (Ethernet and OpenFabrics) that
interact with each other, or a single driver that supports both
OpenFabrics and Ethernet interfaces?

For our hardware we can go either way, although in the case of separate
drivers the interface between the two would get somewhat artificial...
	 
	Thanks, Leonid


From mshefty at ichips.intel.com  Thu Dec 28 14:46:57 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Dec 2006 14:46:57 -0800
Subject: [openib-general] rdma-dev git tree updated to 2.6.20-rc2
Message-ID: <45944961.3050402@ichips.intel.com>

My git tree has been updated to help support OFED 1.2 testing.


From mshefty at ichips.intel.com  Thu Dec 28 15:23:01 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Dec 2006 15:23:01 -0800
Subject: [openib-general] [PATCH] librdmacm Pass back the status or
 errno in RDMA CM events.
In-Reply-To: <20061215225606.22765.18276.stgit@dell3.ogc.int>
References: <20061215225606.22765.18276.stgit@dell3.ogc.int>
Message-ID: <459451D5.8000705@ichips.intel.com>

Steve Wise wrote:
> The librdmacm code isn't passing back the errno in all events.
> 
> For example, if a connection request times out the kernel CMA will pass
> up event RDMA_CM_EVENT_UNREACHABLE with the status set to -ETIMEDOUT.
> This errno isn't currently passed back to the librdmacm user in the event.

Thanks - committed.

- Sean


From mshefty at ichips.intel.com  Thu Dec 28 15:25:33 2006
From: mshefty at ichips.intel.com (Sean Hefty)
Date: Thu, 28 Dec 2006 15:25:33 -0800
Subject: [openib-general] [PATCH] rdma_cm iWARP connection setup
 timeouts reported as rejects.
In-Reply-To: <20061215225017.22628.17881.stgit@dell3.ogc.int>
References: <20061215225017.22628.17881.stgit@dell3.ogc.int>
Message-ID: <4594526D.1000309@ichips.intel.com>

Steve Wise wrote:
> The IWCM should report timeouts as event RDMA_CM_EVENT_UNREACHABLE,
> not event RDMA_CM_EVENT_REJECTED.
> 
> Signed-off-by: Steve Wise <swise at opengridcomputing.com>

Looks fine to me.  Can we pull this into 2.6.20?

Signed-off-by: Sean Hefty <sean.hefty at intel.com>


From bunk at stusta.de  Thu Dec 28 18:10:09 2006
From: bunk at stusta.de (Adrian Bunk)
Date: Fri, 29 Dec 2006 03:10:09 +0100
Subject: [openib-general] [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c: make
 functions static
In-Reply-To: <20061228024237.375a482f.akpm@osdl.org>
References: <20061228024237.375a482f.akpm@osdl.org>
Message-ID: <20061229021009.GN20714@stusta.de>

On Thu, Dec 28, 2006 at 02:42:37AM -0800, Andrew Morton wrote:
>...
> Changes since 2.6.20-rc1-mm1:
>...
>  git-infiniband.patch
>...
>  git trees
>...


This patch makes some needlessly global functions static.

Signed-off-by: Adrian Bunk <bunk at stusta.de>

---

 drivers/infiniband/ulp/ipoib/ipoib_cm.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

--- linux-2.6.20-rc2-mm1/drivers/infiniband/ulp/ipoib/ipoib_cm.c.old	2006-12-29 01:40:17.000000000 +0100
+++ linux-2.6.20-rc2-mm1/drivers/infiniband/ulp/ipoib/ipoib_cm.c	2006-12-29 01:43:22.000000000 +0100
@@ -56,7 +56,8 @@
 	u32 remote_mtu;
 };
 
-int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event);
+static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
+			       struct ib_cm_event *event);
 
 static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv,
 				  dma_addr_t mapping[IPOIB_CM_RX_SG])
@@ -265,7 +266,8 @@
 	return ret;
 }
 
-int ipoib_cm_rx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+static int ipoib_cm_rx_handler(struct ib_cm_id *cm_id,
+			       struct ib_cm_event *event)
 {
 	struct ipoib_cm_rx *p;
 	struct ipoib_dev_priv *priv;
@@ -396,7 +398,7 @@
 			   "for buf %d\n", wr_id);
 }
 
-void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr)
+static void ipoib_cm_rx_completion(struct ib_cq *cq, void *dev_ptr)
 {
 	struct net_device *dev = (struct net_device *) dev_ptr;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -550,7 +552,7 @@
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 }
 
-void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
+static void ipoib_cm_tx_completion(struct ib_cq *cq, void *tx_ptr)
 {
 	struct ipoib_cm_tx *tx = tx_ptr;
 	int n, i;
@@ -768,7 +770,8 @@
 	return 0;
 }
 
-int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn, struct ib_sa_path_rec *pathrec)
+static int ipoib_cm_tx_init(struct ipoib_cm_tx *p, u32 qpn,
+			    struct ib_sa_path_rec *pathrec)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	int ret;
@@ -841,7 +844,7 @@
 	return ret;
 }
 
-void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
+static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(p->dev);
 	struct ipoib_tx_buf *tx_req;
@@ -875,7 +878,8 @@
 	kfree(p);
 }
 
-int ipoib_cm_tx_handler(struct ib_cm_id *cm_id, struct ib_cm_event *event)
+static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
+			       struct ib_cm_event *event)
 {
 	struct ipoib_cm_tx *tx = cm_id->context;
 	struct ipoib_dev_priv *priv = netdev_priv(tx->dev);
@@ -960,7 +964,7 @@
 	}
 }
 
-void ipoib_cm_tx_start(struct work_struct *work)
+static void ipoib_cm_tx_start(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
 		container_of(work, struct ipoib_dev_priv, cm.start_task);
@@ -1003,7 +1007,7 @@
 	spin_unlock_irqrestore(&priv->tx_lock, flags);
 }
 
-void ipoib_cm_tx_reap(struct work_struct *work)
+static void ipoib_cm_tx_reap(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv =
 		container_of(work, struct ipoib_dev_priv, cm.reap_task);


From eitan at sw053.yok.mtl.com  Thu Dec 28 21:28:56 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Fri, 29 Dec 2006 07:28:56 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-29:normal completion
Message-ID: <200612290528.kBT5SuGS015171@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Thu_Dec_28_12:00:53_2006 298216 
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
Total=351 Pass=351 Fail=0

Pass:
39 Stability IS1-16.topo
39 Pkey IS1-16.topo
39 OsmTest IS1-16.topo
39 OsmStress IS1-16.topo
39 Multicast IS1-16.topo
39 LidMgr IS1-16.topo
13 Stability IS3-loop.topo
13 Stability IS3-128.topo
13 Pkey IS3-128.topo
13 OsmTest IS3-loop.topo
13 OsmTest IS3-128.topo
13 OsmStress IS3-128.topo
13 Multicast IS3-loop.topo
13 Multicast IS3-128.topo
13 LidMgr IS3-128.topo

Failures:


From mst at mellanox.co.il  Thu Dec 28 21:39:30 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Fri, 29 Dec 2006 07:39:30 +0200
Subject: [openib-general] [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c:
 make functions static
In-Reply-To: <20061229021009.GN20714@stusta.de>
References: <20061228024237.375a482f.akpm@osdl.org>
	<20061229021009.GN20714@stusta.de>
Message-ID: <20061229053930.GA4580@mellanox.co.il>

> Quoting Adrian Bunk <bunk at stusta.de>:
> Subject: [-mm patch] infiniband/ulp/ipoib/ipoib_cm.c: make functions static
> 
> On Thu, Dec 28, 2006 at 02:42:37AM -0800, Andrew Morton wrote:
> >...
> > Changes since 2.6.20-rc1-mm1:
> >...
> >  git-infiniband.patch
> >...
> >  git trees
> >...
> 
> 
> This patch makes some needlessly global functions static.
> 
> Signed-off-by: Adrian Bunk <bunk at stusta.de>

Thanks, I'll put this in my tree.

-- 
MST


From halr at voltaire.com  Fri Dec 29 08:26:47 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 11:26:47 -0500
Subject: [openib-general] [PATCH] OpenSM/ib_types.h: Add support for SA
	MFTRecord
Message-ID: <1167409604.29620.225320.camel@hal.voltaire.com>

OpenSM/ib_types.h: Add support for SA MFTRecord

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/iba/ib_types.h b/osm/include/iba/ib_types.h
index 1770f8d..738bd7f 100644
--- a/osm/include/iba/ib_types.h
+++ b/osm/include/iba/ib_types.h
@@ -1283,6 +1283,18 @@ ib_class_is_rmpp(
 #define IB_MAD_ATTR_LFT_RECORD				(CL_NTOH16(0x0015))
 /**********/
 
+/****d* IBA Base: Constants/IB_MAD_ATTR_MFT_RECORD
+* NAME
+*       IB_MAD_ATTR_MFT_RECORD
+*
+* DESCRIPTION
+*       MulticastForwardingTableRecord attribute (15.2.5.8)
+*
+* SOURCE
+*/
+#define IB_MAD_ATTR_MFT_RECORD				(CL_NTOH16(0x0017))
+/**********/
+
 /****d* IBA Base: Constants/IB_MAD_ATTR_PKEYTBL_RECORD
 * NAME
 *	IB_MAD_ATTR_PKEYTBL_RECORD
@@ -2371,6 +2383,13 @@ typedef struct _ib_path_rec
 #define IB_LFTR_COMPMASK_LID              (CL_HTON64(((uint64_t)1)<<0))
 #define IB_LFTR_COMPMASK_BLOCK            (CL_HTON64(((uint64_t)1)<<1))
 
+/* MFT Record Masks */
+#define IB_MFTR_COMPMASK_LID		  (CL_HTON64(((uint64_t)1)<<0))
+#define IB_MFTR_COMPMASK_POSITION	  (CL_HTON64(((uint64_t)1)<<1))
+#define IB_MFTR_COMPMASK_RESERVED1	  (CL_HTON64(((uint64_t)1)<<2))
+#define IB_MFTR_COMPMASK_BLOCK		  (CL_HTON64(((uint64_t)1)<<3))
+#define IB_MFTR_COMPMASK_RESERVED2	  (CL_HTON64(((uint64_t)1)<<4))
+
 /* NodeInfo Record Masks */
 #define IB_NR_COMPMASK_LID                (CL_HTON64(((uint64_t)1)<<0))
 #define IB_NR_COMPMASK_RESERVED1          (CL_HTON64(((uint64_t)1)<<1))
@@ -5530,6 +5549,26 @@ typedef struct _ib_lft_record
 #include <complib/cl_packoff.h>
 /************/
 
+/****s* IBA Base: Types/ib_mft_record_t
+* NAME
+*	ib_mft_record_t
+*
+* DESCRIPTION
+*	IBA defined MulticastForwardingTableRecord (15.2.5.8)
+*
+* SYNOPSIS
+*/
+#include <complib/cl_packon.h>
+typedef struct _ib_mft_record
+{
+	ib_net16_t		lid;
+	ib_net16_t		position_block_num;
+	uint32_t		resv0;
+	ib_net16_t		mft[IB_MCAST_BLOCK_SIZE];
+}	PACK_SUFFIX ib_mft_record_t;
+#include <complib/cl_packoff.h>
+/************/
+
 /****s* IBA Base: Types/ib_switch_info_t
 * NAME
 *	ib_switch_info_t
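
For readers of the new ib_mft_record_t: the position_block_num field
carries two values in one 16-bit word. The sketch below shows one way to
pack and unpack it in host byte order. It assumes the layout usually
given for the MulticastForwardingTableRecord (position in the top 4
bits, 3 reserved bits, block number in the low 9 bits); the macro and
function names are hypothetical and are not part of this patch, and
conversion to network byte order is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout, host byte order (hypothetical names). */
#define SKETCH_MFT_POSITION_SHIFT 12u
#define SKETCH_MFT_POSITION_MASK  0xF000u
#define SKETCH_MFT_BLOCK_MASK     0x01FFu

/* Combine a position (0-15) and a block number (0-511) into one word. */
static uint16_t sketch_mft_pack(uint8_t position, uint16_t block_num)
{
    assert(position <= 15);
    assert(block_num <= 511);
    return (uint16_t)(((uint16_t)position << SKETCH_MFT_POSITION_SHIFT) |
                      block_num);
}

/* Extract the position from the top 4 bits. */
static uint8_t sketch_mft_position(uint16_t position_block_num_ho)
{
    return (uint8_t)((position_block_num_ho & SKETCH_MFT_POSITION_MASK) >>
                     SKETCH_MFT_POSITION_SHIFT);
}

/* Extract the block number from the low 9 bits, ignoring reserved bits. */
static uint16_t sketch_mft_block(uint16_t position_block_num_ho)
{
    return (uint16_t)(position_block_num_ho & SKETCH_MFT_BLOCK_MASK);
}
```

The ranges match the ones stated in the osm_switch.h comments in this
series (block 0-511, position 0-15), and the separate POSITION and BLOCK
component-mask bits above select these two sub-fields independently.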


From halr at voltaire.com  Fri Dec 29 08:34:19 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 11:34:19 -0500
Subject: [openib-general] [PATCH] osm: fat-tree documentation
In-Reply-To: <1167240747.29620.77561.camel@hal.voltaire.com>
References: <45929D0B.3090308@dev.mellanox.co.il>
	<1167240747.29620.77561.camel@hal.voltaire.com>
Message-ID: <1167410047.29620.225730.camel@hal.voltaire.com>

On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote:
> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote:
> > Hi Hal.
> > 
> > Added fat-tree routing details and some cosmetics in the txt files.
> > 
> > --
> > Yevgeny
> > 
> > Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> 
> Thanks. Applied.
> 
> A couple of minor questions:
> 
> Should similar text as in current-routing.txt be added to the OpenSM man
> page ?

I took care of updating the man page to include the fat-tree routing
information you put into current-routing.txt.

The question below is outstanding:

> Also, rather than HCA in the below, is CA better (to include TCAs as
> well) ?

Thanks.

-- Hal

> -- Hal


From halr at voltaire.com  Fri Dec 29 08:39:09 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 11:39:09 -0500
Subject: [openib-general] SVN deprecation
In-Reply-To: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
References: <BE0EC169-5375-4F0F-93A0-5B80B4F32F4A@cisco.com>
Message-ID: <1167410068.29620.225732.camel@hal.voltaire.com>

Hi Jeff,

On Wed, 2006-12-27 at 12:02, Jeff Squyres wrote:
> I propose "svn rm"'ing unused trees in the SVN repository and leaving  
> README files indicating that everything has moved to git (remember:  
> everything is still available via the SVN history).  If no one has  
> any objections, I'll do this on Friday, 5 Jan 2007.
> 
> ** PLEASE READ THE FOLLOWING CAREFULLY and send in your comments!   
> Otherwise, things may disappear from SVN that you didn't expect.
> 
> UNKNOWN whether to keep or remove:
> (i.e., they seem to have "recent" development)
> ==============================================
> 
> DEVELOPER  MTIME     PATH
> ---------  --------  ----------------------------------
> dotanb     Dec 2006  /trunk/contrib/mellanox
> vlad       Dec 2006  /gen2/trunk/ofed
> swise      Oct 2006  /gen2/branches/iwarp
> hnguyen    Sep 2006  /trunk/contrib/ibm
> amitk      Sep 2006  /gen2/branches/1.0
> vlad       Sep 2006  /gen2/branches/ofed_fixes
> monil      Sep 2006  /gen2/branches/backport
> woody      Sep 2006  /gen2/branches/backport-to-2.6.9
> halr       May 2006  /gen2/branches/ibat

This can be removed.

-- Hal

> mst        Jul 2006  /gen2/branches/mellanox_fixes
> 
> KEEP the following:
> ===================
> 
> - /gen2/branches/1.1: by request (Tziporet)
> 
> REMOVE the following:
> =====================
> 
> In short, everything will be removed except what was listed above.   
> However, to be explicit, some more entries are listed below.
> 
> (*) entries mean "everything except what was already listed above"
> 
> Remove these trees based on the fact that they haven't changed in a  
> long time:
> 
> MTIME     PATH
> --------- ------------------------------
> Apr 2006  /trunk/contrib/*
> Apr 2006  /trunk/branches/*
> Apr 2006  /gen2/ulps
> Apr 2006  /gen2/branches/*
> Mar 2006  /gen2/users
> May 2005  /gen1
> Jan 2005  /gen2/trunk/arch
> Dec 2004  /gen2/utils
> Nov 2004  /gen2/trunk/scripts
> Jul 2004  /tags
> Apr 2004  /trunk/openib
> 
> Remove these trees for additional rationale:
> 
> - /branches: it's empty
> - /gen2/tags: replaced by OFED and git
> - /gen2/src: everything should now be in git (*** IS THIS RIGHT?!?!)
> 
> Comments?


From halr at voltaire.com  Fri Dec 29 09:05:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 12:05:00 -0500
Subject: [openib-general] [PATCH 0/4] OpenSM: Add optional SA MFTRecord
	support
Message-ID: <1167411898.29620.227395.camel@hal.voltaire.com>

OpenSM: Add optional SA MFTRecord support

This patch series adds support for the optional SA MFTRecord.

Signed-off-by: Hal Rosenstock <halr at voltaire.com>


From halr at voltaire.com  Fri Dec 29 09:07:15 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 12:07:15 -0500
Subject: [openib-general] [PATCH 1/4] OpenSM/osm_switch.h: Add some missing
 multicast table routines
Message-ID: <1167411902.29620.227397.camel@hal.voltaire.com>

OpenSM/osm_switch.h: Add some missing multicast table routines

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_switch.h b/osm/include/opensm/osm_switch.h
index 32fc547..71b3c8a 100644
--- a/osm/include/opensm/osm_switch.h
+++ b/osm/include/opensm/osm_switch.h
@@ -1054,6 +1054,122 @@ osm_switch_set_mft_block(
 * SEE ALSO
 *********/
 
+/****f* OpenSM: Switch/osm_switch_get_mft_block
+* NAME
+*	osm_switch_get_mft_block
+*
+* DESCRIPTION
+*	Retrieve a block of multicast port masks from the multicast table.
+*
+* SYNOPSIS
+*/
+static inline boolean_t
+osm_switch_get_mft_block(
+	IN osm_switch_t* const p_sw,
+	IN const uint16_t block_num,
+	IN const uint8_t position,
+	OUT ib_net16_t* const p_block )
+{
+	CL_ASSERT( p_sw );
+	return( osm_mcast_tbl_get_block( &p_sw->mcast_tbl,
+			block_num, position, p_block ) );
+}
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to the switch object.
+*
+*	block_num
+*		[in] Block number (0-511) to get.
+*
+*	position
+*		[in] Port mask position (0-15) to get.
+*
+*	p_block
+*		[out] Pointer to the block of port masks stored.
+*
+* RETURN VALUES
+*	Returns true if there are more blocks necessary to 
+*	configure all the MLIDs reachable from this switch.
+*	FALSE otherwise.
+*
+* NOTES
+*
+* SEE ALSO
+*********/
+
+/****f* OpenSM: Switch/osm_switch_get_mft_max_block
+* NAME
+*	osm_switch_get_mft_max_block
+*
+* DESCRIPTION
+*       Get the max_block from the associated multicast table.
+*
+* SYNOPSIS
+*/
+static inline uint16_t
+osm_switch_get_mft_max_block(
+	IN osm_switch_t* const p_sw )
+{
+	CL_ASSERT( p_sw );
+	return( osm_mcast_tbl_get_max_block( &p_sw->mcast_tbl ) );
+}
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to the switch object.
+*
+* RETURN VALUE
+*/
+
+/****f* OpenSM: Switch/osm_switch_get_mft_max_block_in_use
+* NAME
+*	osm_switch_get_mft_max_block_in_use
+*
+* DESCRIPTION
+*	Get the max_block_in_use from the associated multicast table.
+*
+* SYNOPSIS
+*/
+static inline uint16_t
+osm_switch_get_mft_max_block_in_use(
+	IN osm_switch_t* const p_sw )
+{
+	CL_ASSERT( p_sw );
+	return( osm_mcast_tbl_get_max_block_in_use( &p_sw->mcast_tbl ) );
+}
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to the switch object.
+*
+* RETURN VALUE
+*/
+
+/****f* OpenSM: Switch/osm_switch_get_mft_max_position
+* NAME
+*	osm_switch_get_mft_max_position
+*
+* DESCRIPTION
+*       Get the max_position from the associated multicast table.
+*
+* SYNOPSIS
+*/
+static inline uint8_t
+osm_switch_get_mft_max_position(
+	IN osm_switch_t* const p_sw )
+{
+	CL_ASSERT( p_sw );
+	return( osm_mcast_tbl_get_max_position( &p_sw->mcast_tbl ) );
+}
+/*
+* PARAMETERS
+*	p_sw
+*		[in] Pointer to the switch object.
+*
+* RETURN VALUE
+*/
+
 /****f* OpenSM: Switch/osm_switch_recommend_path
 * NAME
 *	osm_switch_recommend_path
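
As background for the block_num and position arguments above, here is a
minimal sketch of how a multicast LID and a port number map onto a block,
an entry within the block, a position, and a bit within a 16-bit port
mask. It assumes 32 entries per block, 16 ports per mask, and a
multicast LID base of 0xC000; the names are hypothetical and this is not
the OpenSM implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for this sketch (hypothetical names). */
#define SKETCH_MLID_BASE        0xC000u
#define SKETCH_MCAST_BLOCK_SIZE 32u
#define SKETCH_PORTS_PER_MASK   16u

typedef struct {
    uint16_t block_num;  /* which block of the table */
    uint8_t  index;      /* entry within that block */
} sketch_mlid_slot_t;

typedef struct {
    uint8_t  position;   /* which group of 16 ports */
    uint16_t bit;        /* bit for the port inside the 16-bit mask */
} sketch_port_slot_t;

/* Locate the table entry for a multicast LID (host byte order). */
static sketch_mlid_slot_t sketch_locate_mlid(uint16_t mlid_ho)
{
    sketch_mlid_slot_t slot;
    uint16_t offset;

    assert(mlid_ho >= SKETCH_MLID_BASE);
    offset = (uint16_t)(mlid_ho - SKETCH_MLID_BASE);
    slot.block_num = (uint16_t)(offset / SKETCH_MCAST_BLOCK_SIZE);
    slot.index = (uint8_t)(offset % SKETCH_MCAST_BLOCK_SIZE);
    return slot;
}

/* Locate the mask position and bit for a switch port number. */
static sketch_port_slot_t sketch_locate_port(uint8_t port_num)
{
    sketch_port_slot_t slot;

    slot.position = (uint8_t)(port_num / SKETCH_PORTS_PER_MASK);
    slot.bit = (uint16_t)(1u << (port_num % SKETCH_PORTS_PER_MASK));
    return slot;
}
```

Under these assumptions, MLID 0xC021 lands in block 1 at entry 1, and
port 17 is bit 1 of the mask at position 1, so a full picture of one
block for a large switch takes one retrieval per position.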


From halr at voltaire.com  Fri Dec 29 09:11:16 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 12:11:16 -0500
Subject: [openib-general] [PATCH 2/4] OpenSM: Add optional SA MFTRecord
	support
Message-ID: <1167412270.29620.227738.camel@hal.voltaire.com>

OpenSM: Add optional SA MFTRecord support

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/opensm/osm_sa_mft_record.h b/osm/include/opensm/osm_sa_mft_record.h
new file mode 100644
index 0000000..f961206
--- /dev/null
+++ b/osm/include/opensm/osm_sa_mft_record.h
@@ -0,0 +1,280 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * 	Declaration of osm_mftr_rcv_t.
+ *	This object represents the MulticastForwardingTable Receiver object,
+ *	which receives the MulticastForwardingTable attribute from a switch node.
+ *	This object is part of the OpenSM family of objects.
+ *
+ * Environment:
+ * 	Linux User Mode
+ *
+ */
+
+#ifndef _OSM_MFTR_H_
+#define _OSM_MFTR_H_
+
+#include <complib/cl_passivelock.h>
+#include <complib/cl_qlist.h>
+#include <opensm/osm_base.h>
+#include <opensm/osm_madw.h>
+#include <opensm/osm_sa_response.h>
+#include <opensm/osm_subnet.h>
+#include <opensm/osm_stats.h>
+#include <opensm/osm_log.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/****h* OpenSM/Multicast Forwarding Table Receiver
+* NAME
+*	Multicast Forwarding Table Receiver
+*
+* DESCRIPTION
+*	The Multicast Forwarding Table Receiver object encapsulates the information
+*	needed to receive the MulticastForwardingTable attribute from a switch node.
+*
+*	The Multicast Forwarding Table Receiver object is thread safe.
+*
+*	This object should be treated as opaque and should be
+*	manipulated only through the provided functions.
+*
+* AUTHOR
+*	Hal Rosenstock, Voltaire
+*
+*********/
+
+/****s* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_t
+* NAME
+*	osm_mftr_rcv_t
+*
+* DESCRIPTION
+*	Multicast Forwarding Table Receiver structure.
+*
+*	This object should be treated as opaque and should
+*	be manipulated only through the provided functions.
+*
+* SYNOPSIS
+*/
+typedef struct _osm_mft
+{
+  osm_subn_t*					p_subn;
+  osm_stats_t*					p_stats;
+  osm_sa_resp_t*				p_resp;
+  osm_mad_pool_t*				p_mad_pool;
+  osm_log_t*					p_log;
+  cl_plock_t*					p_lock;
+  cl_qlock_pool_t				pool;
+} osm_mftr_rcv_t;
+/*
+* FIELDS
+*	p_subn
+*		Pointer to the Subnet object for this subnet.
+*
+*	p_stats
+*		Pointer to the statistics.
+*
+*	p_resp
+*		Pointer to the SA responder.
+*
+*	p_mad_pool
+*		Pointer to the mad pool.
+*
+*	p_log
+*		Pointer to the log object.
+*
+*	p_lock
+*		Pointer to the serializing lock.
+*
+*	pool
+*		Pool of linkable Multicast Forwarding Table Record objects used to
+*               generate the query response.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receiver object
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_construct
+* NAME
+*	osm_mftr_rcv_construct
+*
+* DESCRIPTION
+*	This function constructs a Multicast Forwarding Table Receiver object.
+*
+* SYNOPSIS
+*/
+void osm_mftr_rcv_construct(
+	IN osm_mftr_rcv_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to a Multicast Forwarding Table Receiver object to construct.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Allows calling osm_mftr_rcv_init, osm_mftr_rcv_destroy
+*
+*	Calling osm_mftr_rcv_construct is a prerequisite to calling any other
+*	method except osm_mftr_rcv_init.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receiver object, osm_mftr_rcv_init, 
+*  osm_mftr_rcv_destroy
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_destroy
+* NAME
+*	osm_mftr_rcv_destroy
+*
+* DESCRIPTION
+*	The osm_mftr_rcv_destroy function destroys the object, releasing
+*	all resources.
+*
+* SYNOPSIS
+*/
+void osm_mftr_rcv_destroy(
+	IN osm_mftr_rcv_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to the object to destroy.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Performs any necessary cleanup of the specified
+*	Multicast Forwarding Table Receiver object.
+*	Further operations should not be attempted on the destroyed object.
+*	This function should only be called after a call to
+*	osm_mftr_rcv_construct or osm_mftr_rcv_init.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receiver object, osm_mftr_rcv_construct,
+*	osm_mftr_rcv_init
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_init
+* NAME
+*	osm_mftr_rcv_init
+*
+* DESCRIPTION
+*	The osm_mftr_rcv_init function initializes a
+*	Multicast Forwarding Table Receiver object for use.
+*
+* SYNOPSIS
+*/
+ib_api_status_t osm_mftr_rcv_init(
+	IN osm_mftr_rcv_t* const p_rcv,
+	IN osm_sa_resp_t* const p_resp,
+	IN osm_mad_pool_t* const p_mad_pool,
+	IN osm_subn_t* const p_subn,
+	IN osm_log_t* const p_log,
+	IN cl_plock_t* const p_lock );
+/*
+* PARAMETERS
+*	p_rcv
+*		[in] Pointer to an osm_mftr_rcv_t object to initialize.
+*
+*	p_resp
+*		[in] Pointer to an osm_sa_resp_t object.
+*
+*	p_subn
+*		[in] Pointer to the Subnet object for this subnet.
+*
+*	p_log
+*		[in] Pointer to the log object.
+*
+*	p_lock
+*		[in] Pointer to the OpenSM serializing lock.
+*
+* RETURN VALUES
+*	CL_SUCCESS if the Multicast Forwarding Table Receiver object was initialized
+*	successfully.
+*
+* NOTES
+*	Allows calling other Multicast Forwarding Table Receiver methods.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receiver object, osm_mftr_rcv_construct,
+*	osm_mftr_rcv_destroy
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receiver/osm_mftr_rcv_process
+* NAME
+*	osm_mftr_rcv_process
+*
+* DESCRIPTION
+*	Process the MulticastForwardingTable attribute.
+*
+* SYNOPSIS
+*/
+void osm_mftr_rcv_process(
+	IN osm_mftr_rcv_t* const p_ctrl,
+	IN const osm_madw_t* const p_madw );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_mftr_rcv_t object.
+*
+*	p_madw
+*		[in] Pointer to the MAD Wrapper containing the MAD
+*		that contains the switch node's MulticastForwardingTable attribute.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	This function processes a MulticastForwardingTable attribute.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receiver, Multicast Forwarding Table Receive
+*	Controller
+*********/
+
+END_C_DECLS
+
+#endif	/* _OSM_MFTR_H_ */
diff --git a/osm/include/opensm/osm_sa_mft_record_ctrl.h b/osm/include/opensm/osm_sa_mft_record_ctrl.h
new file mode 100644
index 0000000..a28374d
--- /dev/null
+++ b/osm/include/opensm/osm_sa_mft_record_ctrl.h
@@ -0,0 +1,231 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ * 	Declaration of osm_mftr_rcv_ctrl_t.
+ *	This object represents a controller that receives the IBA
+ *	MulticastForwardingTable attribute from a switch.
+ *	This object is part of the OpenSM family of objects.
+ *
+ * Environment:
+ * 	Linux User Mode
+ *
+ */
+
+#ifndef _OSM_MFTR_RCV_CTRL_H_
+#define _OSM_MFTR_RCV_CTRL_H_
+
+#include <complib/cl_dispatcher.h>
+#include <opensm/osm_base.h>
+#include <opensm/osm_madw.h>
+#include <opensm/osm_log.h>
+#include <opensm/osm_sa_mft_record.h>
+
+#ifdef __cplusplus
+#  define BEGIN_C_DECLS extern "C" {
+#  define END_C_DECLS   }
+#else /* !__cplusplus */
+#  define BEGIN_C_DECLS
+#  define END_C_DECLS
+#endif /* __cplusplus */
+
+BEGIN_C_DECLS
+
+/****h* OpenSM/Multicast Forwarding Table Receive Controller
+* NAME
+*	Multicast Forwarding Table Receive Controller
+*
+* DESCRIPTION
+*	The Multicast Forwarding Table Receive Controller object encapsulates
+*	the information needed to receive the MulticastForwardingTable attribute
+*	from a switch node.
+*
+*	The Multicast Forwarding Table Receive Controller object is thread safe.
+*
+*	This object should be treated as opaque and should be
+*	manipulated only through the provided functions.
+*
+* AUTHOR
+*	Hal Rosenstock, Voltaire
+*
+*********/
+
+/****s* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_t
+* NAME
+*	osm_mftr_rcv_ctrl_t
+*
+* DESCRIPTION
+*	Multicast Forwarding Table Receive Controller structure.
+*
+*	This object should be treated as opaque and should
+*	be manipulated only through the provided functions.
+*
+* SYNOPSIS
+*/
+typedef struct _osm_mftr_rcv_ctrl
+{
+	osm_mftr_rcv_t			*p_rcv;
+	osm_log_t			*p_log;
+	cl_dispatcher_t			*p_disp;
+	cl_disp_reg_handle_t		h_disp;
+} osm_mftr_rcv_ctrl_t;
+/*
+* FIELDS
+*	p_rcv
+*		Pointer to the Multicast Forwarding Table Receiver object.
+*
+*	p_log
+*		Pointer to the log object.
+*
+*	p_disp
+*		Pointer to the Dispatcher.
+*
+*	h_disp
+*		Handle returned from dispatcher registration.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receive Controller object
+*	Multicast Forwarding Table Receiver object
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_construct
+* NAME
+*	osm_mftr_rcv_ctrl_construct
+*
+* DESCRIPTION
+*	This function constructs a Multicast Forwarding Table Receive
+*	Controller object.
+*
+* SYNOPSIS
+*/
+void osm_mftr_rcv_ctrl_construct(
+	IN osm_mftr_rcv_ctrl_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to a Multicast Forwarding Table Receive Controller
+*		object to construct.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Allows calling osm_mftr_rcv_ctrl_init, osm_mftr_rcv_ctrl_destroy
+*
+*	Calling osm_mftr_rcv_ctrl_construct is a prerequisite to calling any other
+*	method except osm_mftr_rcv_ctrl_init.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receive Controller object, osm_mftr_rcv_ctrl_init,
+*	osm_mftr_rcv_ctrl_destroy
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_destroy
+* NAME
+*	osm_mftr_rcv_ctrl_destroy
+*
+* DESCRIPTION
+*	The osm_mftr_rcv_ctrl_destroy function destroys the object, releasing
+*	all resources.
+*
+* SYNOPSIS
+*/
+void osm_mftr_rcv_ctrl_destroy(
+	IN osm_mftr_rcv_ctrl_t* const p_ctrl );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to the object to destroy.
+*
+* RETURN VALUE
+*	This function does not return a value.
+*
+* NOTES
+*	Performs any necessary cleanup of the specified
+*	Multicast Forwarding Table Receive Controller object.
+*	Further operations should not be attempted on the destroyed object.
+*	This function should only be called after a call to
+*	osm_mftr_rcv_ctrl_construct or osm_mftr_rcv_ctrl_init.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receive Controller object, osm_mftr_rcv_ctrl_construct,
+*	osm_mftr_rcv_ctrl_init
+*********/
+
+/****f* OpenSM: Multicast Forwarding Table Receive Controller/osm_mftr_rcv_ctrl_init
+* NAME
+*	osm_mftr_rcv_ctrl_init
+*
+* DESCRIPTION
+*	The osm_mftr_rcv_ctrl_init function initializes a
+*	Multicast Forwarding Table Receive Controller object for use.
+*
+* SYNOPSIS
+*/
+ib_api_status_t osm_mftr_rcv_ctrl_init(
+	IN osm_mftr_rcv_ctrl_t* const p_ctrl,
+	IN osm_mftr_rcv_t* const p_rcv,
+	IN osm_log_t* const p_log,
+	IN cl_dispatcher_t* const p_disp );
+/*
+* PARAMETERS
+*	p_ctrl
+*		[in] Pointer to an osm_mftr_rcv_ctrl_t object to initialize.
+*
+*	p_rcv
+*		[in] Pointer to an osm_mftr_rcv_t object.
+*
+*	p_log
+*		[in] Pointer to the log object.
+*
+*	p_disp
+*		[in] Pointer to the OpenSM central Dispatcher.
+*
+* RETURN VALUES
+*	IB_SUCCESS if the Multicast Forwarding Table Receive Controller object
+*	was initialized successfully.
+*
+* NOTES
+*	Allows calling other Multicast Forwarding Table Receive Controller methods.
+*
+* SEE ALSO
+*	Multicast Forwarding Table Receive Controller object,
+*	osm_mftr_rcv_ctrl_construct, osm_mftr_rcv_ctrl_destroy
+*********/
+
+END_C_DECLS
+
+#endif	/* _OSM_MFTR_RCV_CTRL_H_ */
diff --git a/osm/opensm/osm_sa_mft_record.c b/osm/opensm/osm_sa_mft_record.c
new file mode 100644
index 0000000..a415fb9
--- /dev/null
+++ b/osm/opensm/osm_sa_mft_record.c
@@ -0,0 +1,540 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Implementation of osm_mftr_rcv_t.
+ *   This object represents the MulticastForwardingTable Receiver object.
+ *   This object is part of the opensm family of objects.
+ *
+ * Environment:
+ *    Linux User Mode
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <string.h>
+#include <iba/ib_types.h>
+#include <complib/cl_debug.h>
+#include <complib/cl_qlist.h>
+#include <opensm/osm_sa_mft_record.h>
+#include <opensm/osm_switch.h>
+#include <vendor/osm_vendor_api.h>
+#include <opensm/osm_helper.h>
+#include <opensm/osm_pkey.h>
+
+#define OSM_MFTR_RCV_POOL_MIN_SIZE      32
+#define OSM_MFTR_RCV_POOL_GROW_SIZE     32
+
+typedef   struct _osm_mftr_item
+{
+  cl_pool_item_t          pool_item;
+  ib_mft_record_t         rec;
+} osm_mftr_item_t;
+
+typedef   struct _osm_mftr_search_ctxt
+{
+  const ib_mft_record_t*      p_rcvd_rec;
+  ib_net64_t                  comp_mask;
+  cl_qlist_t*                 p_list;
+  osm_mftr_rcv_t*             p_rcv;
+  const osm_physp_t*          p_req_physp;
+} osm_mftr_search_ctxt_t;
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_mftr_rcv_construct(
+  IN osm_mftr_rcv_t* const p_rcv )
+{
+  memset( p_rcv, 0, sizeof(*p_rcv) );
+  cl_qlock_pool_construct( &p_rcv->pool );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_mftr_rcv_destroy(
+  IN osm_mftr_rcv_t* const p_rcv )
+{
+  OSM_LOG_ENTER( p_rcv->p_log, osm_mftr_rcv_destroy );
+  cl_qlock_pool_destroy( &p_rcv->pool );
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t
+osm_mftr_rcv_init(
+  IN osm_mftr_rcv_t* const p_rcv,
+  IN osm_sa_resp_t*  const p_resp,
+  IN osm_mad_pool_t* const p_mad_pool,
+  IN osm_subn_t*     const p_subn,
+  IN osm_log_t*      const p_log,
+  IN cl_plock_t*     const p_lock )
+{
+  ib_api_status_t            status;
+
+  OSM_LOG_ENTER( p_log, osm_mftr_rcv_init );
+
+  osm_mftr_rcv_construct( p_rcv );
+
+  p_rcv->p_log = p_log;
+  p_rcv->p_subn = p_subn;
+  p_rcv->p_lock = p_lock;
+  p_rcv->p_resp = p_resp;
+  p_rcv->p_mad_pool = p_mad_pool;
+
+  status = cl_qlock_pool_init( &p_rcv->pool,
+                               OSM_MFTR_RCV_POOL_MIN_SIZE,
+                               0,
+                               OSM_MFTR_RCV_POOL_GROW_SIZE,
+                               sizeof(osm_mftr_item_t),
+                               NULL, NULL, NULL );
+
+  OSM_LOG_EXIT( p_log );
+  return( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static ib_api_status_t
+__osm_mftr_rcv_new_mftr(
+  IN osm_mftr_rcv_t*         const p_rcv,
+  IN osm_switch_t*           const p_sw,
+  IN cl_qlist_t*             const p_list,
+  IN ib_net16_t              const lid,
+  IN uint16_t                const block,
+  IN uint8_t                 const position )
+{
+  osm_mftr_item_t*           p_rec_item;
+  ib_api_status_t            status = IB_SUCCESS;
+  uint16_t                   position_block_num;
+
+  OSM_LOG_ENTER( p_rcv->p_log, __osm_mftr_rcv_new_mftr );
+
+  p_rec_item = (osm_mftr_item_t*)cl_qlock_pool_get( &p_rcv->pool );
+  if( p_rec_item == NULL )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_mftr_rcv_new_mftr: ERR 4A02: "
+             "cl_qlock_pool_get failed\n" );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
+  if( osm_log_is_active( p_rcv->p_log, OSM_LOG_DEBUG ) )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_mftr_rcv_new_mftr: "
+             "New MulticastForwardingTable: sw 0x%016" PRIx64
+             "\n\t\t\t\tblock %u position %u lid 0x%02X\n",
+             cl_ntoh64( osm_node_get_node_guid( p_sw->p_node ) ),
+             block, position, cl_ntoh16( lid )
+             );
+  }
+
+  position_block_num = ((uint16_t)position << 12) |
+			(block & IB_MCAST_BLOCK_ID_MASK_HO);
+
+  memset( &p_rec_item->rec, 0, sizeof(ib_mft_record_t) );
+
+  p_rec_item->rec.lid = lid;
+  p_rec_item->rec.position_block_num = cl_hton16( position_block_num );
+
+  /* copy the mft block */
+  osm_switch_get_mft_block( p_sw, block, position, p_rec_item->rec.mft );
+
+  cl_qlist_insert_tail( p_list, (cl_list_item_t*)&p_rec_item->pool_item );
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+  return( status );
+}
+
+/**********************************************************************
+ **********************************************************************/
+static osm_port_t*
+__osm_mftr_get_port_by_guid(
+  IN osm_mftr_rcv_t*  const p_rcv,
+  IN uint64_t         port_guid )
+{
+  osm_port_t*         p_port;
+
+  CL_PLOCK_ACQUIRE(p_rcv->p_lock);
+
+  p_port = (osm_port_t *)cl_qmap_get(&p_rcv->p_subn->port_guid_tbl,
+                                     port_guid);
+  if (p_port == (osm_port_t *)cl_qmap_end(&p_rcv->p_subn->port_guid_tbl))
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_mftr_get_port_by_guid: ERR 4A04: "
+             "Invalid port GUID 0x%016" PRIx64 "\n",
+             cl_ntoh64( port_guid ) );
+    p_port = NULL;
+  }
+
+  CL_PLOCK_RELEASE(p_rcv->p_lock);
+  return p_port;
+}
+
+/**********************************************************************
+ **********************************************************************/
+static void
+__osm_mftr_rcv_by_comp_mask(
+  IN cl_map_item_t*         const p_map_item,
+  IN void*                  context )
+{
+  const osm_mftr_search_ctxt_t* const p_ctxt =
+    (osm_mftr_search_ctxt_t *)context;
+  osm_switch_t*             const p_sw = (osm_switch_t*)p_map_item;
+  const ib_mft_record_t*    const p_rcvd_rec = p_ctxt->p_rcvd_rec;
+  osm_mftr_rcv_t*           const p_rcv = p_ctxt->p_rcv;
+  ib_net64_t                const comp_mask = p_ctxt->comp_mask;
+  const osm_physp_t*        const p_req_physp = p_ctxt->p_req_physp;
+  osm_port_t*               p_port;
+  uint16_t                  min_lid_ho, max_lid_ho;
+  uint16_t                  position_block_num_ho;
+  uint16_t                  min_block, max_block, block;
+  const osm_physp_t*        p_physp;
+  uint8_t                   min_position, max_position, position;
+
+  /* In switches, the port guid is the node guid. */
+  p_port =
+    __osm_mftr_get_port_by_guid( p_rcv, p_sw->p_node->node_info.port_guid );
+  if (! p_port)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_mftr_rcv_by_comp_mask: ERR 4A05: "
+             "Failed to find Port by Node Guid:0x%016" PRIx64
+             "\n",
+             cl_ntoh64( p_sw->p_node->node_info.node_guid )
+             );
+    return;
+  }
+
+  /* check that the requester physp and the current physp are under
+     the same partition. */
+  p_physp = osm_port_get_default_phys_ptr( p_port );
+  if (! p_physp)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "__osm_mftr_rcv_by_comp_mask: ERR 4A06: "
+             "Failed to find default physical Port by Node Guid:0x%016" PRIx64
+             "\n",
+             cl_ntoh64( p_sw->p_node->node_info.node_guid )
+             );
+    return;
+  }
+  if (! osm_physp_share_pkey( p_rcv->p_log, p_req_physp, p_physp ))
+    return;
+
+  /* get the LID range of the switch (its port 0), in host order */
+  osm_port_get_lid_range_ho( p_port, &min_lid_ho, &max_lid_ho );
+
+  /* compare the lids - if required */
+  if( comp_mask & IB_MFTR_COMPMASK_LID )
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+             "__osm_mftr_rcv_by_comp_mask: "
+             "Comparing lid:0x%02X to port lid range: 0x%02X .. 0x%02X\n",
+             cl_ntoh16( p_rcvd_rec->lid ), min_lid_ho, max_lid_ho
+             );
+    /* ok we are ready for range check */
+    if (min_lid_ho > cl_ntoh16(p_rcvd_rec->lid) ||
+        max_lid_ho < cl_ntoh16(p_rcvd_rec->lid))
+      return;
+  }
+
+  position_block_num_ho = cl_ntoh16(p_rcvd_rec->position_block_num);
+
+  /* now we need to decide which blocks to output */
+  if( comp_mask & IB_MFTR_COMPMASK_BLOCK )
+  {
+    max_block = min_block = position_block_num_ho & IB_MCAST_BLOCK_ID_MASK_HO;
+    if (max_block > osm_switch_get_mft_max_block_in_use( p_sw ) )
+      return;
+  }
+  else
+  {
+    /* use as many blocks as needed */
+    min_block = 0;
+    max_block = osm_switch_get_mft_max_block_in_use( p_sw );
+  }
+
+  /* need to decide which positions to output */
+  if ( comp_mask & IB_MFTR_COMPMASK_POSITION )
+  {
+    min_position = max_position = (position_block_num_ho & 0xF000) >> 12;
+    if (max_position > osm_switch_get_mft_max_position( p_sw ) )
+      return; 
+  }
+  else
+  {
+    /* use as many positions as needed */
+    min_position = 0;
+    max_position = osm_switch_get_mft_max_position( p_sw );
+  }
+
+  /* so we can add these one by one ... */
+  for (block = min_block; block <= max_block; block++)
+    for (position = min_position; position <= max_position; position++)
+      __osm_mftr_rcv_new_mftr( p_rcv, p_sw, p_ctxt->p_list,
+                               osm_port_get_base_lid(p_port),
+                               block, position );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_mftr_rcv_process(
+  IN osm_mftr_rcv_t*        const p_rcv,
+  IN const osm_madw_t*      const p_madw )
+{
+  const ib_sa_mad_t*        p_rcvd_mad;
+  const ib_mft_record_t*    p_rcvd_rec;
+  ib_mft_record_t*          p_resp_rec;
+  cl_qlist_t                rec_list;
+  osm_madw_t*               p_resp_madw;
+  ib_sa_mad_t*              p_resp_sa_mad;
+  uint32_t                  num_rec, pre_trim_num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  uint32_t                  trim_num_rec;
+#endif
+  uint32_t                  i;
+  osm_mftr_search_ctxt_t    context;
+  osm_mftr_item_t*          p_rec_item;
+  ib_api_status_t           status = IB_SUCCESS;
+  osm_physp_t*              p_req_physp;
+
+  CL_ASSERT( p_rcv );
+
+  OSM_LOG_ENTER( p_rcv->p_log, osm_mftr_rcv_process );
+
+  CL_ASSERT( p_madw );
+
+  p_rcvd_mad = osm_madw_get_sa_mad_ptr( p_madw );
+  p_rcvd_rec = (ib_mft_record_t*)ib_sa_mad_get_payload_ptr( p_rcvd_mad );
+
+  CL_ASSERT( p_rcvd_mad->attr_id == IB_MAD_ATTR_MFT_RECORD );
+
+  /* we only support SubnAdmGet and SubnAdmGetTable methods */
+  if ( (p_rcvd_mad->method != IB_MAD_METHOD_GET) &&
+       (p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) ) {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_mftr_rcv_process: ERR 4A08: "
+             "Unsupported Method (%s)\n",
+             ib_get_sa_method_str( p_rcvd_mad->method ) );
+    osm_sa_send_error( p_rcv->p_resp, p_madw, IB_MAD_STATUS_UNSUP_METHOD_ATTR );
+    goto Exit;
+  }
+
+  /* update the requester physical port. */
+  p_req_physp = osm_get_physp_by_mad_addr(p_rcv->p_log,
+                                          p_rcv->p_subn,
+                                          osm_madw_get_mad_addr_ptr(p_madw) );
+  if (p_req_physp == NULL)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+             "osm_mftr_rcv_process: ERR 4A07: "
+             "Cannot find requester physical port\n" );
+    goto Exit;
+  }
+
+  cl_qlist_init( &rec_list );
+
+  context.p_rcvd_rec     = p_rcvd_rec;
+  context.p_list         = &rec_list;
+  context.comp_mask      = p_rcvd_mad->comp_mask;
+  context.p_rcv          = p_rcv;
+  context.p_req_physp    = p_req_physp;
+
+  cl_plock_acquire( p_rcv->p_lock );
+
+  /* Go over all switches */
+  cl_qmap_apply_func( &p_rcv->p_subn->sw_guid_tbl,
+                      __osm_mftr_rcv_by_comp_mask,
+                      &context );
+
+  cl_plock_release( p_rcv->p_lock );
+
+  num_rec = cl_qlist_count( &rec_list );
+
+  /*
+   * C15-0.1.30:
+   * If we do a SubnAdmGet and got more than one record it is an error !
+   */
+  if (p_rcvd_mad->method == IB_MAD_METHOD_GET)
+  {
+    if (num_rec == 0)
+    {
+      osm_sa_send_error( p_rcv->p_resp, p_madw, IB_SA_MAD_STATUS_NO_RECORDS );
+      goto Exit;
+    }
+    if (num_rec > 1)
+    {
+      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+               "osm_mftr_rcv_process: ERR 4A09: "
+               "Got more than one record for SubnAdmGet (%u)\n",
+               num_rec );
+      osm_sa_send_error( p_rcv->p_resp, p_madw,
+                         IB_SA_MAD_STATUS_TOO_MANY_RECORDS);
+
+      /* return every collected record item to the pool */
+      p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list );
+      while( p_rec_item != (osm_mftr_item_t*)cl_qlist_end( &rec_list ) )
+      {
+        cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+        p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list );
+      }
+
+      goto Exit;
+    }
+  }
+
+  pre_trim_num_rec = num_rec;
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we limit the number of records to a single packet */
+  trim_num_rec = (MAD_BLOCK_SIZE - IB_SA_MAD_HDR_SIZE) / sizeof(ib_mft_record_t);
+  if (trim_num_rec < num_rec)
+  {
+    osm_log( p_rcv->p_log, OSM_LOG_VERBOSE,
+             "osm_mftr_rcv_process: "
+             "Number of records:%u trimmed to:%u to fit in one MAD\n",
+             num_rec, trim_num_rec );
+    num_rec = trim_num_rec;
+  }
+#endif
+
+  osm_log( p_rcv->p_log, OSM_LOG_DEBUG,
+           "osm_mftr_rcv_process: "
+           "Returning %u records\n", num_rec );
+
+  if ((p_rcvd_mad->method != IB_MAD_METHOD_GETTABLE) &&
+      (num_rec == 0))
+  {
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_NO_RECORDS );
+    goto Exit;
+  }
+
+  /* 
+   * Get a MAD to reply. Address of Mad is in the received mad_wrapper
+   */
+  p_resp_madw = osm_mad_pool_get( p_rcv->p_mad_pool,
+                                  p_madw->h_bind,
+                                  num_rec * sizeof(ib_mft_record_t) + IB_SA_MAD_HDR_SIZE,
+                                  &p_madw->mad_addr );
+
+  if( !p_resp_madw )
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_mftr_rcv_process: ERR 4A10: "
+            "osm_mad_pool_get failed\n" );
+
+    for( i = 0; i < pre_trim_num_rec; i++ )
+    {
+      p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list );
+      cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+    }
+
+    osm_sa_send_error( p_rcv->p_resp, p_madw,
+                       IB_SA_MAD_STATUS_NO_RESOURCES );
+
+    goto Exit;
+  }
+
+  p_resp_sa_mad = osm_madw_get_sa_mad_ptr( p_resp_madw );
+
+  /*
+    Copy the MAD header back into the response mad.
+    Set the 'R' bit and the payload length,
+    Then copy all records from the list into the response payload.
+  */
+
+  memcpy( p_resp_sa_mad, p_rcvd_mad, IB_SA_MAD_HDR_SIZE );
+  p_resp_sa_mad->method |= IB_MAD_METHOD_RESP_MASK;
+  /* C15-0.1.5 - always return SM_Key = 0 (table 185 p 884) */
+  p_resp_sa_mad->sm_key = 0;
+  /* Fill in the offset (paylen will be done by the rmpp SAR) */
+  p_resp_sa_mad->attr_offset =
+    ib_get_attr_offset( sizeof(ib_mft_record_t) );
+
+  p_resp_rec = (ib_mft_record_t*)ib_sa_mad_get_payload_ptr( p_resp_sa_mad );
+
+#ifndef VENDOR_RMPP_SUPPORT
+  /* we support only one packet RMPP - so we will set the first and
+     last flags for gettable */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+  {
+    p_resp_sa_mad->rmpp_type = IB_RMPP_TYPE_DATA;
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_FIRST | IB_RMPP_FLAG_LAST | IB_RMPP_FLAG_ACTIVE;
+  }
+#else
+  /* forcefully define the packet as RMPP one */
+  if (p_resp_sa_mad->method == IB_MAD_METHOD_GETTABLE_RESP)
+    p_resp_sa_mad->rmpp_flags = IB_RMPP_FLAG_ACTIVE;
+#endif
+
+  for( i = 0; i < pre_trim_num_rec; i++ )
+  {
+    p_rec_item = (osm_mftr_item_t*)cl_qlist_remove_head( &rec_list );
+    /* copy only if not trimmed */
+    if (i < num_rec)
+    {
+      *p_resp_rec = p_rec_item->rec;
+      p_resp_rec++;
+    }
+    cl_qlock_pool_put( &p_rcv->pool, &p_rec_item->pool_item );
+  }
+
+  CL_ASSERT( cl_is_qlist_empty( &rec_list ) );
+
+  status = osm_vendor_send( p_resp_madw->h_bind, p_resp_madw, FALSE );
+  if (status != IB_SUCCESS)
+  {
+    osm_log(p_rcv->p_log, OSM_LOG_ERROR,
+            "osm_mftr_rcv_process: ERR 4A11: "
+            "osm_vendor_send status = %s\n",
+            ib_get_err_str(status));
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( p_rcv->p_log );
+}
diff --git a/osm/opensm/osm_sa_mft_record_ctrl.c b/osm/opensm/osm_sa_mft_record_ctrl.c
new file mode 100644
index 0000000..cf433a9
--- /dev/null
+++ b/osm/opensm/osm_sa_mft_record_ctrl.c
@@ -0,0 +1,123 @@
+/*
+ * Copyright (c) 2004-2006 Voltaire, Inc. All rights reserved.
+ * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
+ * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+/*
+ * Abstract:
+ *    Implementation of osm_mftr_rcv_ctrl_t.
+ * This object represents the MulticastForwardingTable request controller object.
+ * This object is part of the opensm family of objects.
+ *
+ * Environment:
+ *    Linux User Mode
+ *
+ */
+
+#if HAVE_CONFIG_H
+#  include <config.h>
+#endif /* HAVE_CONFIG_H */
+
+#include <string.h>
+#include <opensm/osm_sa_mft_record_ctrl.h>
+#include <opensm/osm_msgdef.h>
+
+/**********************************************************************
+ **********************************************************************/
+void
+__osm_mftr_rcv_ctrl_disp_callback(
+  IN  void *context,
+  IN  void *p_data )
+{
+  /* hand the received MAD wrapper to the receiver; nothing is returned */
+  osm_mftr_rcv_process( ((osm_mftr_rcv_ctrl_t*)context)->p_rcv,
+                        (osm_madw_t*)p_data );
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_mftr_rcv_ctrl_construct(
+  IN osm_mftr_rcv_ctrl_t* const p_ctrl )
+{
+  memset( p_ctrl, 0, sizeof(*p_ctrl) );
+  p_ctrl->h_disp = CL_DISP_INVALID_HANDLE;
+}
+
+/**********************************************************************
+ **********************************************************************/
+void
+osm_mftr_rcv_ctrl_destroy(
+  IN osm_mftr_rcv_ctrl_t* const p_ctrl )
+{
+  CL_ASSERT( p_ctrl );
+  cl_disp_unregister( p_ctrl->h_disp );
+}
+
+/**********************************************************************
+ **********************************************************************/
+ib_api_status_t
+osm_mftr_rcv_ctrl_init(
+  IN osm_mftr_rcv_ctrl_t* const p_ctrl,
+  IN osm_mftr_rcv_t* const p_rcv,
+  IN osm_log_t* const p_log,
+  IN cl_dispatcher_t* const p_disp )
+{
+  ib_api_status_t status = IB_SUCCESS;
+
+  OSM_LOG_ENTER( p_log, osm_mftr_rcv_ctrl_init );
+
+  osm_mftr_rcv_ctrl_construct( p_ctrl );
+  p_ctrl->p_log = p_log;
+  p_ctrl->p_rcv = p_rcv;
+  p_ctrl->p_disp = p_disp;
+
+  p_ctrl->h_disp = cl_disp_register(
+    p_disp,
+    OSM_MSG_MAD_MFT_RECORD,
+    __osm_mftr_rcv_ctrl_disp_callback,
+    p_ctrl );
+
+  if( p_ctrl->h_disp == CL_DISP_INVALID_HANDLE )
+  {
+    osm_log( p_log, OSM_LOG_ERROR,
+             "osm_mftr_rcv_ctrl_init: ERR 4A01: "
+             "Dispatcher registration failed\n" );
+    status = IB_INSUFFICIENT_RESOURCES;
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( p_log );
+  return( status );
+}

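A note on the position_block_num handling in __osm_mftr_rcv_new_mftr and
__osm_mftr_rcv_by_comp_mask above: the 16-bit field carries the port-mask
position in its top 4 bits and the block number in its low 9 bits. The
standalone sketch below shows that packing in host byte order. The helper
names are hypothetical and not part of OpenSM, and the 0x01FF mask is an
assumed stand-in for IB_MCAST_BLOCK_ID_MASK_HO; on the wire the value would
still need cl_hton16() as in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 9-bit block mask, standing in for IB_MCAST_BLOCK_ID_MASK_HO. */
#define SKETCH_MFT_BLOCK_ID_MASK  0x01FFu
/* The position occupies the top 4 bits of the 16-bit field. */
#define SKETCH_MFT_POSITION_SHIFT 12

/* Pack a position (0..15) and a block number into one host-order value. */
static uint16_t sketch_mft_pack(uint8_t position, uint16_t block)
{
	assert(position <= 0x0F);
	return (uint16_t)(((uint16_t)position << SKETCH_MFT_POSITION_SHIFT) |
			  (block & SKETCH_MFT_BLOCK_ID_MASK));
}

/* Extract the position from a host-order position_block_num value. */
static uint8_t sketch_mft_position(uint16_t position_block_num)
{
	return (uint8_t)((position_block_num & 0xF000u) >>
			 SKETCH_MFT_POSITION_SHIFT);
}

/* Extract the block number from a host-order position_block_num value. */
static uint16_t sketch_mft_block(uint16_t position_block_num)
{
	return (uint16_t)(position_block_num & SKETCH_MFT_BLOCK_ID_MASK);
}
```

Block bits above the mask are silently dropped by the pack helper, which
matches what the masking expression in the patch does.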

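On the single-packet trimming in osm_mftr_rcv_process (the
!VENDOR_RMPP_SUPPORT branch of the patch above): the record limit is the MAD
payload size divided by the record size. The sketch below reproduces that
arithmetic in isolation. The constants 256 and 56 and the record layout are
assumptions standing in for MAD_BLOCK_SIZE, IB_SA_MAD_HDR_SIZE and
ib_mft_record_t; check them against ib_types.h before relying on the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for MAD_BLOCK_SIZE and IB_SA_MAD_HDR_SIZE. */
#define SKETCH_MAD_BLOCK_SIZE  256u
#define SKETCH_SA_MAD_HDR_SIZE 56u

/* Assumed layout mirroring ib_mft_record_t (hypothetical type name). */
typedef struct sketch_mft_record {
	uint16_t lid;
	uint16_t position_block_num;
	uint32_t resv;
	uint16_t mft[32];	/* one block of 32 port-mask entries */
} sketch_mft_record_t;

/* Clamp a record count to what fits in the payload of one MAD. */
static uint32_t sketch_trim_num_rec(uint32_t num_rec)
{
	const uint32_t max_rec = (uint32_t)
		((SKETCH_MAD_BLOCK_SIZE - SKETCH_SA_MAD_HDR_SIZE) /
		 sizeof(sketch_mft_record_t));

	return num_rec > max_rec ? max_rec : num_rec;
}
```

Under these assumed sizes only a couple of records fit in one MAD, so a
GetTable query that matches many switches is truncated heavily unless RMPP
support is compiled in.
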
From halr at voltaire.com  Fri Dec 29 09:12:22 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 12:12:22 -0500
Subject: [openib-general] [PATCH 3/4] OpenSM: Other changes to incorporate
 optional SA MFTRecord support
Message-ID: <1167412341.29620.227807.camel@hal.voltaire.com>

OpenSM: Other changes to incorporate optional SA MFTRecord support

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/include/Makefile.am b/osm/include/Makefile.am
index d051b9a..ea8ab10 100644
--- a/osm/include/Makefile.am
+++ b/osm/include/Makefile.am
@@ -28,6 +28,7 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_sa_service_record_ctrl.h \
 	$(srcdir)/opensm/osm_pkey_rcv_ctrl.h \
 	$(srcdir)/opensm/osm_sa_lft_record.h \
+	$(srcdir)/opensm/osm_sa_mft_record.h \
 	$(srcdir)/opensm/osm_resp.h \
 	$(srcdir)/opensm/osm_partition.h \
 	$(srcdir)/opensm/osm_slvl_map_rcv_ctrl.h \
@@ -47,6 +48,7 @@ EXTRA_DIST = \
 	$(srcdir)/opensm/osm_sminfo_rcv_ctrl.h \
 	$(srcdir)/opensm/osm_sa_pkey_record.h \
 	$(srcdir)/opensm/osm_sa_lft_record_ctrl.h \
+	$(srcdir)/opensm/osm_sa_mft_record_ctrl.h \
 	$(srcdir)/opensm/osm_inform.h \
 	$(srcdir)/opensm/osm_path.h \
 	$(srcdir)/opensm/osm_lin_fwd_rcv.h \
diff --git a/osm/include/opensm/osm_msgdef.h b/osm/include/opensm/osm_msgdef.h
index 3611025..87c943f 100644
--- a/osm/include/opensm/osm_msgdef.h
+++ b/osm/include/opensm/osm_msgdef.h
@@ -196,6 +196,7 @@ enum
 	OSM_MSG_MAD_GUIDINFO_RECORD,
 	OSM_MSG_MAD_INFORM_INFO_RECORD,
 	OSM_MSG_MAD_SWITCH_INFO_RECORD,
+	OSM_MSG_MAD_MFT_RECORD,
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
 	OSM_MSG_MAD_MULTIPATH_RECORD,
 #endif
diff --git a/osm/include/opensm/osm_sa.h b/osm/include/opensm/osm_sa.h
index ae8d5ac..1508f44 100644
--- a/osm/include/opensm/osm_sa.h
+++ b/osm/include/opensm/osm_sa.h
@@ -77,6 +77,7 @@
 #include <opensm/osm_sa_pkey_record_ctrl.h>
 #include <opensm/osm_sa_lft_record_ctrl.h>
 #include <opensm/osm_sa_sw_info_record_ctrl.h>
+#include <opensm/osm_sa_mft_record_ctrl.h>
 
 #ifdef __cplusplus
 #  define BEGIN_C_DECLS extern "C" {
@@ -195,6 +196,10 @@ typedef struct _osm_sa
 	/* SwitchInfo Query */
 	osm_sir_rcv_t				sir_rcv;
 	osm_sir_rcv_ctrl_t			sir_rcv_ctrl;
+
+	/* MulticastForwardingTable Query */
+	osm_mftr_rcv_t				mftr_rcv;
+	osm_mftr_rcv_ctrl_t			mftr_rcv_ctrl;
 } osm_sa_t;
 /*
 * FIELDS
diff --git a/osm/opensm/Makefile.am b/osm/opensm/Makefile.am
index aed60d7..8f42387 100644
--- a/osm/opensm/Makefile.am
+++ b/osm/opensm/Makefile.am
@@ -43,7 +43,8 @@ opensm_SOURCES = main.c osm_console.c os
 		 osm_resp.c osm_sa.c osm_sa_class_port_info.c \
 		 osm_sa_class_port_info_ctrl.c osm_sa_informinfo.c \
 		 osm_sa_informinfo_ctrl.c osm_sa_lft_record.c \
-		 osm_sa_lft_record_ctrl.c osm_sa_link_record.c \
+		 osm_sa_lft_record_ctrl.c osm_sa_mft_record.c \
+		 osm_sa_mft_record_ctrl.c osm_sa_link_record.c \
 		 osm_sa_link_record_ctrl.c osm_sa_mad_ctrl.c \
 		 osm_sa_mcmember_record.c osm_sa_mcmember_record_ctrl.c \
 		 osm_sa_node_record.c osm_sa_node_record_ctrl.c \
diff --git a/osm/opensm/osm_sa.c b/osm/opensm/osm_sa.c
index 983d5e5..7a993f1 100644
--- a/osm/opensm/osm_sa.c
+++ b/osm/opensm/osm_sa.c
@@ -131,6 +131,9 @@ osm_sa_construct(
 
   osm_sir_rcv_construct( &p_sa->sir_rcv );
   osm_sir_rcv_ctrl_construct( &p_sa->sir_rcv_ctrl );
+
+  osm_mftr_rcv_construct( &p_sa->mftr_rcv );
+  osm_mftr_rcv_ctrl_construct( &p_sa->mftr_rcv_ctrl );
 }
 
 /**********************************************************************
@@ -163,6 +166,7 @@ osm_sa_shutdown(
   osm_pkey_rec_rcv_ctrl_destroy( &p_sa->pkey_rec_rcv_ctrl );
   osm_lftr_rcv_ctrl_destroy( &p_sa->lftr_rcv_ctrl );
   osm_sir_rcv_ctrl_destroy( &p_sa->sir_rcv_ctrl );
+  osm_mftr_rcv_ctrl_destroy( &p_sa->mftr_rcv_ctrl );
   osm_sa_mad_ctrl_destroy( &p_sa->mad_ctrl );
 
   OSM_LOG_EXIT( p_sa->p_log );
@@ -195,6 +199,7 @@ osm_sa_destroy(
   osm_pkey_rec_rcv_destroy( &p_sa->pkey_rec_rcv );
   osm_lftr_rcv_destroy( &p_sa->lftr_rcv );
   osm_sir_rcv_destroy( &p_sa->sir_rcv );
+  osm_mftr_rcv_destroy( &p_sa->mftr_rcv );
   osm_sa_resp_destroy( &p_sa->resp );
 
   OSM_LOG_EXIT( p_sa->p_log );
@@ -537,6 +542,24 @@ osm_sa_init(
   if( status != IB_SUCCESS )
     goto Exit;
 
+  status = osm_mftr_rcv_init(
+    &p_sa->mftr_rcv,
+    &p_sa->resp,
+    p_sa->p_mad_pool,
+    p_subn,
+    p_log,
+    p_lock);
+  if( status != IB_SUCCESS )
+    goto Exit;
+
+  status = osm_mftr_rcv_ctrl_init(
+    &p_sa->mftr_rcv_ctrl,
+    &p_sa->mftr_rcv,
+    p_log,
+    p_disp );
+  if( status != IB_SUCCESS )
+    goto Exit;
+
  Exit:
   OSM_LOG_EXIT( p_log );
   return( status );
diff --git a/osm/opensm/osm_sa_class_port_info.c b/osm/opensm/osm_sa_class_port_info.c
index 4d7bcbb..84fa016 100644
--- a/osm/opensm/osm_sa_class_port_info.c
+++ b/osm/opensm/osm_sa_class_port_info.c
@@ -195,7 +195,6 @@ __osm_cpi_rcv_respond(
   /* we do not support the following optional records:
      OSM_CAP_IS_SUBN_OPT_RECS_SUP :
      RandomForwardingTableRecord,
-     MulticastForwardingTableRecord,
      ServiceAssociationRecord
      other optional records supported "under the table"
 
diff --git a/osm/opensm/osm_sa_mad_ctrl.c b/osm/opensm/osm_sa_mad_ctrl.c
index 90c732d..85d0b2a 100644
--- a/osm/opensm/osm_sa_mad_ctrl.c
+++ b/osm/opensm/osm_sa_mad_ctrl.c
@@ -216,6 +216,10 @@ __osm_sa_mad_ctrl_process(
     msg_id = OSM_MSG_MAD_SWITCH_INFO_RECORD;
     break;
 
+  case IB_MAD_ATTR_MFT_RECORD:
+    msg_id = OSM_MSG_MAD_MFT_RECORD;
+    break;
+
 #if defined (VENDOR_RMPP_SUPPORT) && defined (DUAL_SIDED_RMPP)
   case IB_MAD_ATTR_MULTIPATH_RECORD:
     msg_id = OSM_MSG_MAD_MULTIPATH_RECORD;


From halr at voltaire.com  Fri Dec 29 09:12:30 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 29 Dec 2006 12:12:30 -0500
Subject: [openib-general] [PATCH 4/4] osmtest/osmtest.c: Add SA MFTRecord
	tests
Message-ID: <1167412348.29620.227809.camel@hal.voltaire.com>

osmtest/osmtest.c: Add SA MFTRecord tests

Signed-off-by: Hal Rosenstock <halr at voltaire.com>

diff --git a/osm/osmtest/osmtest.c b/osm/osmtest/osmtest.c
index 3dd229c..ba42fc6 100644
--- a/osm/osmtest/osmtest.c
+++ b/osm/osmtest/osmtest.c
@@ -4854,6 +4854,93 @@ osmtest_get_lft_rec_by_lid( IN osmtest_t
 }
 
 /**********************************************************************
+ * Get MFT record by LID
+ **********************************************************************/
+ib_api_status_t
+osmtest_get_mft_rec_by_lid( IN osmtest_t * const p_osmt,
+                            IN ib_net16_t const  lid,
+                            IN OUT osmtest_req_context_t * const p_context )
+{
+  ib_api_status_t status = IB_SUCCESS;
+  osmv_user_query_t user;
+  osmv_query_req_t req;
+  ib_mft_record_t record;
+  ib_mad_t *p_mad;
+
+  OSM_LOG_ENTER( &p_osmt->log, osmtest_get_mft_rec_by_lid );
+
+  if( osm_log_is_active( &p_osmt->log, OSM_LOG_VERBOSE ) )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_VERBOSE,
+             "osmtest_get_mft_rec_by_lid: "
+             "Getting MFT record for LID 0x%02X\n",
+             cl_ntoh16( lid ) );
+  }
+
+  /*
+   * Do a blocking query for this record in the subnet.
+   * The result is returned in the result field of the caller's
+   * context structure.
+   *
+   * The query structures are locals.
+   */
+  memset( &req, 0, sizeof( req ) );
+  memset( &user, 0, sizeof( user ) );
+  memset( &record, 0, sizeof( record ) );
+
+  record.lid = lid;
+  p_context->p_osmt = p_osmt;
+  if (lid)
+    user.comp_mask = IB_MFTR_COMPMASK_LID;
+  user.attr_id = IB_MAD_ATTR_MFT_RECORD;
+  user.attr_offset = cl_ntoh16( ( uint16_t ) ( sizeof( record ) >> 3 ) );
+  user.p_attr = &record;
+
+  req.query_type = OSMV_QUERY_USER_DEFINED;
+  req.timeout_ms = p_osmt->opt.transaction_timeout;
+  req.retry_cnt = p_osmt->opt.retry_count;
+    
+  req.flags = OSM_SA_FLAGS_SYNC;
+  req.query_context = p_context;
+  req.pfn_query_cb = osmtest_query_res_cb; 
+  req.p_query_input = &user;
+  req.sm_key = 0;
+
+  status = osmv_query_sa( p_osmt->h_bind, &req );
+  if( status != IB_SUCCESS )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_get_mft_rec_by_lid: ERR 009B: "
+             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    goto Exit;
+  }
+
+  status = p_context->result.status;
+
+  if( status != IB_SUCCESS )
+  {
+    osm_log( &p_osmt->log, OSM_LOG_ERROR,
+             "osmtest_get_mft_rec_by_lid: ERR 009C: "
+             "ib_query failed (%s)\n", ib_get_err_str( status ) );
+    if( status == IB_REMOTE_ERROR )
+    {
+      p_mad = osm_madw_get_mad_ptr( p_context->result.p_result_madw );
+      osm_log( &p_osmt->log, OSM_LOG_ERROR,
+               "osmtest_get_mft_rec_by_lid: "
+               "Remote error = %s\n",
+               ib_get_mad_status_str( p_mad ));
+
+      status = (ib_net16_t) (p_mad->status & IB_SMP_STATUS_MASK );
+    }
+    goto Exit;
+  }
+
+ Exit:
+  OSM_LOG_EXIT( &p_osmt->log );
+  return ( status );
+}
+
+/**********************************************************************
  **********************************************************************/
 static ib_api_status_t
 osmtest_sminfo_record_request(
@@ -5933,6 +6020,17 @@ osmtest_validate_against_db( IN osmtest_
   if ( status != IB_SUCCESS )
     goto Exit;
 
+  /* MFT Record tests */
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_get_mft_rec_by_lid( p_osmt, 0, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+  
+  memset( &context, 0, sizeof( context ) );
+  status = osmtest_get_mft_rec_by_lid( p_osmt, test_lid, &context );
+  if ( status != IB_SUCCESS )
+    goto Exit;
+
   /* Some LinkRecord tests */
   /* FromLID */
   memset( &context, 0, sizeof( context ) );
@@ -6288,6 +6386,12 @@ osmtest_validate_against_db( IN osmtest_
     if ( status != IB_SUCCESS )
       goto Exit;
 
+    /* Another MFT Record test */
+    memset( &context, 0, sizeof( context ) );
+    status = osmtest_get_mft_rec_by_lid( p_osmt, test_lid, &context );
+    if ( status != IB_SUCCESS )
+      goto Exit;
+
     /* More LinkRecord tests */
     /* FromLID */
     memset( &context, 0, sizeof( context ) );


From sean.hefty at intel.com  Fri Dec 29 14:21:28 2006
From: sean.hefty at intel.com (Hefty, Sean)
Date: Fri, 29 Dec 2006 14:21:28 -0800
Subject: [openib-general] [PATCH] rdma_cm: avoid port reuse after close
Message-ID: <E02592704DB9854BB0B2E7DD3A633B7702C127B4@orsmsx412.amr.corp.intel.com>

Randomize the starting port number, and avoid re-using port values
immediately after they are closed.  Instead, track the last port
value used and increment it every time a new port number is
assigned.

These changes are in response to Michael's comments from this (old)
thread:

http://openib.org/pipermail/openib-general/2006-September/025996.html

Signed-off-by: Sean Hefty <sean.hefty at intel.com>
---
I'm not sure if this is still needed, but I had it on my list of things
to someday try to do.  This should apply to 2.6.20-rc2.

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 533193d..23fdc45 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -71,6 +71,7 @@ static struct workqueue_struct *cma_wq;
 static DEFINE_IDR(sdp_ps);
 static DEFINE_IDR(tcp_ps);
 static DEFINE_IDR(udp_ps);
+static int next_port;
 
 struct cma_device {
 	struct list_head	list;
@@ -1711,33 +1712,74 @@ static int cma_alloc_port(struct idr *ps
 			  unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
-	int port, start, ret;
+	int port, ret;
 
 	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
 	if (!bind_list)
 		return -ENOMEM;
 
-	start = snum ? snum : sysctl_local_port_range[0];
+	do {
+		ret = idr_get_new_above(ps, bind_list, snum, &port);
+	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
+
+	if (ret)
+		goto err1;
+
+	if (port != snum) {
+		ret = -EADDRNOTAVAIL;
+		goto err2;
+	}
+
+	bind_list->ps = ps;
+	bind_list->port = (unsigned short) port;
+	cma_bind_port(bind_list, id_priv);
+	return 0;
+err2:
+	idr_remove(ps, port);
+err1:
+	kfree(bind_list);
+	return ret;
+}
 
+static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
+{
+	struct rdma_bind_list *bind_list;
+	int port, ret;
+
+	bind_list = kzalloc(sizeof *bind_list, GFP_KERNEL);
+	if (!bind_list)
+		return -ENOMEM;
+
+retry:
 	do {
-		ret = idr_get_new_above(ps, bind_list, start, &port);
+		ret = idr_get_new_above(ps, bind_list, next_port, &port);
 	} while ((ret == -EAGAIN) && idr_pre_get(ps, GFP_KERNEL));
 
 	if (ret)
-		goto err;
+		goto err1;
 
-	if ((snum && port != snum) ||
-	    (!snum && port > sysctl_local_port_range[1])) {
-		idr_remove(ps, port);
+	if (port > sysctl_local_port_range[1]) {
+		if (next_port != sysctl_local_port_range[0]) {
+			idr_remove(ps, port);
+			next_port = sysctl_local_port_range[0];
+			goto retry;
+		}
 		ret = -EADDRNOTAVAIL;
-		goto err;
+		goto err2;
 	}
 
+	if (port == sysctl_local_port_range[1])
+		next_port = sysctl_local_port_range[0];
+	else
+		next_port = port + 1;
+
 	bind_list->ps = ps;
 	bind_list->port = (unsigned short) port;
 	cma_bind_port(bind_list, id_priv);
 	return 0;
-err:
+err2:
+	idr_remove(ps, port);
+err1:
 	kfree(bind_list);
 	return ret;
 }
@@ -1800,7 +1842,7 @@ static int cma_get_port(struct rdma_id_p
 
 	mutex_lock(&lock);
 	if (cma_any_port(&id_priv->id.route.addr.src_addr))
-		ret = cma_alloc_port(ps, id_priv, 0);
+		ret = cma_alloc_any_port(ps, id_priv);
 	else
 		ret = cma_use_port(ps, id_priv);
 	mutex_unlock(&lock);
@@ -2437,6 +2479,10 @@ static int cma_init(void)
 {
 	int ret;
 
+	get_random_bytes(&next_port, sizeof next_port);
+	next_port = (next_port % (sysctl_local_port_range[1] -
+				  sysctl_local_port_range[0])) +
+		    sysctl_local_port_range[0];
 	cma_wq = create_singlethread_workqueue("rdma_cm_wq");
 	if (!cma_wq)
 		return -ENOMEM;
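
The allocation policy described above (start from a randomized port, hand out
the next free one, wrap around at the top of the range) can be tried out in
user space. The following is a minimal sketch, not the kernel code: a plain
array stands in for the idr, and the names port_in_use, seed_next_port,
alloc_any_port, and free_port, as well as the range 32768-61000, are
assumptions made for illustration. The sketch reduces the random seed as an
unsigned value, because in C the remainder of a negative int can itself be
negative, which would place the starting point below the range.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for sysctl_local_port_range[0] and [1]. */
enum { PORT_LOW = 32768, PORT_HIGH = 61000 };

bool port_in_use[PORT_HIGH + 1];	/* stands in for the idr */
int next_port = PORT_LOW;

/* Map an arbitrary random word into [PORT_LOW, PORT_HIGH). */
void seed_next_port(unsigned int random_word)
{
	next_port = (int)(random_word % (unsigned int)(PORT_HIGH - PORT_LOW))
		    + PORT_LOW;
}

/*
 * Return the first free port at or above next_port, wrapping to
 * PORT_LOW once, or -1 if every port in the range is taken.
 */
int alloc_any_port(void)
{
	int start = next_port;
	int port = start;

	do {
		if (!port_in_use[port]) {
			port_in_use[port] = true;
			next_port = (port == PORT_HIGH) ? PORT_LOW : port + 1;
			return port;
		}
		port = (port == PORT_HIGH) ? PORT_LOW : port + 1;
	} while (port != start);

	return -1;
}

/* Release a port; it stays behind next_port until the counter wraps. */
void free_port(int port)
{
	assert(port >= PORT_LOW && port <= PORT_HIGH);
	port_in_use[port] = false;
}
```

Because next_port only moves forward, a port released by free_port() is
normally not handed out again until the counter has wrapped around the range.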


From eitan at sw053.yok.mtl.com  Fri Dec 29 21:25:38 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Sat, 30 Dec 2006 07:25:38 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal completion
Message-ID: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
Total=405 Pass=330 Fail=75

Pass:
45 Stability IS1-16.topo
45 Pkey IS1-16.topo
45 OsmStress IS1-16.topo
45 Multicast IS1-16.topo
45 LidMgr IS1-16.topo
15 Stability IS3-loop.topo
15 Stability IS3-128.topo
15 Pkey IS3-128.topo
15 OsmStress IS3-128.topo
15 Multicast IS3-loop.topo
15 Multicast IS3-128.topo
15 LidMgr IS3-128.topo

Failures:
45 OsmTest IS1-16.topo
15 OsmTest IS3-loop.topo
15 OsmTest IS3-128.topo


From halr at voltaire.com  Sat Dec 30 04:09:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2006 07:09:18 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal
	completion
In-Reply-To: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
Message-ID: <1167480536.29620.286425.camel@hal.voltaire.com>

Hi Eitan,

On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
> Total=405 Pass=330 Fail=75
> 
> Pass:
> 45 Stability IS1-16.topo
> 45 Pkey IS1-16.topo
> 45 OsmStress IS1-16.topo
> 45 Multicast IS1-16.topo
> 45 LidMgr IS1-16.topo
> 15 Stability IS3-loop.topo
> 15 Stability IS3-128.topo
> 15 Pkey IS3-128.topo
> 15 OsmStress IS3-128.topo
> 15 Multicast IS3-loop.topo
> 15 Multicast IS3-128.topo
> 15 LidMgr IS3-128.topo
> 
> Failures:
> 45 OsmTest IS1-16.topo
> 15 OsmTest IS3-loop.topo
> 15 OsmTest IS3-128.topo

Any idea on these osmtest failures ? I did add SA MFTRecord yesterday
and made a change to SA LFTRecord and SwitchInfoRecord the day before as
well as additional osmtests for MFTRecord and LFTRecord.

Also, why are osmtest failures allowed for "normal completion" ?

-- Hal


From eitan at mellanox.co.il  Sat Dec 30 13:03:25 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 30 Dec 2006 23:03:25 +0200
Subject: [openib-general] [PATCH] osm: fat-tree documentation
In-Reply-To: <1167410047.29620.225730.camel@hal.voltaire.com>
References: <45929D0B.3090308@dev.mellanox.co.il>
	<1167240747.29620.77561.camel@hal.voltaire.com>
	<1167410047.29620.225730.camel@hal.voltaire.com>
Message-ID: <4596D41D.3080607@mellanox.co.il>

Hal Rosenstock wrote:
> On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote:
>   
>> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote:
>>     
>>> Hi Hal.
>>>
>>> Added fat-tree routing details and some cosmetics in the txt files.
>>>
>>> --
>>> Yevgeny
>>>
>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>>>       
>> Thanks. Applied.
>>
>> A couple of minor questions:
>>
>> Should similar text as in current-routing.txt be added to the OpenSM man
>> page ?
>>     
>
> I took care of making the man page including the fat tree routing
> information you put into current-routing.txt.
>
> The question below is outstanding:
>
>   
>> Also, rather than HCA in the below, is CA better (to include TCAs as
>> well) ?
>>     
>
>   
I agree CA is better than HCA.
Hal, can you take it, or do you want a patch?
> Thanks.
>
> -- Hal
>
>   
>> -- Hal
>>     
>
>
> _______________________________________________
> openib-general mailing list
> openib-general at openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>   


From kliteyn at dev.mellanox.co.il  Sat Dec 30 13:07:10 2006
From: kliteyn at dev.mellanox.co.il (Yevgeny Kliteynik)
Date: Sat, 30 Dec 2006 23:07:10 +0200
Subject: [openib-general] [PATCH] osm: fat-tree documentation
In-Reply-To: <1167410047.29620.225730.camel@hal.voltaire.com>
References: <45929D0B.3090308@dev.mellanox.co.il>
	<1167240747.29620.77561.camel@hal.voltaire.com>
	<1167410047.29620.225730.camel@hal.voltaire.com>
Message-ID: <4596D4FE.4000307@dev.mellanox.co.il>

Hal Rosenstock wrote:
> On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote:
>> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote:
>>> Hi Hal.
>>>
>>> Added fat-tree routing details and some cosmetics in the txt files.
>>>
>>> --
>>> Yevgeny
>>>
>>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
>> Thanks. Applied.
>>
>> A couple of minor questions:
>>
>> Should similar text as in current-routing.txt be added to the OpenSM man
>> page ?
> 
> I took care of making the man page including the fat tree routing
> information you put into current-routing.txt.

Thanks.

> The question below is outstanding:
> 
>> Also, rather than HCA in the below, is CA better (to include TCAs as
>> well) ?

Right, CA is better.

-- Yevgeny
 
> Thanks.
> 
> -- Hal
> 
>> -- Hal
> 


From eitan at mellanox.co.il  Sat Dec 30 13:12:01 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 30 Dec 2006 23:12:01 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal
 completion
In-Reply-To: <1167480536.29620.286425.camel@hal.voltaire.com>
References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
	<1167480536.29620.286425.camel@hal.voltaire.com>
Message-ID: <4596D621.80207@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote:
>   
>> OSM Simulation Regression Summary
>> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
>> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
>> Total=405 Pass=330 Fail=75
>>
>> Pass:
>> 45 Stability IS1-16.topo
>> 45 Pkey IS1-16.topo
>> 45 OsmStress IS1-16.topo
>> 45 Multicast IS1-16.topo
>> 45 LidMgr IS1-16.topo
>> 15 Stability IS3-loop.topo
>> 15 Stability IS3-128.topo
>> 15 Pkey IS3-128.topo
>> 15 OsmStress IS3-128.topo
>> 15 Multicast IS3-loop.topo
>> 15 Multicast IS3-128.topo
>> 15 LidMgr IS3-128.topo
>>
>> Failures:
>> 45 OsmTest IS1-16.topo
>> 15 OsmTest IS3-loop.topo
>> 15 OsmTest IS3-128.topo
>>     
>
> Any idea on these osmtest failures ? I did add SA MFTRecord yesterday
> and made a change to SA LFTRecord and SwitchInfoRecord the day before as
> well as additional osmtests for MFTRecord and LFTRecord.
>   
I get:
Dec 30 07:13:20 163508 [B7F1F8E0] -> osmtest_get_sw_info_rec_by_lid: Getting SwitchInfo record for LID 0x01
Dec 30 07:13:20 165737 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x00
Dec 30 07:13:20 169968 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x01
Dec 30 07:13:20 172573 [B7F1F8E0] -> osmtest_get_mft_rec_by_lid: Getting MFT record for LID 0x00
Dec 30 07:13:50 182807 [B7F1EBB0] -> __osmv_txn_timeout_cb: ERR 6702: The transaction request (tid=0x26) timed out (after 4 retries). Invoking the error callback.
Dec 30 07:13:50 182964 [B7F1EBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT)

I wonder where the LID=0 comes from. It might be a simulation issue, but I am
not sure. I will double check tomorrow.

> Also, why are osmtest failures allowed for "normal completion" ?
>
"Normal completion" means the run completed without resource issues such as
"disk full".
> -- Hal


From halr at voltaire.com  Sat Dec 30 13:21:17 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2006 16:21:17 -0500
Subject: [openib-general] [PATCH] osm: fat-tree documentation
In-Reply-To: <4596D41D.3080607@mellanox.co.il>
References: <45929D0B.3090308@dev.mellanox.co.il>
	<1167240747.29620.77561.camel@hal.voltaire.com>
	<1167410047.29620.225730.camel@hal.voltaire.com>
	<4596D41D.3080607@mellanox.co.il>
Message-ID: <1167513670.29620.315478.camel@hal.voltaire.com>

On Sat, 2006-12-30 at 16:03, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > On Wed, 2006-12-27 at 12:32, Hal Rosenstock wrote:
> >   
> >> On Wed, 2006-12-27 at 11:19, Yevgeny Kliteynik wrote:
> >>     
> >>> Hi Hal.
> >>>
> >>> Added fat-tree routing details and some cosmetics in the txt files.
> >>>
> >>> --
> >>> Yevgeny
> >>>
> >>> Signed-off-by: Yevgeny Kliteynik <kliteyn at dev.mellanox.co.il>
> >>>       
> >> Thanks. Applied.
> >>
> >> A couple of minor questions:
> >>
> >> Should similar text as in current-routing.txt be added to the OpenSM man
> >> page ?
> >>     
> >
> > I took care of making the man page including the fat tree routing
> > information you put into current-routing.txt.
> >
> > The question below is outstanding:
> >
> >   
> >> Also, rather than HCA in the below, is CA better (to include TCAs as
> >> well) ?
> >>     
> >
> >   
> I agree CA is better than HCA.
> Hal, can you take it, or do you want a patch?

I'll change this.

-- Hal

> > Thanks.
> >
> > -- Hal
> >
> >   
> >> -- Hal
> >>     


From halr at voltaire.com  Sat Dec 30 13:24:29 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2006 16:24:29 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal
 completion
In-Reply-To: <4596D621.80207@mellanox.co.il>
References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
	<1167480536.29620.286425.camel@hal.voltaire.com>
	<4596D621.80207@mellanox.co.il>
Message-ID: <1167513866.29620.315610.camel@hal.voltaire.com>

On Sat, 2006-12-30 at 16:12, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote:
> >   
> >> OSM Simulation Regression Summary
> >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
> >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
> >> Total=405 Pass=330 Fail=75
> >>
> >> Pass:
> >> 45 Stability IS1-16.topo
> >> 45 Pkey IS1-16.topo
> >> 45 OsmStress IS1-16.topo
> >> 45 Multicast IS1-16.topo
> >> 45 LidMgr IS1-16.topo
> >> 15 Stability IS3-loop.topo
> >> 15 Stability IS3-128.topo
> >> 15 Pkey IS3-128.topo
> >> 15 OsmStress IS3-128.topo
> >> 15 Multicast IS3-loop.topo
> >> 15 Multicast IS3-128.topo
> >> 15 LidMgr IS3-128.topo
> >>
> >> Failures:
> >> 45 OsmTest IS1-16.topo
> >> 15 OsmTest IS3-loop.topo
> >> 15 OsmTest IS3-128.topo
> >>     
> >
> > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday
> > and made a change to SA LFTRecord and SwitchInfoRecord the day before as
> > well as additional osmtests for MFTRecord and LFTRecord.
> >   
> I get:
> Dec 30 07:13:20 163508 [B7F1F8E0] -> osmtest_get_sw_info_rec_by_lid: Getting SwitchInfo record for LID 0x01
> Dec 30 07:13:20 165737 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x00
> Dec 30 07:13:20 169968 [B7F1F8E0] -> osmtest_get_lft_rec_by_lid: Getting LFT record for LID 0x01
> Dec 30 07:13:20 172573 [B7F1F8E0] -> osmtest_get_mft_rec_by_lid: Getting MFT record for LID 0x00
> Dec 30 07:13:50 182807 [B7F1EBB0] -> __osmv_txn_timeout_cb: ERR 6702: The transaction request (tid=0x26) timed out (after 4 retries). Invoking the error callback.
> Dec 30 07:13:50 182964 [B7F1EBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT)
> 
> I wonder where the LID=0 comes from.

This is currently by "design". LID 0 is used as a wildcard, rather than
adding a parameter to set the component mask:

  /* LFT Record tests */
  memset( &context, 0, sizeof( context ) );
  status = osmtest_get_lft_rec_by_lid( p_osmt, 0, &context );

...

  /* MFT Record tests */
  memset( &context, 0, sizeof( context ) );
  status = osmtest_get_mft_rec_by_lid( p_osmt, 0, &context );

It seems like you might not have rebuilt OpenSM properly though to add
the SA MFTRecord handler.
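
To illustrate the wildcard convention: the sketch below builds a query in
which a LID of 0 leaves the component mask empty (so every record matches),
and any other LID sets the LID bit. The struct, the function make_query, and
the bit value of COMPMASK_LID are simplified stand-ins assumed for this
example; they are not the osmtest or ib_types.h definitions.

```c
#include <stdint.h>

/* Hypothetical bit; stands in for IB_MFTR_COMPMASK_LID. */
#define COMPMASK_LID (UINT64_C(1) << 0)

struct query {
	uint16_t lid;		/* LID to match; ignored if its mask bit is clear */
	uint64_t comp_mask;	/* which record fields should be compared */
};

/* LID 0 acts as a wildcard: no component bits set, so all records match. */
struct query make_query(uint16_t lid)
{
	struct query q = { .lid = lid, .comp_mask = 0 };

	if (lid != 0)
		q.comp_mask |= COMPMASK_LID;
	return q;
}
```

An explicit flag argument would avoid overloading LID 0, at the cost of
changing every caller.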

> Might be a simulation issue but not
> sure. I will double check tomorrow.
> 
> > Also, why are osmtest failures allowed for "normal completion" ?
> >   
> "Normal completion" means the run completed without resource issues such as
> "disk full".

OK; I thought normal completion indicated something about success or
failure.

-- Hal

> > -- Hal


From eitan at mellanox.co.il  Sat Dec 30 13:33:57 2006
From: eitan at mellanox.co.il (Eitan Zahavi)
Date: Sat, 30 Dec 2006 23:33:57 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal
 completion
In-Reply-To: <1167480536.29620.286425.camel@hal.voltaire.com>
References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
	<1167480536.29620.286425.camel@hal.voltaire.com>
Message-ID: <4596DB45.5070108@mellanox.co.il>

Hal Rosenstock wrote:
> Hi Eitan,
>
> On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote:
>   
>> OSM Simulation Regression Summary
>> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
>> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
>> Total=405 Pass=330 Fail=75
>>
>> Pass:
>> 45 Stability IS1-16.topo
>> 45 Pkey IS1-16.topo
>> 45 OsmStress IS1-16.topo
>> 45 Multicast IS1-16.topo
>> 45 LidMgr IS1-16.topo
>> 15 Stability IS3-loop.topo
>> 15 Stability IS3-128.topo
>> 15 Pkey IS3-128.topo
>> 15 OsmStress IS3-128.topo
>> 15 Multicast IS3-loop.topo
>> 15 Multicast IS3-128.topo
>> 15 LidMgr IS3-128.topo
>>
>> Failures:
>> 45 OsmTest IS1-16.topo
>> 15 OsmTest IS3-loop.topo
>> 15 OsmTest IS3-128.topo
>>     
>
> Any idea on these osmtest failures ? I did add SA MFTRecord yesterday
> and made a change to SA LFTRecord and SwitchInfoRecord the day before as
> well as additional osmtests for MFTRecord and LFTRecord.
>   
Actually I get a core dump:
#0  0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, block_num=-32575, position=0 '\0', p_block=0xb19e4d2c)
    at osm_mcast_tbl.c:299
299         p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];

(gdb) p i
$1 = 2
(gdb) p mlid_start_ho
$2 = 6176
(gdb) p position
$3 = 0 '\0'
(gdb) where
#0  0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, block_num=-32575, position=0 '\0', p_block=0xb19e4d2c)
    at osm_mcast_tbl.c:299
#1  0x08073d29 in osm_switch_get_mft_block (p_sw=0x8f6eed8, block_num=32961, position=0 '\0', p_block=0xb19e4d2c)
    at ./../include/opensm/osm_switch.h:1074
#2  0x08073b8c in __osm_mftr_rcv_new_mftr (p_rcv=0x80e9a6c, p_sw=0x8f6eed8, p_list=0xb61c0370, lid=512, block=32961,
    position=0 '\0') at osm_sa_mft_record.c:181
#3  0x08074273 in __osm_mftr_rcv_by_comp_mask (p_map_item=0x8f6eed8, context=0xb61c0330) at osm_sa_mft_record.c:317
#4  0x00cd9747 in cl_qmap_apply_func (p_map=0x80e8584, pfn_func=0x8073f98 <__osm_mftr_rcv_by_comp_mask>, context=0xb61c0330)
    at cl_map.c:287
#5  0x08074653 in osm_mftr_rcv_process (p_rcv=0x80e9a6c, p_madw=0x8f29f0c) at osm_sa_mft_record.c:390
#6  0x08074ef2 in __osm_mftr_rcv_ctrl_disp_callback (context=0x80e9afc, p_data=0x8f29f0c) at osm_sa_mft_record_ctrl.c:63
#7  0x00cd3d4f in __cl_disp_worker (context=0x80e9d18) at cl_dispatcher.c:102
#8  0x00ce1297 in __cl_thread_pool_routine (context=0x80e9d5c) at cl_threadpool.c:74
#9  0x00ce0f61 in __cl_thread_wrapper (arg=0x8f1c690) at cl_thread.c:58
#10 0x00361371 in start_thread () from /lib/tls/libpthread.so.0
#11 0x001eaffe in clone () from /lib/tls/libc.so.6


> Also, why are osmtest failures allowed for "normal completion" ?
>
> -- Hal


From halr at voltaire.com  Sat Dec 30 14:25:00 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 30 Dec 2006 17:25:00 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-30:normal
 completion
In-Reply-To: <4596DB45.5070108@mellanox.co.il>
References: <200612300525.kBU5Pcr2016005@sw053.yok.mtl.com>
	<1167480536.29620.286425.camel@hal.voltaire.com>
	<4596DB45.5070108@mellanox.co.il>
Message-ID: <1167517497.29620.318774.camel@hal.voltaire.com>

On Sat, 2006-12-30 at 16:33, Eitan Zahavi wrote:
> Hal Rosenstock wrote:
> > Hi Eitan,
> >
> > On Sat, 2006-12-30 at 00:25, Eitan Zahavi wrote:
> >   
> >> OSM Simulation Regression Summary
> >> OpenSM rev = Fri_Dec_29_12:19:08_2006 2e0f81 
> >> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
> >> Total=405 Pass=330 Fail=75
> >>
> >> Pass:
> >> 45 Stability IS1-16.topo
> >> 45 Pkey IS1-16.topo
> >> 45 OsmStress IS1-16.topo
> >> 45 Multicast IS1-16.topo
> >> 45 LidMgr IS1-16.topo
> >> 15 Stability IS3-loop.topo
> >> 15 Stability IS3-128.topo
> >> 15 Pkey IS3-128.topo
> >> 15 OsmStress IS3-128.topo
> >> 15 Multicast IS3-loop.topo
> >> 15 Multicast IS3-128.topo
> >> 15 LidMgr IS3-128.topo
> >>
> >> Failures:
> >> 45 OsmTest IS1-16.topo
> >> 15 OsmTest IS3-loop.topo
> >> 15 OsmTest IS3-128.topo
> >>     
> >
> > Any idea on these osmtest failures ? I did add SA MFTRecord yesterday
> > and made a change to SA LFTRecord and SwitchInfoRecord the day before as
> > well as additional osmtests for MFTRecord and LFTRecord.
> >   
> Actually I get a core dump:

Thanks for providing this!

> #0  0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, 
> block_num=-32575, position=0 '\0', p_block=0xb19e4d2c)
>     at osm_mcast_tbl.c:299
> 299         p_block[i] = (*p_tbl->p_mask_tbl)[mlid_start_ho + i][position];
> 
> (gdb) p i
> $1 = 2
> (gdb) p mlid_start_ho
> $2 = 6176
> (gdb) p position
> $3 = 0 '\0'
> (gdb) where
> #0  0x0805c265 in osm_mcast_tbl_get_block (p_tbl=0x8f6ef6c, 
> block_num=-32575, position=0 '\0', p_block=0xb19e4d2c)
>     at osm_mcast_tbl.c:299
> #1  0x08073d29 in osm_switch_get_mft_block (p_sw=0x8f6eed8, 
> block_num=32961, position=0 '\0', p_block=0xb19e4d2c)
>     at ./../include/opensm/osm_switch.h:1074
> #2  0x08073b8c in __osm_mftr_rcv_new_mftr (p_rcv=0x80e9a6c, 
> p_sw=0x8f6eed8, p_list=0xb61c0370, lid=512, block=32961,
                                                    ^^^^^
The max block number is 511, so this is what caused the core dump.
I just checked in a patch for this which should work.
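
For reference, the kind of range check involved can be sketched as below.
This is only an illustration under stated assumptions, not the patch that was
checked in: MFT_MAX_BLOCK, mft_block_in_range(), and as_signed16() are
hypothetical names, and 511 is the maximum block number mentioned above. The
second helper shows why frame #0 prints block_num=-32575: that is what 32961
becomes when the same 16 bits are read as a signed value
(32961 - 65536 = -32575).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; 511 is the maximum block number quoted above. */
#define MFT_MAX_BLOCK 511u

/* Return true if block_num may be used to index the table. */
bool mft_block_in_range(uint16_t block_num)
{
	return block_num <= MFT_MAX_BLOCK;
}

/* Value of a 16-bit pattern when read as two's-complement signed. */
int32_t as_signed16(uint16_t v)
{
	return (v & 0x8000u) ? (int32_t)v - 65536 : (int32_t)v;
}
```

With such a check in place, a request carrying block 32961 would be rejected
before the table is indexed.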

-- Hal

>     position=0 '\0') at osm_sa_mft_record.c:181
> #3  0x08074273 in __osm_mftr_rcv_by_comp_mask (p_map_item=0x8f6eed8, 
> context=0xb61c0330) at osm_sa_mft_record.c:317
> #4  0x00cd9747 in cl_qmap_apply_func (p_map=0x80e8584, 
> pfn_func=0x8073f98 <__osm_mftr_rcv_by_comp_mask>, context=0xb61c0330)
>     at cl_map.c:287
> #5  0x08074653 in osm_mftr_rcv_process (p_rcv=0x80e9a6c, 
> p_madw=0x8f29f0c) at osm_sa_mft_record.c:390
> #6  0x08074ef2 in __osm_mftr_rcv_ctrl_disp_callback (context=0x80e9afc, 
> p_data=0x8f29f0c) at osm_sa_mft_record_ctrl.c:63
> #7  0x00cd3d4f in __cl_disp_worker (context=0x80e9d18) at 
> cl_dispatcher.c:102
> #8  0x00ce1297 in __cl_thread_pool_routine (context=0x80e9d5c) at 
> cl_threadpool.c:74
> #9  0x00ce0f61 in __cl_thread_wrapper (arg=0x8f1c690) at cl_thread.c:58
> #10 0x00361371 in start_thread () from /lib/tls/libpthread.so.0
> #11 0x001eaffe in clone () from /lib/tls/libc.so.6
> 
> 
> > Also, why are osmtest failures allowed for "normal completion" ?
> >
> > -- Hal
> >
> 


From gfiyasmer1 at verizon.com  Sat Dec 30 19:24:14 2006
From: gfiyasmer1 at verizon.com (=?windows-1255?Q?=E2=E9=EC?=)
Date: Sun, 31 Dec 2006 05:24:14 +0200
Subject: [openib-general] =?windows-1255?b?4OnqIOzk+OXl6ecgIDIwMDAgpCAg?=
	=?windows-1255?b?4efl4/kg7uTx7OXs4PjpIPns6iA/IQ==?=
Message-ID: <8d4865a43d10b7fc4769e83e001b8393@verizon.com>

An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20061231/e3fb0fb4/attachment.html>

From eitan at sw053.yok.mtl.com  Sat Dec 30 21:01:23 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Sun, 31 Dec 2006 07:01:23 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-31:normal completion
Message-ID: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Fri_Dec_29_16:01:04_2006 0ccdf3 
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
Total=378 Pass=308 Fail=70

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:
42 OsmTest IS1-16.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo


From dotanb at dev.mellanox.co.il  Sun Dec 31 01:50:52 2006
From: dotanb at dev.mellanox.co.il (dotanb at dev.mellanox.co.il)
Date: Sun, 31 Dec 2006 11:50:52 +0200 (IST)
Subject: [openib-general] [rdma_ucm] enabling the rdma_ucm and
 restarting the driver several times causes kernel oops
In-Reply-To: <4593FBD2.4000109@ichips.intel.com>
References: <459381DA.7030007@dev.mellanox.co.il>
	<4593FBD2.4000109@ichips.intel.com>
Message-ID: <1296.85.65.224.155.1167558652.squirrel@dev.mellanox.co.il>

> Dotan Barak wrote:
>> here is the backtrace from the /var/log/messages:
>> Dec 27 15:36:25 sw086 kernel: Unable to handle kernel NULL pointer
>> dereference at 0000000000000001 RIP:
>> Dec 27 15:36:25 sw086 kernel:  [<0000000000000001>]
>> Dec 27 15:36:25 sw086 kernel: PGD 11f4c3067 PUD 11fed7067 PMD 0
>> Dec 27 15:36:25 sw086 kernel: Oops: 0000 [1] SMP
>> Dec 27 15:36:25 sw086 kernel: CPU 1
>> Dec 27 15:36:25 sw086 kernel: Modules linked in: rdma_ucm ib_sdp rdma_cm
>> iw_cm ib_addr ib_ipoib ib_mthca ib_umad ib_ucm ib_u
>> verbs ib_cm ib_sa ib_mad ib_core nfsd exportfs ipv6 parport_pc lp
>> parport autofs4 nfs lockd nfs_acl sunrpc dm_mirror dm_mod
>> button battery asus_acpi ac uhci_hcd ehci_hcd i2c_i801 i2c_core tg3 sg
>> ext3 jbd sd_mod
>
> Can you narrow down which module unload is causing the issue?  Is anything
> using
> the rdma_ucm or ib_uverbs?  Is ib_sdp the first module unloaded?
>
> - Sean
>
Vlad: can you please tell us the order in which the modules are
loaded and unloaded?

As I wrote in the problem description, I only enabled the rdma_ucm
module and restarted the driver (without using this module or any
other module in the driver), and I got this kernel oops.

Dotan


From halr at voltaire.com  Sun Dec 31 04:36:18 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 Dec 2006 07:36:18 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-31:normal
	completion
In-Reply-To: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com>
References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com>
Message-ID: <1167568561.29620.364215.camel@hal.voltaire.com>

Hi Eitan,

On Sun, 2006-12-31 at 00:01, Eitan Zahavi wrote:
> OSM Simulation Regression Summary
> OpenSM rev = Fri_Dec_29_16:01:04_2006 0ccdf3 

How can you tell which git version is the latest one included? Is 0ccdf3
short for some SHA1 hash?

Was the latest opensm/osm_sa_mft_record.c change included in this ?

-- Hal

> ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
> Total=378 Pass=308 Fail=70
> 
> Pass:
> 42 Stability IS1-16.topo
> 42 Pkey IS1-16.topo
> 42 OsmStress IS1-16.topo
> 42 Multicast IS1-16.topo
> 42 LidMgr IS1-16.topo
> 14 Stability IS3-loop.topo
> 14 Stability IS3-128.topo
> 14 Pkey IS3-128.topo
> 14 OsmStress IS3-128.topo
> 14 Multicast IS3-loop.topo
> 14 Multicast IS3-128.topo
> 14 LidMgr IS3-128.topo
> 
> Failures:
> 42 OsmTest IS1-16.topo
> 14 OsmTest IS3-loop.topo
> 14 OsmTest IS3-128.topo


From mst at mellanox.co.il  Sun Dec 31 04:46:11 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 31 Dec 2006 14:46:11 +0200
Subject: [openib-general] nightly osm_sim report 2006-12-31:normal
	completion
In-Reply-To: <1167568561.29620.364215.camel@hal.voltaire.com>
References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com>
	<1167568561.29620.364215.camel@hal.voltaire.com>
Message-ID: <20061231124610.GC6083@mellanox.co.il>

> How can you tell which git version is the latest one included? Is 0ccdf3
> short for some SHA1 hash?
Try git show 0ccdf3

-- 
MST


From halr at voltaire.com  Sun Dec 31 04:54:59 2006
From: halr at voltaire.com (Hal Rosenstock)
Date: 31 Dec 2006 07:54:59 -0500
Subject: [openib-general] nightly osm_sim report 2006-12-31:normal
	completion
In-Reply-To: <20061231124610.GC6083@mellanox.co.il>
References: <200612310501.kBV51Nkj018487@sw053.yok.mtl.com>
	<1167568561.29620.364215.camel@hal.voltaire.com>
	<20061231124610.GC6083@mellanox.co.il>
Message-ID: <1167569697.29620.365179.camel@hal.voltaire.com>

On Sun, 2006-12-31 at 07:46, Michael S. Tsirkin wrote:
> > How can you tell which git version is the latest one included? Is 0ccdf3
> > short for some SHA1 hash?
> Try git show 0ccdf3

Thanks.

The potential fix is not included in the OpenSM build (as this commit
was at the end of Friday rather than Saturday).

When are the updates picked up into the build system ?

-- Hal


From mst at mellanox.co.il  Sun Dec 31 11:09:42 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Sun, 31 Dec 2006 21:09:42 +0200
Subject: [openib-general] [PATCH RFC] return qp pointer as part of ib_wc
Message-ID: <20061231190942.GB32485@mellanox.co.il>

ib_wc currently only includes the local QP number: this matches the IB spec,
but seems mostly useless. The following patch replaces this with the pointer
to qp itself, and updates all low level drivers and all users.

This has the following advantages:
- Ability to get a per-qp context through wc->qp->qp_context
- Existing drivers already have the qp pointer ready in poll cq, so
  this change actually saves a tiny bit (extra memory read) on data path
- We will be able to put NULL in there if some hardware does not support
  reporting the qp number (it is optional in IB spec) - no such option with qpn
- Users that need the QP number can still get it through wc->qp->qp_num.

Use case:

In IPoIB CM code, I have a common CQ shared by multiple QPs.
To track connection usage, I need a way to get at some per-QP context
upon the completion, and I would like to avoid allocating
context object per work request just to stick a QP pointer into it.
With this code, I can just use wc->qp->qp_context.
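
A minimal userspace sketch of that use case, for illustration only. The two structs are cut-down stand-ins for the kernel types (only the fields named above are modeled), and conn_state and handle_completion are hypothetical names, not part of this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-ins for the kernel structures. */
struct ib_qp {
	uint32_t qp_num;
	void    *qp_context;
};

struct ib_wc {
	uint64_t      wr_id;
	struct ib_qp *qp;	/* replaces the old u32 qp_num field */
};

/* Hypothetical per-connection state stored in qp_context. */
struct conn_state {
	unsigned long completions;
};

/* Hypothetical handler for a CQ shared by several QPs. */
static struct conn_state *handle_completion(const struct ib_wc *wc)
{
	struct conn_state *conn;

	if (!wc->qp)		/* hardware that cannot report the QP */
		return NULL;

	conn = wc->qp->qp_context;	/* no per-work-request allocation */
	if (conn)
		conn->completions++;
	return conn;
}
```

The QP number remains available to the handler as wc->qp->qp_num whenever wc->qp is not NULL.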

Note:

I don't know whether updating the userspace API in a similar way
is a good idea. We probably should wait for an actual user;
and keeping an extra object pointed to by WR ID might be
less of a problem there since virtual memory is cheap.
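
For comparison, a rough sketch of the alternative mentioned here, where the WR ID carries a pointer to an extra per-request object that records the owning QP. All type and function names below are hypothetical stand-ins, not the userspace verbs API.

```c
#include <stdint.h>

/* Stand-in for a QP handle. */
struct qp_stub {
	uint32_t qp_num;
};

/* Hypothetical per-work-request object, kept only to remember the QP. */
struct wr_ctx {
	struct qp_stub *qp;
};

/* Stand-in for a completion that reports only the QP number. */
struct wc_stub {
	uint64_t wr_id;
	uint32_t qp_num;
};

/* Encode the context pointer into the 64-bit WR ID when posting. */
static uint64_t wr_id_from_ctx(struct wr_ctx *ctx)
{
	return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context pointer from a polled completion. */
static struct wr_ctx *ctx_from_wc(const struct wc_stub *wc)
{
	return (struct wr_ctx *)(uintptr_t)wc->wr_id;
}
```

The cost of this approach is one wr_ctx per outstanding work request, which is the allocation the kernel-side change avoids.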

Signed-off-by: Michael S. Tsirkin <mst at mellanox.co.il>

---

Untested. Please comment.


diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 5ed141e..13efd41 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -642,7 +642,8 @@ static void snoop_recv(struct ib_mad_qp_info *qp_info,
 	spin_unlock_irqrestore(&qp_info->snoop_lock, flags);
 }
 
-static void build_smp_wc(u64 wr_id, u16 slid, u16 pkey_index, u8 port_num,
+static void build_smp_wc(struct ib_qp *qp,
+			 u64 wr_id, u16 slid, u16 pkey_index, u8 port_num,
 			 struct ib_wc *wc)
 {
 	memset(wc, 0, sizeof *wc);
@@ -652,7 +653,7 @@ static void build_smp_wc(u64 wr_id, u16 slid, u16 pkey_index, u8 port_num,
 	wc->pkey_index = pkey_index;
 	wc->byte_len = sizeof(struct ib_mad) + sizeof(struct ib_grh);
 	wc->src_qp = IB_QP0;
-	wc->qp_num = IB_QP0;
+	wc->qp = qp;
 	wc->slid = slid;
 	wc->sl = 0;
 	wc->dlid_path_bits = 0;
@@ -713,7 +714,8 @@ static int handle_outgoing_dr_smp(struct ib_mad_agent_private *mad_agent_priv,
 		goto out;
 	}
 
-	build_smp_wc(send_wr->wr_id, be16_to_cpu(smp->dr_slid),
+	build_smp_wc(mad_agent_priv->agent.qp,
+		     send_wr->wr_id, be16_to_cpu(smp->dr_slid),
 		     send_wr->wr.ud.pkey_index,
 		     send_wr->wr.ud.port_num, &mad_wc);
 
@@ -2355,7 +2357,8 @@ static void local_completions(struct work_struct *work)
 			 * Defined behavior is to complete response
 			 * before request
 			 */
-			build_smp_wc((unsigned long) local->mad_send_wr,
+			build_smp_wc(recv_mad_agent->agent.qp,
+				     (unsigned long) local->mad_send_wr,
 				     be16_to_cpu(IB_LID_PERMISSIVE),
 				     0, recv_mad_agent->agent.port_num, &wc);
 
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 743247e..df1efbc 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -933,7 +933,7 @@ ssize_t ib_uverbs_poll_cq(struct ib_uverbs_file *file,
 		resp->wc[i].vendor_err 	   = wc[i].vendor_err;
 		resp->wc[i].byte_len 	   = wc[i].byte_len;
 		resp->wc[i].imm_data 	   = (__u32 __force) wc[i].imm_data;
-		resp->wc[i].qp_num 	   = wc[i].qp_num;
+		resp->wc[i].qp_num 	   = wc[i].qp->qp_num;
 		resp->wc[i].src_qp 	   = wc[i].src_qp;
 		resp->wc[i].wc_flags 	   = wc[i].wc_flags;
 		resp->wc[i].pkey_index 	   = wc[i].pkey_index;
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 05c9154..5175c99 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -153,7 +153,7 @@ static inline int c2_poll_one(struct c2_dev *c2dev,
 
 	entry->status = c2_cqe_status_to_openib(c2_wr_get_result(ce));
 	entry->wr_id = ce->hdr.context;
-	entry->qp_num = ce->handle;
+	entry->qp = &qp->ibqp;
 	entry->wc_flags = 0;
 	entry->slid = 0;
 	entry->sl = 0;
diff --git a/drivers/infiniband/hw/ehca/ehca_reqs.c b/drivers/infiniband/hw/ehca/ehca_reqs.c
index b46bda1..40e39ff 100644
--- a/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ b/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -579,7 +579,7 @@ poll_cq_one_read_cqe:
 	} else
 		wc->status = IB_WC_SUCCESS;
 
-	wc->qp_num = cqe->local_qp_number;
+	wc->qp = &qp->ibqp;
 	wc->byte_len = cqe->nr_bytes_transferred;
 	wc->pkey_index = cqe->pkey_index;
 	wc->slid = cqe->rlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 46c1c89..64f07b1 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -379,7 +379,7 @@ void ipath_error_qp(struct ipath_qp *qp, enum ib_wc_status err)
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_rc.c b/drivers/infiniband/hw/ipath/ipath_rc.c
index ce60387..5ff20cb 100644
--- a/drivers/infiniband/hw/ipath/ipath_rc.c
+++ b/drivers/infiniband/hw/ipath/ipath_rc.c
@@ -702,7 +702,7 @@ void ipath_restart_rc(struct ipath_qp *qp, u32 psn, struct ib_wc *wc)
 		wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 		wc->vendor_err = 0;
 		wc->byte_len = 0;
-		wc->qp_num = qp->ibqp.qp_num;
+		wc->qp = &qp->ibqp;
 		wc->src_qp = qp->remote_qpn;
 		wc->pkey_index = 0;
 		wc->slid = qp->remote_ah_attr.dlid;
@@ -836,7 +836,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode)
 			wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 			wc.vendor_err = 0;
 			wc.byte_len = wqe->length;
-			wc.qp_num = qp->ibqp.qp_num;
+			wc.qp = &qp->ibqp;
 			wc.src_qp = qp->remote_qpn;
 			wc.pkey_index = 0;
 			wc.slid = qp->remote_ah_attr.dlid;
@@ -951,7 +951,7 @@ static int do_rc_ack(struct ipath_qp *qp, u32 aeth, u32 psn, int opcode)
 			wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 			wc.vendor_err = 0;
 			wc.byte_len = 0;
-			wc.qp_num = qp->ibqp.qp_num;
+			wc.qp = &qp->ibqp;
 			wc.src_qp = qp->remote_qpn;
 			wc.pkey_index = 0;
 			wc.slid = qp->remote_ah_attr.dlid;
@@ -1511,7 +1511,7 @@ void ipath_rc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		wc.status = IB_WC_SUCCESS;
 		wc.opcode = IB_WC_RECV;
 		wc.vendor_err = 0;
-		wc.qp_num = qp->ibqp.qp_num;
+		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;
 		wc.pkey_index = 0;
 		wc.slid = qp->remote_ah_attr.dlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_ruc.c b/drivers/infiniband/hw/ipath/ipath_ruc.c
index f753051..e86cb17 100644
--- a/drivers/infiniband/hw/ipath/ipath_ruc.c
+++ b/drivers/infiniband/hw/ipath/ipath_ruc.c
@@ -137,7 +137,7 @@ bad_lkey:
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
@@ -336,7 +336,7 @@ again:
 			wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 			wc.vendor_err = 0;
 			wc.byte_len = 0;
-			wc.qp_num = sqp->ibqp.qp_num;
+			wc.qp = &sqp->ibqp;
 			wc.src_qp = sqp->remote_qpn;
 			wc.pkey_index = 0;
 			wc.slid = sqp->remote_ah_attr.dlid;
@@ -426,7 +426,7 @@ again:
 	wc.status = IB_WC_SUCCESS;
 	wc.vendor_err = 0;
 	wc.byte_len = wqe->length;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = qp->remote_qpn;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc.pkey_index = 0;
@@ -447,7 +447,7 @@ send_comp:
 		wc.opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 		wc.vendor_err = 0;
 		wc.byte_len = wqe->length;
-		wc.qp_num = sqp->ibqp.qp_num;
+		wc.qp = &sqp->ibqp;
 		wc.src_qp = 0;
 		wc.pkey_index = 0;
 		wc.slid = 0;
diff --git a/drivers/infiniband/hw/ipath/ipath_uc.c b/drivers/infiniband/hw/ipath/ipath_uc.c
index e636cfd..325d663 100644
--- a/drivers/infiniband/hw/ipath/ipath_uc.c
+++ b/drivers/infiniband/hw/ipath/ipath_uc.c
@@ -49,7 +49,7 @@ static void complete_last_send(struct ipath_qp *qp, struct ipath_swqe *wqe,
 		wc->opcode = ib_ipath_wc_opcode[wqe->wr.opcode];
 		wc->vendor_err = 0;
 		wc->byte_len = wqe->length;
-		wc->qp_num = qp->ibqp.qp_num;
+		wc->qp = &qp->ibqp;
 		wc->src_qp = qp->remote_qpn;
 		wc->pkey_index = 0;
 		wc->slid = qp->remote_ah_attr.dlid;
@@ -411,7 +411,7 @@ void ipath_uc_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 		wc.status = IB_WC_SUCCESS;
 		wc.opcode = IB_WC_RECV;
 		wc.vendor_err = 0;
-		wc.qp_num = qp->ibqp.qp_num;
+		wc.qp = &qp->ibqp;
 		wc.src_qp = qp->remote_qpn;
 		wc.pkey_index = 0;
 		wc.slid = qp->remote_ah_attr.dlid;
diff --git a/drivers/infiniband/hw/ipath/ipath_ud.c b/drivers/infiniband/hw/ipath/ipath_ud.c
index 49f1102..9a3e546 100644
--- a/drivers/infiniband/hw/ipath/ipath_ud.c
+++ b/drivers/infiniband/hw/ipath/ipath_ud.c
@@ -66,7 +66,7 @@ bad_lkey:
 	wc.vendor_err = 0;
 	wc.byte_len = 0;
 	wc.imm_data = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = 0;
 	wc.wc_flags = 0;
 	wc.pkey_index = 0;
@@ -255,7 +255,7 @@ static void ipath_ud_loopback(struct ipath_qp *sqp,
 	wc->status = IB_WC_SUCCESS;
 	wc->opcode = IB_WC_RECV;
 	wc->vendor_err = 0;
-	wc->qp_num = qp->ibqp.qp_num;
+	wc->qp = &qp->ibqp;
 	wc->src_qp = sqp->ibqp.qp_num;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc->pkey_index = 0;
@@ -474,7 +474,7 @@ done:
 		wc.vendor_err = 0;
 		wc.opcode = IB_WC_SEND;
 		wc.byte_len = len;
-		wc.qp_num = qp->ibqp.qp_num;
+		wc.qp = &qp->ibqp;
 		wc.src_qp = 0;
 		wc.wc_flags = 0;
 		/* XXX initialize other fields? */
@@ -651,7 +651,7 @@ void ipath_ud_rcv(struct ipath_ibdev *dev, struct ipath_ib_header *hdr,
 	wc.status = IB_WC_SUCCESS;
 	wc.opcode = IB_WC_RECV;
 	wc.vendor_err = 0;
-	wc.qp_num = qp->ibqp.qp_num;
+	wc.qp = &qp->ibqp;
 	wc.src_qp = src_qp;
 	/* XXX do we know which pkey matched? Only needed for GSI. */
 	wc.pkey_index = 0;
diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 768df72..968d151 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -1854,7 +1854,7 @@ int mthca_MAD_IFC(struct mthca_dev *dev, int ignore_mkey, int ignore_bkey,
 
 		memset(inbox + 256, 0, 256);
 
-		MTHCA_PUT(inbox, in_wc->qp_num,     MAD_IFC_MY_QPN_OFFSET);
+		MTHCA_PUT(inbox, in_wc->qp->qp_num, MAD_IFC_MY_QPN_OFFSET);
 		MTHCA_PUT(inbox, in_wc->src_qp,     MAD_IFC_RQPN_OFFSET);
 
 		val = in_wc->sl << 4;
diff --git a/drivers/infiniband/hw/mthca/mthca_cq.c b/drivers/infiniband/hw/mthca/mthca_cq.c
index 283d50b..5862411 100644
--- a/drivers/infiniband/hw/mthca/mthca_cq.c
+++ b/drivers/infiniband/hw/mthca/mthca_cq.c
@@ -530,7 +530,7 @@ static inline int mthca_poll_one(struct mthca_dev *dev,
 		}
 	}
 
-	entry->qp_num = (*cur_qp)->qpn;
+	entry->qp = &(*cur_qp)->ibqp;
 
 	if (is_send) {
 		wq = &(*cur_qp)->sq;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 0bfa332..54cde37 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -419,8 +419,8 @@ struct ib_wc {
 	enum ib_wc_opcode	opcode;
 	u32			vendor_err;
 	u32			byte_len;
+	struct ib_qp	       *qp;
 	__be32			imm_data;
-	u32			qp_num;
 	u32			src_qp;
 	int			wc_flags;
 	u16			pkey_index;

-- 
MST


From eitan at sw053.yok.mtl.com  Sun Dec 31 21:31:14 2006
From: eitan at sw053.yok.mtl.com (Eitan Zahavi)
Date: Mon, 1 Jan 2007 07:31:14 +0200
Subject: [openib-general] nightly osm_sim report 2007-01-01:normal completion
Message-ID: <200701010531.l015VExN007699@sw053.yok.mtl.com>

OSM Simulation Regression Summary
OpenSM rev = Sat_Dec_30_17:20:32_2006 000033 
ibutils rev = Wed_Dec_27_23:39:30_2006 60aebe 
Total=378 Pass=308 Fail=70

Pass:
42 Stability IS1-16.topo
42 Pkey IS1-16.topo
42 OsmStress IS1-16.topo
42 Multicast IS1-16.topo
42 LidMgr IS1-16.topo
14 Stability IS3-loop.topo
14 Stability IS3-128.topo
14 Pkey IS3-128.topo
14 OsmStress IS3-128.topo
14 Multicast IS3-loop.topo
14 Multicast IS3-128.topo
14 LidMgr IS3-128.topo

Failures:
42 OsmTest IS1-16.topo
14 OsmTest IS3-loop.topo
14 OsmTest IS3-128.topo


From mst at mellanox.co.il  Sun Dec 31 23:03:15 2006
From: mst at mellanox.co.il (Michael S. Tsirkin)
Date: Mon, 1 Jan 2007 09:03:15 +0200
Subject: [openib-general] v2.6.20-rc2 merged into ofed 1.2
Message-ID: <20070101070315.GB25691@mellanox.co.il>

Upstream v2.6.20-rc2 has been merged into ofed 1.2 branch.
There have been no API changes since -rc1, so no backports
need to be updated.

-- 
MST